Commit 48335fa

feat: add extra_fields support for custom CSV columns beyond OHLCV (#25)
Allow scripts to access additional CSV columns beyond standard OHLCV data via lib.extra_fields dict, which is updated each bar by ScriptRunner.

- DataConverter generates .extra.csv sidecar file during CSV-to-OHLCV conversion, position-aligned with the binary file (including gaps)
- OHLCVReader loads sidecar automatically and attaches extra_fields to each OHLCV record
- LibrarySeriesTransformer excludes extra_fields from Series conversion (NON_SERIES_LIB_ATTRS) so dict subscript access works correctly
- Add documentation and 4 tests covering end-to-end usage, Series history indexing, sidecar generation, and backward compatibility
1 parent a4f0553 commit 48335fa

10 files changed: +430 -9 lines changed


docs/advanced/README.md

Lines changed: 1 addition & 0 deletions
@@ -23,3 +23,4 @@ Advanced topics and features of PyneCore
 - [Function Isolation](./function-isolation.md) - Function isolation implementation
 - [OHLCV Reader/Writer](./ohlcv-reader-writer.md) - OHLCV data handling
 - [CSV Reader/Writer](./csv-reader-writer.md) - Fast CSV processing
+- [Extra Fields](./extra-fields.md) - Custom CSV columns beyond OHLCV in scripts

docs/advanced/csv-reader-writer.md

Lines changed: 1 addition & 1 deletion
@@ -252,7 +252,7 @@ When working with financial data in PyneCore, you have two main options:
 - Data interchange with other systems
 - Human inspection and editing
 - Flexible schema requirements
-- When additional fields beyond OHLCV are needed
+- When additional fields beyond OHLCV are needed (see [Extra Fields](./extra-fields.md))

 Both systems are designed for performance while maintaining pure Python implementation, aligning with the PyneCore project vision.

docs/advanced/extra-fields.md

Lines changed: 127 additions & 0 deletions
@@ -0,0 +1,127 @@
<!--
---
weight: 1005
title: "Extra Fields"
description: "Accessing custom CSV columns beyond OHLCV data in Pyne scripts"
icon: "playlist_add"
date: "2025-03-31"
lastmod: "2026-03-15"
draft: false
toc: true
categories: ["Advanced", "Data Handling"]
tags: ["extra-fields", "csv", "custom-data", "series", "data"]
---
-->

# Extra Fields

PyneCore allows you to access additional columns beyond standard OHLCV data from your CSV files inside Pyne scripts. This is useful when your data includes pre-computed indicators, signals, or any other custom data that you want to use alongside price data.

## How It Works

When a CSV file contains columns beyond the standard OHLCV fields (`timestamp`, `open`, `high`, `low`, `close`, `volume`), PyneCore automatically makes them available through `extra_fields`, a dictionary that is updated on each bar with the current row's extra column values.

The data flow depends on how you run your script:

### Binary OHLCV Path (workdir)

When running from a workdir with `pyne run`, PyneCore converts CSV to binary `.ohlcv` format. Since the binary format only stores OHLCV data, extra columns are saved to a **sidecar file** (`.extra.csv`) that is position-aligned with the binary data:

```
workdir/data/my_data/
  EURUSD_1h.csv          # Source: OHLCV + extra columns
  EURUSD_1h.ohlcv        # Binary OHLCV (auto-generated)
  EURUSD_1h.toml         # Symbol metadata (auto-generated)
  EURUSD_1h.extra.csv    # Extra columns only (auto-generated)
```

The sidecar file is generated and regenerated automatically whenever the source CSV is converted. You never need to create or edit it manually.
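
As a small illustration of the naming convention, the sidecar path can be derived from either the source CSV or the binary file with `pathlib`. This mirrors the `with_suffix('.extra.csv')` call that appears in the commit; the helper name `sidecar_path` is hypothetical and not part of PyneCore's API.

```python
from pathlib import Path


def sidecar_path(data_path: Path) -> Path:
    """Return the .extra.csv path that sits next to a .csv or .ohlcv file.

    Hypothetical helper for illustration only.
    """
    # with_suffix() replaces only the last suffix, so both
    # EURUSD_1h.csv and EURUSD_1h.ohlcv map to EURUSD_1h.extra.csv
    return data_path.with_suffix(".extra.csv")


csv_file = Path("workdir/data/my_data/EURUSD_1h.csv")
ohlcv_file = Path("workdir/data/my_data/EURUSD_1h.ohlcv")

print(sidecar_path(csv_file).name)
print(sidecar_path(csv_file) == sidecar_path(ohlcv_file))
```

Because both files resolve to the same sidecar name, the converter and the reader can each locate it without sharing any state.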

### Direct CSV Path (CSVReader / standalone)

When reading CSV directly (e.g., via `CSVReader` or standalone execution), extra columns are parsed inline, so no sidecar file is needed.

## Usage in Scripts

Access extra fields through `lib.extra_fields`, which is a `dict[str, Any]` updated each bar:

```python
"""
@pyne
"""
from pynecore import Series
from pynecore.lib import script, ta, close, extra_fields, plot


@script.indicator(title="Extra Fields Example", overlay=True)
def main():
    # Access extra columns as Series by annotating with Series[T]
    rsi: Series[float] = extra_fields["rsi"]
    signal: Series[str] = extra_fields["signal"]

    # Series indexing works: access previous bars
    prev_rsi = rsi[1]   # Previous bar's RSI value
    rsi_2_ago = rsi[2]  # RSI from 2 bars ago

    # Use with built-in functions like any other Series
    rsi_sma = ta.sma(rsi, 14)

    # Use string fields for conditional logic
    if signal[0] == "buy":
        plot(close, "Buy Signal", linewidth=2)
```

### Key Points

- **Type annotation creates the Series**: Writing `rsi: Series[float] = extra_fields["rsi"]` makes `rsi` a proper Series with history. The `extra_fields["rsi"]` part just returns the current bar's value (a plain `float`).
- **Supported types**: `float`, `int`, `str`, and `bool`. The type is detected automatically from the CSV data. On the binary OHLCV + sidecar path, the reader added in this commit parses numeric columns as `float` and leaves every other column as `str`.
- **Missing values**: Empty cells in the CSV appear as an empty string (`''`) when read via CSVReader. On the binary OHLCV + sidecar path they become `NaN` in numeric columns and stay `''` in string columns.
- **No AST magic needed**: The standard Series annotation mechanism handles everything; there is no special treatment for `extra_fields` in the AST transformers.
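
The type detection and missing-value behavior on the sidecar path can be sketched in isolation. The function below is a simplified, standalone approximation of the loader shown in this commit, not the library code itself: it assumes a column is numeric when its first non-empty value parses as `float`, and the name `parse_extra_rows` is hypothetical.

```python
from __future__ import annotations

import csv
import io
import math


def parse_extra_rows(text: str) -> list[dict[str, float | str]]:
    """Parse sidecar-style CSV text into one dict per row (illustrative sketch)."""
    reader = csv.reader(io.StringIO(text))
    headers = next(reader)
    rows = list(reader)

    missing = {"", "nan", "na"}
    numeric = [False] * len(headers)  # columns with no usable value stay strings
    for col in range(len(headers)):
        for row in rows:
            val = row[col] if col < len(row) else ""
            if val.lower() in missing:
                continue
            try:
                float(val)
                numeric[col] = True
            except ValueError:
                numeric[col] = False
            break  # only the first non-empty value decides the column type

    parsed: list[dict[str, float | str]] = []
    for row in rows:
        record: dict[str, float | str] = {}
        for col, header in enumerate(headers):
            val = row[col] if col < len(row) else ""
            if numeric[col]:
                # Missing numeric cells become NaN
                record[header] = float("nan") if val.lower() in missing else float(val)
            else:
                # String cells are passed through unchanged, including ''
                record[header] = val
        parsed.append(record)
    return parsed


rows = parse_extra_rows("rsi,signal\n45.2,buy\n,\n38.7,sell\n")
print(rows[0])
print(math.isnan(rows[1]["rsi"]), repr(rows[1]["signal"]))
```

One consequence of deciding the type from the first non-empty value is that a column mixing numbers and text will fail to parse later rows, so extra columns are best kept homogeneous.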

## CSV Format

Your source CSV simply includes extra columns alongside the standard OHLCV columns:

```csv
timestamp,open,high,low,close,volume,rsi,signal,custom_price
2024-01-01T00:00:00,100.0,105.0,95.0,102.0,1000,45.2,buy,99.5
2024-01-01T01:00:00,102.0,108.0,100.0,106.0,1200,52.1,,101.3
2024-01-01T02:00:00,106.0,110.0,104.0,108.0,800,38.7,sell,
```

The following column names are recognized as standard OHLCV and will **not** appear in `extra_fields`:

| Recognized OHLCV columns                               |
|--------------------------------------------------------|
| `timestamp`, `time`, `date`, `datetime`                |
| `open`, `high`, `low`, `close`, `volume`               |

Any other column name is treated as an extra field.
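
A minimal sketch of this classification, assuming the matching is case-insensitive and ignores surrounding whitespace as in the sidecar generation code of this commit. The function name `extra_columns` is hypothetical.

```python
from __future__ import annotations

# Same names as the recognized OHLCV columns listed above
OHLCV_COLUMNS = {
    "timestamp", "time", "date", "datetime",
    "open", "high", "low", "close", "volume",
}


def extra_columns(headers: list[str]) -> list[str]:
    """Return the header names that would be treated as extra fields."""
    # Compare on the lowercased, stripped name, but keep the original
    # spelling (minus surrounding whitespace) for the extra field key.
    return [h.strip() for h in headers if h.strip().lower() not in OHLCV_COLUMNS]


headers = ["Timestamp", "Open", "High", "Low", "Close", "Volume", "rsi", " Signal "]
print(extra_columns(headers))  # → ['rsi', 'Signal']
```

Note that the key keeps its original casing, so a column written as `Signal` in the CSV would be read as `extra_fields["Signal"]` under this assumption.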

## Sidecar File Format

The auto-generated `.extra.csv` file contains only the extra columns, with rows aligned 1:1 to the binary `.ohlcv` file (including gap-filled rows):

```csv
rsi,signal,custom_price
45.2,buy,99.5
52.1,,101.3
38.7,sell,
,,
,,
42.0,hold,100.0
```

Empty rows correspond to gap-filled bars in the OHLCV binary (bars with `volume = -1`).
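
The alignment can be sketched as a single pass over the binary records: gap-filled positions receive an empty row, and every real position consumes the next source row. This is a simplified stand-in for the converter logic in this commit; `align_extra_rows` and its list-based inputs are hypothetical, chosen so the example runs without PyneCore.

```python
from __future__ import annotations


def align_extra_rows(
    volumes: list[float],
    source_rows: list[list[str]],
    n_cols: int,
) -> list[list[str]]:
    """Place source rows at non-gap positions; emit empty rows for gaps."""
    remaining = iter(source_rows)
    aligned: list[list[str]] = []
    for volume in volumes:
        if volume < 0:
            # Gap-filled bar: no source row exists for this position
            aligned.append([""] * n_cols)
        else:
            # Real bar: take the next source row, or an empty row if exhausted
            aligned.append(next(remaining, [""] * n_cols))
    return aligned


# Volumes as they would appear in the binary file; -1 marks gap-filled bars
volumes = [1000, 1200, 800, -1, -1, 950]
source_rows = [
    ["45.2", "buy", "99.5"],
    ["52.1", "", "101.3"],
    ["38.7", "sell", ""],
    ["42.0", "hold", "100.0"],
]
for row in align_extra_rows(volumes, source_rows, n_cols=3):
    print(",".join(row))
```

With these inputs the printed rows match the sidecar sample above, including the two empty `,,` rows for the gap.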

## Limitations

- **Binary format unchanged**: The `.ohlcv` binary format remains fixed at 24 bytes per record. Extra fields are stored separately in the sidecar CSV.
- **JSON source files**: Extra field extraction is currently supported for CSV and TXT source formats, not JSON.
- **Memory**: The sidecar is loaded entirely into memory when opening the OHLCV file. For typical datasets (up to a few hundred thousand bars with a handful of extra columns), this is negligible.
- **Not available from providers**: Data download providers (e.g., CCXT, TradingView) produce standard OHLCV data only. Extra fields are for user-provided CSV data.

## See Also

- [OHLCV Reader/Writer](./ohlcv-reader-writer.md) - Binary OHLCV format details
- [CSV Reader/Writer](./csv-reader-writer.md) - CSV processing internals

src/pynecore/core/data_converter.py

Lines changed: 100 additions & 6 deletions
@@ -6,6 +6,7 @@
 """
 from __future__ import annotations

+import csv
 import json
 from enum import Enum
 from datetime import time
@@ -172,6 +173,14 @@ def convert_to_ohlcv(
             # Copy modification time from source to maintain freshness
             copy_mtime(file_path, ohlcv_path)

+            # Generate extra fields sidecar CSV if source has extra columns
+            extra_csv_path = file_path.with_suffix('.extra.csv')
+            if detected_format in ('csv', 'txt'):
+                self._generate_extra_csv(file_path, ohlcv_path, extra_csv_path,
+                                         detected_format == 'txt')
+                if extra_csv_path.exists():
+                    copy_mtime(file_path, extra_csv_path)
+
             # Generate TOML symbol info file if needed and not already loaded
             # skip_toml_generation is set earlier if TOML already exists (line 129-134)
             if symbol and not skip_toml_generation and (force or not toml_path.exists()):
@@ -239,14 +248,99 @@ def convert_to_ohlcv(
                     pass

         except Exception as e:
-            # Clean up output file on error
-            if ohlcv_path.exists():
-                try:
-                    ohlcv_path.unlink()
-                except OSError:
-                    pass
+            # Clean up output files on error
+            for cleanup_path in (ohlcv_path, file_path.with_suffix('.extra.csv')):
+                if cleanup_path.exists():
+                    try:
+                        cleanup_path.unlink()
+                    except OSError:
+                        pass
             raise ConversionError(f"Failed to convert {file_path}: {e}") from e

+    # Column names that are part of standard OHLCV data (not extra fields)
+    _OHLCV_COLUMNS = {
+        'timestamp', 'time', 'date', 'datetime',
+        'open', 'high', 'low', 'close', 'volume',
+    }
+
+    def _generate_extra_csv(
+            self,
+            source_path: Path,
+            ohlcv_path: Path,
+            extra_csv_path: Path,
+            is_txt: bool = False
+    ) -> None:
+        """
+        Generate a sidecar .extra.csv file with non-OHLCV columns from the source data.
+        The sidecar is position-aligned with the binary OHLCV file (including gap-filled rows).
+
+        :param source_path: Path to the original CSV/TXT file
+        :param ohlcv_path: Path to the generated binary OHLCV file
+        :param extra_csv_path: Path for the output sidecar CSV
+        :param is_txt: True if source is TXT format (auto-detect delimiter)
+        """
+        # Detect delimiter for TXT files
+        delimiter = ','
+        if is_txt:
+            with open(source_path, 'r') as f:
+                first_line = f.readline().strip()
+            for delim in ['\t', ';', '|']:
+                if delim in first_line:
+                    delimiter = delim
+                    break
+
+        # Read source headers and identify extra columns
+        with open(source_path, 'r', newline='') as f:
+            reader = csv.reader(f, delimiter=delimiter)
+            raw_headers = next(reader, None)
+            if not raw_headers:
+                return
+
+            headers_lower = [h.lower().strip() for h in raw_headers]
+            extra_indices = [
+                i for i, h in enumerate(headers_lower)
+                if h not in self._OHLCV_COLUMNS
+            ]
+
+            if not extra_indices:
+                return
+
+            extra_headers = [raw_headers[i].strip() for i in extra_indices]
+
+            # Collect extra values from all source rows (in order)
+            source_extra_rows: list[list[str]] = []
+            for row in reader:
+                if is_txt:
+                    row = [field.strip() for field in row]
+                extra_row = [row[i] if i < len(row) else '' for i in extra_indices]
+                source_extra_rows.append(extra_row)
+
+        if not source_extra_rows:
+            return
+
+        # Align with OHLCV binary (which may have gap-filled rows)
+        with OHLCVReader(ohlcv_path) as ohlcv_reader:
+            total_positions = ohlcv_reader.size
+            empty_row = [''] * len(extra_headers)
+            source_idx = 0
+
+            with open(extra_csv_path, 'w', newline='') as out_f:
+                writer = csv.writer(out_f)
+                writer.writerow(extra_headers)
+
+                for pos in range(total_positions):
+                    ohlcv = ohlcv_reader.read(pos)
+                    if ohlcv.volume < 0:
+                        # Gap-filled row: write empty values
+                        writer.writerow(empty_row)
+                    else:
+                        # Real data row: consume next source row
+                        if source_idx < len(source_extra_rows):
+                            writer.writerow(source_extra_rows[source_idx])
+                            source_idx += 1
+                        else:
+                            writer.writerow(empty_row)
+
     @staticmethod
     def detect_format(file_path: Path) -> Literal['csv', 'txt', 'json', 'ohlcv', 'unknown']:
         """

src/pynecore/core/ohlcv_file.py

Lines changed: 63 additions & 2 deletions
@@ -1365,7 +1365,8 @@ class OHLCVReader:
     Very fast OHLCV data reader using memory mapping.
     """

-    __slots__ = ('path', '_file', '_mmap', '_size', '_start_timestamp', '_interval')
+    __slots__ = ('path', '_file', '_mmap', '_size', '_start_timestamp', '_interval',
+                 '_extra_data', '_extra_headers')

     def __init__(self, path: str | Path):
         self.path = str(path)
@@ -1374,6 +1375,8 @@ def __init__(self, path: str | Path):
         self._size = 0
         self._start_timestamp = None
         self._interval = None
+        self._extra_data: list[dict[str, int | float | str]] | None = None
+        self._extra_headers: list[str] | None = None

     def __enter__(self):
         self.open()
@@ -1467,8 +1470,59 @@ def open(self) -> 'OHLCVReader':
             second_timestamp = struct.unpack('I', cast(Buffer, self._mmap[RECORD_SIZE:RECORD_SIZE + 4]))[0]
             self._interval = second_timestamp - self._start_timestamp

+        self._load_extra_csv()
+
         return self

+    def _load_extra_csv(self) -> None:
+        """
+        Load extra fields from sidecar .extra.csv file if it exists.
+        The sidecar is position-aligned with the binary OHLCV file.
+        """
+        extra_path = Path(self.path).with_suffix('.extra.csv')
+        if not extra_path.exists():
+            return
+
+        with open(extra_path, 'r', newline='') as f:
+            reader = csv.reader(f)
+            headers = next(reader, None)
+            if not headers:
+                return
+
+            self._extra_headers = headers
+
+            # Detect column types from first non-empty data row
+            rows_raw: list[list[str]] = []
+            col_is_numeric: list[bool | None] = [None] * len(headers)
+
+            for row in reader:
+                rows_raw.append(row)
+                for i, val in enumerate(row):
+                    if col_is_numeric[i] is None and val and val.lower() not in ('', 'nan', 'na'):
+                        try:
+                            float(val)
+                            col_is_numeric[i] = True
+                        except ValueError:
+                            col_is_numeric[i] = False
+
+        # Default undetected columns to string
+        col_is_numeric = [v if v is not None else False for v in col_is_numeric]
+
+        # Parse all rows with detected types
+        self._extra_data = []
+        for row in rows_raw:
+            parsed: dict[str, int | float | str] = {}
+            for i, header in enumerate(headers):
+                val = row[i] if i < len(row) else ''
+                if col_is_numeric[i]:
+                    if not val or val.lower() in ('nan', 'na', ''):
+                        parsed[header] = float('nan')
+                    else:
+                        parsed[header] = float(val)
+                else:
+                    parsed[header] = val
+            self._extra_data.append(parsed)
+
     def __iter__(self) -> Iterator[OHLCV]:
         """
         Iterate through all candles
@@ -1487,7 +1541,12 @@ def read(self, position: int) -> OHLCV:

         offset = position * RECORD_SIZE
         data = struct.unpack(STRUCT_FORMAT, self._mmap[offset:offset + RECORD_SIZE])
-        return OHLCV(*data, extra_fields={})
+
+        extra = {}
+        if self._extra_data is not None and position < len(self._extra_data):
+            extra = self._extra_data[position]
+
+        return OHLCV(*data, extra_fields=extra)

     def read_from(self, start_timestamp: int, end_timestamp: int | None = None, skip_gaps: bool = True) \
             -> Iterator[OHLCV]:
@@ -1524,6 +1583,8 @@ def close(self):
         if self._file:
             self._file.close()
             self._file = None
+        self._extra_data = None
+        self._extra_headers = None

     def get_positions(self, start_timestamp: int | None = None, end_timestamp: int | None = None) -> tuple[int, int]:
         """

src/pynecore/core/script_runner.py

Lines changed: 2 additions & 0 deletions
@@ -98,6 +98,7 @@ def _set_lib_properties(ohlcv: OHLCV, bar_index: int, tz: 'ZoneInfo', lib: Modul
     lib.close = _round_price(ohlcv.close)

     lib.volume = ohlcv.volume
+    lib.extra_fields = ohlcv.extra_fields if ohlcv.extra_fields else {}

     lib.hl2 = (lib.high + lib.low) / 2.0
     lib.hlc3 = (lib.high + lib.low + lib.close) / 3.0
@@ -162,6 +163,7 @@ def _reset_lib_vars(lib: ModuleType):
     lib._time = 0
     lib._datetime = datetime.fromtimestamp(0, UTC)

+    lib.extra_fields = {}
     lib._lib_semaphore = False

     lib.barstate.isfirst = True

src/pynecore/lib/__init__.py

Lines changed: 3 additions & 0 deletions
@@ -95,6 +95,9 @@
 # Stores data to polot
 _plot_data: dict[str, Any] = {}

+# Extra fields from CSV data (beyond OHLCV), populated each bar by ScriptRunner
+extra_fields: dict[str, Any] = {}
+
 # Lib semaphore - to prevent lib`s main function to do things it must not (plot, strategy things, etc.)
 _lib_semaphore = False
