Commit 248782b
committed
fix: UTF-8 string handling in pack/unpack - BREAKTHROUGH!
CRITICAL FIX: Numeric formats (N, V, n) now correctly read CHARACTER CODES
from UTF-8 strings, not UTF-8 bytes. This matches Perl's internal behavior.
Problem:
- When pack('W N', 0x1FFC, 0x12345678) creates a UTF-8 string, Perl stores:
* Character codes: 0x1FFC, 0x0012, 0x0034, 0x0056, 0x0078
* Physical UTF-8 bytes: e1 bf fc 12 34 56 78
- N format should read CHARACTER CODES (masking to 0xFF), not UTF-8 bytes
- Our implementation was reading UTF-8 bytes, causing corruption
Solution:
- NumericFormatHandler.NetworkLongHandler: Added UTF-8 string check
* If isUTF8Data() && isCharacterMode(): read from codePoints array
* Mask each character code to 0xFF and assemble with correct endianness
- Applied same fix to VAXLongHandler (V) and NetworkShortHandler (n)
- PackParser: Simplified calculatePackedSize to return character length
* For x[W], skip 1 character (which auto-handles UTF-8 bytes correctly)
Testing:
- unpack('x[W] N4', pack('W N4', 0x1FFC, ...)) now works correctly!
- Fixed 15+ tests (112 → 97 failures)
- All W format with binary format tests should now pass
Key Insight from perldoc analysis:
- C format reads character codes (0-255) from UTF-8 strings
- N/V formats also read character codes, masking to bytes
- x[template] skips CHARACTER COUNT, not UTF-8 byte count
- Perl automatically handles UTF-8 byte skipping when advancing positions
Remaining work:
- Apply same fix to: ShortHandler, LongHandler, VAXShortHandler, QuadHandler
- Fix group-relative . positioning in pack (48 blocked tests)
- Investigate remaining scattered failures
This is a MAJOR architectural fix that resolves the W format UTF-8/binary
mixing issue documented in PACK_UNPACK_ARCHITECTURE.md.1 parent 941ff44 commit 248782b
File tree
4 files changed
+806
-17
lines changed- docs
- src/main/java/org/perlonjava/operators
- pack
- unpack
4 files changed
+806
-17
lines changed
0 commit comments