Commit 248782b

fix: UTF-8 string handling in pack/unpack - BREAKTHROUGH!
CRITICAL FIX: Numeric formats (N, V, n) now correctly read CHARACTER CODES from UTF-8 strings, not UTF-8 bytes. This matches Perl's internal behavior.

Problem:
- When pack('W N', 0x1FFC, 0x12345678) creates a UTF-8 string, Perl stores:
  * Character codes: 0x1FFC, 0x0012, 0x0034, 0x0056, 0x0078
  * Physical UTF-8 bytes: e1 bf bc 12 34 56 78
- N format should read CHARACTER CODES (masking to 0xFF), not UTF-8 bytes
- Our implementation was reading UTF-8 bytes, causing corruption

Solution:
- NumericFormatHandler.NetworkLongHandler: Added UTF-8 string check
  * If isUTF8Data() && isCharacterMode(): read from codePoints array
  * Mask each character code to 0xFF and assemble with correct endianness
- Applied same fix to VAXLongHandler (V) and NetworkShortHandler (n)
- PackParser: Simplified calculatePackedSize to return character length
  * For x[W], skip 1 character (which auto-handles UTF-8 bytes correctly)

Testing:
- unpack('x[W] N4', pack('W N4', 0x1FFC, ...)) now works correctly!
- Fixed 15+ tests (112 → 97 failures)
- All tests that mix the W format with binary formats should now pass

Key Insight from perldoc analysis:
- C format reads character codes (0-255) from UTF-8 strings
- N/V formats also read character codes, masking to bytes
- x[template] skips CHARACTER COUNT, not UTF-8 byte count
- Perl automatically handles UTF-8 byte skipping when advancing positions

Remaining work:
- Apply same fix to: ShortHandler, LongHandler, VAXShortHandler, QuadHandler
- Fix group-relative . positioning in pack (48 blocked tests)
- Investigate remaining scattered failures

This is a MAJOR architectural fix that resolves the W format UTF-8/binary mixing issue documented in PACK_UNPACK_ARCHITECTURE.md.
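The character-code approach described above can be sketched roughly as follows. This is an illustration only, not the code from this commit: the class name CharCodeUnpackSketch and its methods are hypothetical, and the real handlers (NetworkLongHandler, VAXLongHandler, NetworkShortHandler) presumably work against the project's own state object rather than a bare int[]. The sketch assumes the string is already available as an array of character codes.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: names and signatures are assumptions, not the project's API.
final class CharCodeUnpackSketch {

    // N: read 4 character codes, mask each to 0xFF, assemble big-endian.
    static long readNetworkLong(int[] codePoints, int pos) {
        long value = 0;
        for (int i = 0; i < 4; i++) {
            value = (value << 8) | (codePoints[pos + i] & 0xFF);
        }
        return value;
    }

    // V: same masking, but the first character is the least significant byte.
    static long readVaxLong(int[] codePoints, int pos) {
        long value = 0;
        for (int i = 3; i >= 0; i--) {
            value = (value << 8) | (codePoints[pos + i] & 0xFF);
        }
        return value;
    }

    // n: read 2 character codes, mask each to 0xFF, assemble big-endian.
    static int readNetworkShort(int[] codePoints, int pos) {
        return ((codePoints[pos] & 0xFF) << 8) | (codePoints[pos + 1] & 0xFF);
    }

    // Physical UTF-8 encoding of the same characters, shown only for contrast.
    static byte[] utf8Bytes(int[] codePoints) {
        return new String(codePoints, 0, codePoints.length)
                .getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Character codes produced by pack('W N', 0x1FFC, 0x12345678).
        int[] codePoints = {0x1FFC, 0x12, 0x34, 0x56, 0x78};

        // x[W] skips one CHARACTER, so reading starts at index 1,
        // regardless of how many UTF-8 bytes U+1FFC occupies.
        int pos = 1;
        System.out.printf("N = 0x%08X%n", readNetworkLong(codePoints, pos));
        System.out.printf("V = 0x%08X%n", readVaxLong(codePoints, pos));
        System.out.printf("n = 0x%04X%n", readNetworkShort(codePoints, pos));

        // The physical buffer is 7 bytes long because U+1FFC takes 3 bytes.
        System.out.println("UTF-8 length = " + utf8Bytes(codePoints).length);
    }
}
```

Indexing by character rather than by byte is what lets x[W] advance by exactly one position here; a byte-indexed reader would have to know that U+1FFC occupies three bytes before it could find the start of the N field.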
1 parent 941ff44 commit 248782b

4 files changed: +806 -17 lines changed
