This post focuses on practical details and what they mean for your day-to-day development, with an eye toward where we’re headed next.
In our previous article, we announced the USTRING data type was coming back, and its intended role in Clarion 12’s Unicode support. Now, let’s explore the implementation details that will help you work more effectively with the USTRING.
USTRING is UTF-16: What This Means for You
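At its core, the USTRING data type uses UTF-16 encoding, storing each character as a two-byte (16-bit) code unit. Characters outside the Basic Multilingual Plane take two code units (a surrogate pair). This architectural decision provides several key advantages: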
- Native Windows support: Windows internally uses UTF-16 for all Unicode operations, making USTRING integration seamless
- Fixed-width benefits: Most common characters (including all Latin, Cyrillic, Greek, and CJK characters in the Basic Multilingual Plane) use exactly 2 bytes, simplifying string indexing
- Complete Unicode coverage: Through surrogate pairs, UTF-16 can represent every Unicode character
- Predictable memory usage: Easy calculation of memory requirements
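The difference between Basic Multilingual Plane characters and surrogate pairs is easy to check outside Clarion. The following is a minimal sketch in Python, used here only because its UTF-16 codec makes the byte counts visible. The helper name `utf16_units` is hypothetical and is not a Clarion function.

```python
def utf16_units(text: str) -> int:
    """Return the number of 16-bit code units needed to store text as UTF-16."""
    # utf-16-le writes no byte-order mark, so every 2 bytes is one code unit.
    return len(text.encode("utf-16-le")) // 2

# Latin A, Greek Omega, Cyrillic Zhe, CJK "middle": all inside the BMP.
bmp_sample = "A\u03a9\u0416\u4e2d"
# U+1F600 lies outside the BMP and needs a surrogate pair.
emoji = "\U0001F600"

print(utf16_units(bmp_sample))  # one code unit per character
print(utf16_units(emoji))       # two code units for a single character
```

For text that stays inside the BMP, the character count and the code-unit count are the same, which is what makes the two-bytes-per-character arithmetic in the rest of this post work.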
Name USTRING(21) ! 21-character Unicode string
Company USTRING('SoftVelocity') ! Initialized with a value
Phone USTRING(@P(###)###-####P) ! Formatted with picture token
MyStr USTRING(20) ! 20 characters available
When you declare USTRING(20), you’re reserving space for 20 characters plus a null terminator. Internally, this allocates 42 bytes (21 characters × 2 bytes each).
Memory Layout: USTRING(20)
Declaration: USTRING(20)
Allocation: [ 40 bytes for 20 characters (20 chars × 2 bytes) ][ 2 bytes null ]
Total Size: 42 bytes
Example with "Hello":
Position:   1   2   3   4   5   6-20      21
Character:  H   e   l   l   o   (empty)   \0 = WCHAR(0)
Bytes:      2   2   2   2   2   30        2
Total: 42 bytes allocated (40 for characters + 2 for null terminator)
Important: The null terminator WCHAR(0) occupies 2 bytes because it’s a wide character, just like every other character in the string. This is how functions like lstrlenW know where the string ends—they scan for this 2-byte null value.
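The scan for a 2-byte null can be sketched as follows. This is an illustration in Python, not Clarion code and not the actual lstrlenW implementation. The function name `wide_strlen` is hypothetical, and the sketch assumes the unused part of the buffer is zero-filled.

```python
def wide_strlen(buffer: bytes) -> int:
    """Count 16-bit units before the first 2-byte null, as a wide strlen would."""
    for offset in range(0, len(buffer) - 1, 2):
        if buffer[offset:offset + 2] == b"\x00\x00":
            return offset // 2
    return len(buffer) // 2  # no terminator found inside the buffer

CAPACITY = 20                                  # characters, as in USTRING(20)
content = "Hello".encode("utf-16-le")          # 5 characters, 10 bytes
total_bytes = (CAPACITY + 1) * 2               # 20 characters plus the terminator
buffer = content.ljust(total_bytes, b"\x00")   # assumption: unused space is zeroed

print(len(buffer), wide_strlen(buffer))
```

The scan steps two bytes at a time on purpose: the single zero byte inside an ASCII-range character such as "H" (stored as 48 00) must not be mistaken for the end of the string.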
Dual Character Set Support: The Best of Both Worlds
One of USTRING’s practical strengths is transparent handling of both Unicode and ANSI content. You can freely mix Unicode literals and ANSI strings in your code:
MYUSTR USTRING(50)
CODE
MYUSTR = u'Α Ω' ! Greek Unicode characters
MYUSTR = 'Regular Text' ! ANSI text works too
MYUSTR = u'Mix: ' & 'Α Ω' ! Concatenate both types
The runtime handles conversions automatically, respecting the current code page settings. When working with international applications, you can set the code page and locale to ensure proper character handling:
SYSTEM {PROP:Codepage} = 1253 ! Greece
SYSTEM {PROP:Locale} = 1032 ! Greece
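Why the code page matters for ANSI content can be shown with a small experiment outside Clarion. The sketch below uses Python's standard `cp1253` and `cp1252` codecs to stand in for the Windows code pages. It illustrates the general ANSI-to-UTF-16 conversion idea and is not a description of how the Clarion runtime performs it.

```python
GREEK_CODE_PAGE = "cp1253"   # Windows code page 1253 (Greek)
LATIN_CODE_PAGE = "cp1252"   # Windows code page 1252 (Western European)

# Greek capital Alpha, space, Greek capital Omega as single-byte ANSI text.
ansi_bytes = "\u0391 \u03a9".encode(GREEK_CODE_PAGE)

as_greek = ansi_bytes.decode(GREEK_CODE_PAGE)   # interpreted with the right code page
as_latin = ansi_bytes.decode(LATIN_CODE_PAGE)   # same bytes, wrong code page

wide_bytes = as_greek.encode("utf-16-le")       # two bytes per character

print(ansi_bytes.hex(" "), len(wide_bytes))
print(as_greek == as_latin)
```

The same three ANSI bytes decode to Greek letters under code page 1253 and to accented Latin letters under 1252, which is why the code page should be set before ANSI text is assigned to a Unicode string.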
LEN() vs SIZE(): A Critical Distinction
This is where developers first encounter USTRING’s two-byte nature. The distinction between LEN() and SIZE() directly reflects the UTF-16 implementation:
MYUSTR USTRING(20)
L LONG
S LONG
CODE
MYUSTR = u'Α Ω' ! 3 Unicode characters
L = LEN(MYUSTR) ! L = 3 (character count)
S = SIZE(MYUSTR) ! S = 40 (20 characters × 2 bytes)
LEN() returns the logical length—the number of characters actually stored in the string. This is what you typically care about when processing text.
SIZE() returns the allocated byte capacity. For a USTRING(20), SIZE() always returns 40, regardless of how many characters you’ve stored. This represents the maximum storage available.
Understanding this distinction matters when:
- Allocating buffers for string operations
- Interfacing with external APIs that expect byte counts
- Optimizing memory usage in data structures
- Working with file I/O operations
How SIZE() Actually Works
For fixed-size declarations like USTRING(20), SIZE() is calculated by the Clarion compiler at compile-time. The compiler knows the capacity is 20 characters and generates code that returns 20 × 2 = 40 directly—no runtime function call needed.
This is why SIZE() is so fast: it’s just a constant value, not a calculation that happens when your code runs.
When LEN() and SIZE() Differ: A Practical Example
UserInput USTRING(100) ! Allocated capacity: 100 chars
Bytes LONG
Chars LONG
CODE
UserInput = '' ! Empty string
Chars = LEN(UserInput) ! Chars = 0 (no content)
Bytes = SIZE(UserInput) ! Bytes = 200 (capacity still allocated)
UserInput = u'Hi' ! Short string
Chars = LEN(UserInput) ! Chars = 2 (actual content)
Bytes = SIZE(UserInput) ! Bytes = 200 (capacity unchanged)
! Key insight: SIZE() never changes after declaration
! LEN() reflects actual content
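The relationship between the two values can also be expressed as a toy model. This is a sketch in Python, not Clarion's implementation. The class `FixedWideString` and its methods are hypothetical, truncation on overflow is an assumption, and only BMP text is considered.

```python
class FixedWideString:
    """Toy model of a fixed-capacity UTF-16 string (not Clarion's implementation)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity   # characters, as in USTRING(capacity)
        self.value = ""

    def assign(self, text: str) -> None:
        # Assumption: content longer than the capacity is truncated.
        self.value = text[: self.capacity]

    def length(self) -> int:
        # Plays the role of LEN(): characters currently stored.
        return len(self.value)

    def size(self) -> int:
        # Plays the role of SIZE(): fixed byte capacity, independent of content.
        return self.capacity * 2

user_input = FixedWideString(100)
print(user_input.length(), user_input.size())   # empty string
user_input.assign("Hi")
print(user_input.length(), user_input.size())   # after assigning "Hi"
```

Only `length()` depends on the content. `size()` is derived from the declared capacity alone, which mirrors why SIZE() can be resolved at compile time.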
Memory Allocation: Understanding the 2:1 Ratio
When you declare a USTRING, the character storage is double the character count you specify (the 2-byte null terminator described earlier is additional):
Small USTRING(10) ! 20 bytes of character storage (10 × 2)
Medium USTRING(100) ! 200 bytes of character storage (100 × 2)
Large USTRING(1000) ! 2000 bytes of character storage (1000 × 2)
Right-Sizing Your Strings
Choose appropriate sizes to avoid wasting memory. Here’s what oversizing costs:
! Good - sized appropriately
FirstName USTRING(50) ! 100 bytes allocated
! Wasteful - unnecessarily large
UserName USTRING(500) ! 1000 bytes allocated
! If only ~50 chars used: 100 used, 900 wasted
Comment USTRING(5000) ! 10,000 bytes allocated
! If only ~100 chars used: 200 used, 9,800 wasted
When Memory Size Actually Matters
Understanding when to worry about USTRING memory overhead:
! Scenario 1: Single string - overhead is negligible
CustomerName USTRING(100) ! 200 bytes total
! Impact: Minimal - 100 extra bytes compared to ANSI
! Scenario 2: Large collections - overhead multiplies
CustomerQueue QUEUE
Name USTRING(100) ! 200 bytes
Address USTRING(200) ! 400 bytes
City USTRING(50) ! 100 bytes
END
! Impact with 100,000 records in queue:
! USTRING: 70,000,000 bytes (70 MB)
! STRING: 35,000,000 bytes (35 MB)
! Difference: 35 MB - this is where sizing matters!
! Or with an array:
CustomerArray USTRING(100), DIM(100000) ! 20,000,000 bytes (20 MB)
! vs STRING(100), DIM(100000) ! 10,000,000 bytes (10 MB)
Rule of thumb: For individual strings, use generous sizes. For large queues, arrays, or tables, size more carefully.
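The queue figures above follow from simple arithmetic, reproduced here as a short Python sketch so you can plug in your own field sizes. The function name `record_bytes` is hypothetical, and null terminators are ignored, as they are in the figures above.

```python
def record_bytes(field_chars, bytes_per_char):
    """Bytes for one record, ignoring terminators as the figures above do."""
    return sum(field_chars) * bytes_per_char

FIELDS = [100, 200, 50]   # Name, Address, City capacities in characters
RECORDS = 100_000

ustring_total = record_bytes(FIELDS, 2) * RECORDS   # UTF-16: 2 bytes per character
string_total = record_bytes(FIELDS, 1) * RECORDS    # ANSI: 1 byte per character

print(ustring_total, string_total, ustring_total - string_total)
```

The difference grows linearly with both the record count and the declared field sizes, so trimming an oversized field in a large queue pays off far more than trimming a single standalone variable.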
Design-Time vs Runtime Allocation
! Design-time: Fixed size declared in source
MyStr USTRING(100) ! 200 bytes allocated at compile time
! Runtime: Dynamic allocation with NEW
MyRef &USTRING ! Reference to dynamically allocated string (distinct label from MyStr above)
CODE
MyRef &= NEW USTRING(100) ! 200 bytes allocated at runtime
Design-time declarations have a maximum size of 4MB, while runtime allocations can be sized dynamically based on your application’s needs.
Working with Unicode Literals
When initializing or assigning to a USTRING, use the U or u prefix for Unicode literals:
MyStr USTRING(50)
CODE
MyStr = U'Ω α β' ! Correct - U prefix for Unicode
MyStr = u'Ω α β' ! Also correct - lowercase works too
MyStr = 'Ω α β' ! Works but may not preserve Unicode properly
Practical Implications for Your Code
Character Access Is Read-Only
You can read an individual character using index syntax, but you cannot assign to a character position:
C = MyStr[5] ! Read character at position 5 - ALLOWED
MyStr[1] = 'A' ! ERROR - Not allowed, creates invalid string
This restriction maintains string integrity in the UTF-16 implementation.
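One plausible way a single-position write can damage a UTF-16 string is by overwriting half of a surrogate pair. The sketch below demonstrates that effect in Python. It is an illustration of the general UTF-16 hazard, not a statement of how or why the Clarion runtime enforces the restriction, and the helper names are hypothetical.

```python
def overwrite_unit(buffer: bytes, index: int, char: str) -> bytes:
    """Replace the 16-bit code unit at a zero-based index with one BMP character."""
    unit = char.encode("utf-16-le")
    if len(unit) != 2:
        raise ValueError("replacement must be a single BMP character")
    start = index * 2
    return buffer[:start] + unit + buffer[start + 2:]

def is_valid_utf16(buffer: bytes) -> bool:
    """Return True if the bytes decode as well-formed little-endian UTF-16."""
    try:
        buffer.decode("utf-16-le")
        return True
    except UnicodeDecodeError:
        return False

# "a" followed by U+1F600: one BMP unit plus one surrogate pair (3 units).
original = "a\U0001F600".encode("utf-16-le")
# Overwrite only the high surrogate, leaving the low surrogate orphaned.
damaged = overwrite_unit(original, 1, "A")

print(is_valid_utf16(original), is_valid_utf16(damaged))
```

Replacing a whole string, or building a new one by concatenation, avoids this problem because no code unit is ever separated from its partner.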
Use LEN() for Logic, SIZE() for Memory
! Correct usage
IF LEN(UserInput) > 0 ! Check if string has content
! Process input
END
! Memory allocation calculation
BytesNeeded = SIZE(MyStr) ! Get total allocated bytes
Current Limitations
The current implementation doesn’t support Unicode strings in these specific contexts:
- EVALUATE statement
- MATCH built-in function
- STRPOS built-in function
These are implementation-specific constraints that may be addressed in future releases.
Working Example: Practical USTRING Usage
PROGRAM
MAP
END
MyName USTRING(50)
MyCompany USTRING(100)
FullInfo USTRING(200)
CharCount LONG
ByteCount LONG
CODE
! Assign Unicode content
MyName = u'Αλέξανδρος' ! Greek name
MyCompany = u'SoftVelocity' ! Company name
! Concatenate strings
FullInfo = MyName & u' - ' & MyCompany
! Get character count and byte size
CharCount = LEN(FullInfo) ! Actual characters in string
ByteCount = SIZE(FullInfo) ! Total bytes allocated
! Display results
MESSAGE('Name: ' & MyName & |
'|Characters: ' & CharCount & |
'|Bytes Allocated: ' & ByteCount)
Behind the Scenes: What Happens When You Concatenate
When you write a string expression like this:
Result = FirstName & ' ' & LastName
The Clarion runtime evaluates it using a string stack—a temporary workspace for building the final result. Here’s the step-by-step process:
String Expression Evaluation
Step 1: Push FirstName onto stack → Stack: [FirstName]
Step 2: Push ' ' onto stack → Stack: [FirstName][' ']
Step 3: Concatenate top 2 items → Stack: [FirstName ]
Step 4: Push LastName onto stack → Stack: [FirstName ][LastName]
Step 5: Concatenate top 2 items → Stack: [FirstName LastName]
Step 6: Pop result into Result variable → Result gets final string
This stack-based approach doesn’t create temporary variables that need cleanup. The runtime handles all intermediate strings automatically, and they vanish when the expression completes.
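The six steps above can be sketched as a tiny stack evaluator. This is a conceptual model in Python, not the Clarion runtime. The `evaluate` function, the operation tuples, and the sample names are all hypothetical.

```python
def evaluate(ops, variables):
    """Evaluate ('push', name_or_literal) and ('concat',) steps on a stack."""
    stack = []
    for op in ops:
        if op[0] == "push":
            operand = op[1]
            # Known names are looked up; anything else is treated as a literal.
            stack.append(variables.get(operand, operand))
        elif op[0] == "concat":
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
        else:
            raise ValueError(f"unknown operation: {op[0]}")
    result = stack.pop()
    assert not stack, "stack should be empty after the expression completes"
    return result

variables = {"FirstName": "Ada", "LastName": "Lovelace"}
steps = [
    ("push", "FirstName"),   # Step 1
    ("push", " "),           # Step 2
    ("concat",),             # Step 3
    ("push", "LastName"),    # Step 4
    ("concat",),             # Step 5
]
print(evaluate(steps, variables))   # Step 6: pop the result
```

The intermediate value produced in step 3 lives only on the stack and is consumed by step 5, which is the sense in which no named temporary is left behind.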
Why this matters for you:
- Write complex expressions freely – No performance penalty for chaining operations
- No memory leaks – Intermediate results are cleaned up automatically
- Thread-safe by design – Each thread has its own string stack, no locking needed
- Efficient memory use – Stack allocation is faster than heap allocation for temporaries
Performance Implication
The string stack is why expressions like Name = FirstName & ' ' & MiddleName & ' ' & LastName don’t create memory leaks or slow down your application. Each intermediate result (FirstName & ' ', etc.) exists only temporarily on the stack and is automatically cleaned up.
Best practice: Write natural, readable string expressions. The runtime is optimized for this pattern.
Common Pitfalls and How to Avoid Them
Pitfall 1: Using SIZE() When You Mean LEN()
! WRONG - This won't work as expected
Name USTRING(50)
CODE
Name = u'John'
IF SIZE(Name) > 10 ! Always TRUE (SIZE is 100, not 8)
! This always executes
END
! CORRECT - Use LEN() for content checks
IF LEN(Name) > 10 ! FALSE (LEN is 4)
! This executes only when needed
END
Pitfall 2: Buffer Size Confusion
! WRONG - Sizing a byte buffer by character count
Name USTRING(50)
Buffer STRING(50) ! Too small! Only 50 bytes, need 100
! CORRECT - Use SIZE() for byte allocations
Buffer STRING(SIZE(Name)) ! Correct: 100 bytes
Pitfall 3: Forgetting the U Prefix
Greek USTRING(20)
CODE
! INEFFICIENT - ANSI string converted to Unicode at runtime
Greek = 'Αθήνα'
! EFFICIENT - Direct Unicode assignment, no conversion
Greek = u'Αθήνα'
Migration from STRING to USTRING: Real Examples
Example 1: Buffer Sizing
! Before (ANSI STRING)
Name STRING(50) ! 50 bytes
Buffer STRING(SIZE(Name)) ! 50 bytes
! After (USTRING)
Name USTRING(50) ! 100 bytes (50 × 2)
Buffer STRING(SIZE(Name)) ! 100 bytes - SIZE() handles it correctly
Example 2: Loop Iterations
! Before (ANSI STRING)
Text STRING(100)
I LONG
CODE
LOOP I = 1 TO LEN(Text) ! Good - use LEN() not SIZE()
! Process Text[I]
END
! After (USTRING)
Text USTRING(100)
I LONG
CODE
LOOP I = 1 TO LEN(Text) ! Same - LEN() still correct
! Process Text[I]
END
! KEY: LEN() works the same way for both types!
Example 3: API Calls
! Before (ANSI STRING)
Buffer STRING(1000)
Size LONG
CODE
Size = SIZE(Buffer) ! 1000 bytes
! Pass Size to Windows API expecting byte count
! After (USTRING)
Buffer USTRING(1000)
Size LONG
CODE
Size = SIZE(Buffer) ! 2000 bytes (1000 × 2)
! SIZE() correctly returns byte count for Unicode APIs
Migration Considerations
When moving existing ANSI string code to USTRING:
- Use LEN() for character-based logic, not SIZE()
- Add the U prefix to string literals containing Unicode characters
- Test with international character sets if your application supports them
- Be aware of the EVALUATE, MATCH, and STRPOS limitations
- Review buffer size calculations: you may need double the byte count you used with ANSI
Looking Forward
The USTRING implementation provides a solid foundation for Unicode support while maintaining the Clarion language’s characteristic simplicity. The UTF-16 encoding, dual character set support, and clear LEN/SIZE distinction give you the tools needed for modern, international applications.
Key takeaways:
- USTRING uses UTF-16 encoding (2 bytes per character)
- Automatic conversion between ANSI and Unicode character sets
- LEN() returns character count; SIZE() returns byte count
- USTRING(n) reserves n × 2 bytes for characters, plus a 2-byte null terminator
- Code page awareness ensures proper locale handling
- Runtime uses string stack for efficient expression evaluation
Thanks for being part of the Clarion community. If you try this out, let us know what you think — and stay tuned, there’s more to come.








