Monthly Archives: January 2026

Understanding USTRING: A Deep Dive into Clarion 12’s UTF-16 Implementation

This post focuses on practical details and what they mean for your day-to-day development, with an eye toward where we’re headed next.

In our previous article, we announced that the USTRING data type was coming back and outlined its intended role in Clarion 12’s Unicode support. Now, let’s explore the implementation details that will help you work more effectively with USTRING.

USTRING is UTF-16: What This Means for You

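At its core, the USTRING data type uses UTF-16 encoding, which stores text as 16-bit (two-byte) code units. Every character in the Basic Multilingual Plane takes exactly one unit, so for most text this means two bytes per character. This architectural decision provides several key advantages: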

  • Native Windows support: Windows internally uses UTF-16 for all Unicode operations, making USTRING integration seamless
  • Fixed-width benefits: Most common characters (including all Latin, Cyrillic, Greek, and CJK characters in the Basic Multilingual Plane) use exactly 2 bytes, simplifying string indexing
  • Complete Unicode coverage: Through surrogate pairs, UTF-16 can represent every Unicode character
  • Predictable memory usage: Easy calculation of memory requirements
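
To put numbers on the fixed-width and surrogate-pair bullets, here is a small Python sketch. It is not Clarion code: it uses Python’s standard `utf-16-le` codec as a neutral way to count 16-bit code units, and the helper name `utf16_units` is made up for this illustration.

```python
def utf16_units(text: str) -> int:
    """Return how many 16-bit UTF-16 code units are needed to store text."""
    return len(text.encode("utf-16-le")) // 2

samples = {
    "Latin A": "A",
    "Greek capital omega": "\u03a9",
    "CJK U+4E2D": "\u4e2d",
    "Emoji U+1F600 (outside the BMP)": "\U0001F600",
}

for label, ch in samples.items():
    units = utf16_units(ch)
    print(f"{label}: {units} code unit(s), {units * 2} bytes")
```

The first three characters need one code unit (2 bytes) each; the emoji sits above U+FFFF and needs a surrogate pair, so it takes two units (4 bytes).
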

A few typical declarations:

Name     USTRING(21)              ! 21-character Unicode string
Company  USTRING('SoftVelocity')  ! Initialized with a value
Phone    USTRING(@P(###)###-####P) ! Formatted with picture token
MyStr    USTRING(20)              ! 20 characters available

When you declare USTRING(20), you’re reserving space for 20 characters plus a null terminator. Internally, this allocates 42 bytes (21 characters × 2 bytes each).

Memory Layout: USTRING(20)

Declaration:  USTRING(20)
Allocation:   [    40 bytes for 20 characters    ][2 bytes null]
              |<────────── 20 chars × 2 bytes ─────>|
Total Size:   42 bytes

Example with "Hello":
Position:     1    2    3    4    5    6-20  21
Character:    H    e    l    l    o    (empty) \0
Bytes:       [H ][e ][l ][l ][o ][  ...  ][\0]
              2   2   2   2   2      30      2
                                          WCHAR(0)
              <─── 40 bytes for data ────> <2>

Total: 42 bytes allocated (40 for characters + 2 for null terminator)

Important: The null terminator WCHAR(0) occupies 2 bytes because it’s a wide character, just like every other character in the string. This is how functions like lstrlenW know where the string ends—they scan for this 2-byte null value.
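
The layout above can be reproduced byte for byte outside Clarion. The Python sketch below builds the 42-byte buffer for "Hello" and then finds the end of the string by scanning for the 2-byte null, which is the same idea a wide-string length function relies on. The function names are hypothetical, and zero-filling the unused area is an assumption (the diagram only labels it as empty).

```python
def layout_ustring(text: str, capacity_chars: int) -> bytes:
    """Build a buffer of (capacity + 1) 16-bit units holding text as UTF-16LE."""
    data = text.encode("utf-16-le")
    if len(data) > capacity_chars * 2:
        raise ValueError("text does not fit in the declared capacity")
    total_bytes = (capacity_chars + 1) * 2          # data area plus 2-byte terminator
    return data + b"\x00" * (total_bytes - len(data))  # assumption: unused area is zeroed

def wide_strlen(buffer: bytes) -> int:
    """Count 16-bit units before the first 2-byte null, stepping two bytes at a time."""
    for i in range(0, len(buffer) - 1, 2):
        if buffer[i:i + 2] == b"\x00\x00":
            return i // 2
    return len(buffer) // 2

buf = layout_ustring("Hello", 20)
print(len(buf))          # 42
print(wide_strlen(buf))  # 5
```

Note that the scan moves in 2-byte steps. The letter H is stored as the bytes 48 00, so a byte-by-byte scan for a single zero would stop after the first character.
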

Dual Character Set Support: The Best of Both Worlds

One of USTRING’s practical strengths is transparent handling of both Unicode and ANSI content. You can freely mix Unicode literals and ANSI strings in your code:

MYUSTR  USTRING(50)

CODE
  MYUSTR = u'Α Ω'            ! Greek Unicode characters
  MYUSTR = 'Regular Text'    ! ANSI text works too
  MYUSTR = u'Mix: ' & 'Α Ω' ! Concatenate both types

The runtime handles conversions automatically, respecting the current code page settings. When working with international applications, you can set the code page and locale to ensure proper character handling:

SYSTEM {PROP:Codepage} = 1253   ! Greece
SYSTEM {PROP:Locale} = 1032     ! Greece
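
Why the code page matters is easiest to see with the raw bytes. The following Python sketch is only an analogy for what an ANSI-to-Unicode conversion has to do, not a description of the Clarion runtime; `ansi_to_utf16` is a made-up helper, and `cp1253` and `cp1252` are Python’s names for the Greek and Western European Windows code pages.

```python
def ansi_to_utf16(ansi_bytes: bytes, codepage: str) -> bytes:
    """Decode single-byte ANSI text with the given code page, re-encode as UTF-16LE."""
    return ansi_bytes.decode(codepage).encode("utf-16-le")

greek_ansi = "\u0391 \u03a9".encode("cp1253")   # 'Α Ω' stored as 3 ANSI bytes

right = ansi_to_utf16(greek_ansi, "cp1253").decode("utf-16-le")
wrong = ansi_to_utf16(greek_ansi, "cp1252").decode("utf-16-le")

print(right)   # Α Ω
print(wrong)   # Á Ù
```

The same three bytes turn into different characters depending on the code page used to interpret them, which is why the code page should match the text your application actually handles.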

LEN() vs SIZE(): A Critical Distinction

This is where developers first encounter USTRING’s two-byte nature. The distinction between LEN() and SIZE() directly reflects the UTF-16 implementation:

MYUSTR  USTRING(20)
L       LONG
S       LONG

CODE
  MYUSTR = u'Α Ω'            ! 3 Unicode characters
  L = LEN(MYUSTR)            ! L = 3 (character count)
  S = SIZE(MYUSTR)           ! S = 40 (20 characters × 2 bytes)

LEN() returns the logical length—the number of characters actually stored in the string. This is what you typically care about when processing text.

SIZE() returns the byte capacity of the character data. For a USTRING(20), SIZE() always returns 40 (20 characters × 2 bytes), regardless of how many characters you’ve stored. Going by the figures above, the 2-byte null terminator is not part of that count.

Understanding this distinction matters when:

  • Allocating buffers for string operations
  • Interfacing with external APIs that expect byte counts
  • Optimizing memory usage in data structures
  • Working with file I/O operations
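
For readers who like to see the two measurements side by side, here is a toy Python model of a fixed-capacity UTF-16 string. It is a sketch for illustration only and not how Clarion implements USTRING: the class and method names are invented, and raising an error on overflow is an assumption made to keep the model simple.

```python
class UStringModel:
    """Toy model of a fixed-capacity UTF-16 string; not Clarion's implementation."""

    def __init__(self, capacity_chars: int):
        self.capacity_chars = capacity_chars
        self._units = b""                      # UTF-16LE content, no terminator

    def assign(self, text: str) -> None:
        data = text.encode("utf-16-le")
        if len(data) > self.capacity_chars * 2:
            raise ValueError("text exceeds declared capacity")  # assumed behaviour
        self._units = data

    def length(self) -> int:
        """Plays the role of LEN(): 16-bit units currently stored."""
        return len(self._units) // 2

    def size(self) -> int:
        """Plays the role of SIZE(): byte capacity fixed at declaration."""
        return self.capacity_chars * 2

s = UStringModel(20)
s.assign("\u0391 \u03a9")          # 'Α Ω'
print(s.length(), s.size())        # 3 40
s.assign("")
print(s.length(), s.size())        # 0 40
```

When an external API asks for a count, the question to answer first is which of these two numbers it wants.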

How SIZE() Actually Works

For fixed-size declarations like USTRING(20), SIZE() is calculated by the Clarion compiler at compile-time. The compiler knows the capacity is 20 characters and generates code that returns 20 × 2 = 40 directly—no runtime function call needed.

This is why SIZE() is so fast: it’s just a constant value, not a calculation that happens when your code runs.

When LEN() and SIZE() Differ: A Practical Example

UserInput  USTRING(100)      ! Allocated capacity: 100 chars
Bytes      LONG
Chars      LONG

CODE
  UserInput = ''             ! Empty string
  Chars = LEN(UserInput)     ! Chars = 0 (no content)
  Bytes = SIZE(UserInput)    ! Bytes = 200 (capacity still allocated)

  UserInput = u'Hi'          ! Short string
  Chars = LEN(UserInput)     ! Chars = 2 (actual content)
  Bytes = SIZE(UserInput)    ! Bytes = 200 (capacity unchanged)

  ! Key insight: SIZE() never changes after declaration
  ! LEN() reflects actual content

Memory Allocation: Understanding the 2:1 Ratio

When you declare a USTRING, the character storage is double the character count you specify; the 2-byte null terminator described earlier comes on top of that:

Small   USTRING(10)              ! 20 bytes of character storage (10 × 2)
Medium  USTRING(100)             ! 200 bytes of character storage (100 × 2)
Large   USTRING(1000)            ! 2000 bytes of character storage (1000 × 2)

Right-Sizing Your Strings

Choose appropriate sizes to avoid wasting memory. Here’s what oversizing costs:

! Good - sized appropriately
FirstName  USTRING(50)            ! 100 bytes allocated

! Wasteful - unnecessarily large
UserName   USTRING(500)           ! 1000 bytes allocated
                                  ! If only ~50 chars used: 100 used, 900 wasted

Comment    USTRING(5000)          ! 10,000 bytes allocated
                                  ! If only ~100 chars used: 200 used, 9,800 wasted
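
One way to pick a size is to measure representative data first. The Python sketch below is a hypothetical helper, not a Clarion tool: it finds the longest sample value in UTF-16 code units, adds a percentage of headroom, and reports the resulting declaration size and its character storage in bytes.

```python
def suggest_capacity(samples, headroom_percent=25):
    """Suggest a USTRING(n) size from sample values, counted in UTF-16 code units."""
    longest = max((len(s.encode("utf-16-le")) // 2 for s in samples), default=0)
    extra = (longest * headroom_percent + 99) // 100   # headroom, rounded up
    capacity = longest + extra
    return capacity, capacity * 2                      # (characters, bytes)

names = [
    "Ann",
    "\u0391\u03bb\u03ad\u03be\u03b1\u03bd\u03b4\u03c1\u03bf\u03c2",  # Greek name, 10 characters
    "Maximilian",
]
chars, nbytes = suggest_capacity(names)
print(chars, nbytes)   # 13 26
```

With these samples the longest value is 10 characters, so 25% headroom suggests USTRING(13) rather than a blanket USTRING(500).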

When Memory Size Actually Matters

Understanding when to worry about USTRING memory overhead:

! Scenario 1: Single string - overhead is negligible
CustomerName  USTRING(100)     ! 200 bytes total
! Impact: Minimal - 100 extra bytes compared to ANSI

! Scenario 2: Large collections - overhead multiplies
CustomerQueue QUEUE
Name            USTRING(100)   ! 200 bytes
Address         USTRING(200)   ! 400 bytes
City            USTRING(50)    ! 100 bytes
              END

! Impact with 100,000 records in queue:
! USTRING: 70,000,000 bytes (70 MB)
! STRING:  35,000,000 bytes (35 MB)
! Difference: 35 MB - this is where sizing matters!

! Or with an array:
CustomerArray USTRING(100), DIM(100000)  ! 20,000,000 bytes (20 MB)
! vs STRING(100), DIM(100000)            ! 10,000,000 bytes (10 MB)

Rule of thumb: For individual strings, use generous sizes. For large queues, arrays, or tables, size more carefully.
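
The queue figures above are simple multiplication, and it can be handy to rerun them with your own field sizes and record counts. This Python sketch reproduces the article’s numbers; like them, it ignores null terminators and any per-record overhead, and `record_bytes` is a name chosen for this example.

```python
def record_bytes(field_chars, bytes_per_char):
    """Bytes of character storage for one record, given each field's declared size."""
    return sum(field_chars) * bytes_per_char

fields = [100, 200, 50]          # Name, Address, City
records = 100_000

ustring_total = record_bytes(fields, 2) * records
string_total = record_bytes(fields, 1) * records

print(ustring_total)                 # 70000000
print(string_total)                  # 35000000
print(ustring_total - string_total)  # 35000000
```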

Design-Time vs Runtime Allocation

! Design-time: Fixed size declared in source
MyStr  USTRING(100)          ! 200 bytes allocated at compile time

! Runtime: Dynamic allocation with NEW
MyStr &USTRING               ! Reference to dynamically allocated string
CODE
  MyStr &= NEW USTRING(100)  ! 200 bytes allocated at runtime

Design-time declarations have a maximum size of 4MB, while runtime allocations can be sized dynamically based on your application’s needs.

Working with Unicode Literals

When initializing or assigning to a USTRING, use the U or u prefix for Unicode literals:

MyStr USTRING(50)
CODE
  MyStr = U'Ω α β'     ! Correct - U prefix for Unicode
  MyStr = u'Ω α β'     ! Also correct - lowercase works too
  MyStr = 'Ω α β'      ! Works but may not preserve Unicode properly

Practical Implications for Your Code

Character Access Is Read-Only

You can read individual characters using index syntax, but cannot assign to them:

C = MyStr[5]        ! Read character at position 5 - ALLOWED
MyStr[1] = 'A'      ! ERROR - Not allowed, creates invalid string

This restriction maintains string integrity in the UTF-16 implementation.
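
One way an in-place write can damage a UTF-16 string is by overwriting half of a surrogate pair. Whether this is the exact motivation for the restriction in Clarion is an assumption on our part; the Python sketch below simply shows the failure mode, using a hypothetical `overwrite_unit` helper and a 0-based index (Clarion positions are 1-based).

```python
def overwrite_unit(utf16le: bytes, index: int, ch: str) -> bytes:
    """Replace the 16-bit unit at a 0-based index with a single BMP character."""
    unit = ch.encode("utf-16-le")
    if len(unit) != 2:
        raise ValueError("replacement must be a single BMP character")
    start = index * 2
    return utf16le[:start] + unit + utf16le[start + 2:]

original = "\U0001F600!".encode("utf-16-le")   # emoji (2 units) followed by '!'
damaged = overwrite_unit(original, 0, "A")     # clobbers the high surrogate only

try:
    damaged.decode("utf-16-le")
    valid = True
except UnicodeDecodeError:
    valid = False

print(valid)   # False
```

After the write, the buffer contains a low surrogate with no partner, which is not valid UTF-16.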

Use LEN() for Logic, SIZE() for Memory

! Correct usage
IF LEN(UserInput) > 0            ! Check if string has content
  ! Process input
END

! Memory allocation calculation
BytesNeeded = SIZE(MyStr)        ! Get total allocated bytes

Current Limitations

The current implementation doesn’t support Unicode strings in these specific contexts:

  • EVALUATE statement
  • MATCH built-in function
  • STRPOS built-in function

These are implementation-specific constraints that may be addressed in future releases.

Working Example: Practical USTRING Usage

MAP
  ! No external API calls are needed for this example
END

MyName    USTRING(50)
MyCompany USTRING(100)
FullInfo  USTRING(200)
CharCount LONG
ByteCount LONG

CODE
  ! Assign Unicode content
  MyName = u'Αλέξανδρος'       ! Greek name
  MyCompany = u'SoftVelocity'   ! Company name
  
  ! Concatenate strings
  FullInfo = MyName & u' - ' & MyCompany
  
  ! Get character count and byte size
  CharCount = LEN(FullInfo)     ! Actual characters in string
  ByteCount = SIZE(FullInfo)    ! Total bytes allocated
  
  ! Display results
  MESSAGE('Name: ' & MyName & |
          '|Characters: ' & CharCount & |
          '|Bytes Allocated: ' & ByteCount)

Behind the Scenes: What Happens When You Concatenate

When you write a string expression like this:

Result = FirstName & ' ' & LastName

The Clarion runtime evaluates it using a string stack—a temporary workspace for building the final result. Here’s the step-by-step process:

String Expression Evaluation

Step 1: Push FirstName onto stack       → Stack: [FirstName]
Step 2: Push ' ' onto stack             → Stack: [FirstName][' ']
Step 3: Concatenate top 2 items         → Stack: [FirstName ]
Step 4: Push LastName onto stack        → Stack: [FirstName ][LastName]
Step 5: Concatenate top 2 items         → Stack: [FirstName LastName]
Step 6: Pop result into Result variable → Result gets final string

This stack-based approach doesn’t create temporary variables that need cleanup. The runtime handles all intermediate strings automatically, and they vanish when the expression completes.
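
The six steps can be mimicked with an ordinary list used as a stack. The Python sketch below is a conceptual model only; the operation names and the `evaluate` function are invented for this illustration and say nothing about how the Clarion runtime is actually written.

```python
def evaluate(ops):
    """Run ('push', value) and ('concat',) steps on a stack; return result and trace."""
    stack, trace = [], []
    for op in ops:
        if op[0] == "push":
            stack.append(op[1])
        elif op[0] == "concat":
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)      # replace the top two items with their join
        else:
            raise ValueError(f"unknown operation: {op[0]}")
        trace.append(list(stack))           # snapshot of the stack after each step
    result = stack.pop()                    # final pop into the target variable
    if stack:
        raise ValueError("unbalanced expression")
    return result, trace

first_name, last_name = "Ada", "Lovelace"
result, trace = evaluate([
    ("push", first_name), ("push", " "), ("concat",),
    ("push", last_name), ("concat",),
])
print(result)   # Ada Lovelace
```

Once the result is popped, the stack is empty again, which is the sense in which the intermediate strings "vanish".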

Why this matters for you:

  • Write complex expressions freely – No performance penalty for chaining operations
  • No memory leaks – Intermediate results are cleaned up automatically
  • Thread-safe by design – Each thread has its own string stack, no locking needed
  • Fast temporaries – Stack allocation is faster than heap allocation for intermediate results

Performance Implication

The string stack is why expressions like Name = FirstName & ' ' & MiddleName & ' ' & LastName don’t create memory leaks or slow down your application. Each intermediate result (FirstName & ' ', etc.) exists only temporarily on the stack and is automatically cleaned up.

Best practice: Write natural, readable string expressions. The runtime is optimized for this pattern.

Common Pitfalls and How to Avoid Them

Pitfall 1: Using SIZE() When You Mean LEN()

! WRONG - This won't work as expected
Name  USTRING(50)
CODE
  Name = u'John'
  IF SIZE(Name) > 10         ! Always TRUE (SIZE is 100, not 8)
    ! This always executes
  END

! CORRECT - Use LEN() for content checks
  IF LEN(Name) > 10          ! FALSE (LEN is 4)
    ! This executes only when needed
  END

Pitfall 2: Buffer Size Confusion

! WRONG - Sizing a byte buffer from the character count
Name     USTRING(50)
Buffer   STRING(50)           ! Too small! Only 50 bytes, need 100

! CORRECT - Use SIZE() for byte allocations
Buffer   STRING(SIZE(Name))   ! Correct: 100 bytes

Pitfall 3: Forgetting the U Prefix

Greek  USTRING(20)
CODE
  ! INEFFICIENT - ANSI string converted to Unicode at runtime
  Greek = 'Αθήνα'

  ! EFFICIENT - Direct Unicode assignment, no conversion
  Greek = u'Αθήνα'
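
Besides the conversion cost, an unprefixed literal can lose characters if it passes through a code page that does not contain them. The Python sketch below illustrates that risk under the assumption that the literal is first represented in the active ANSI code page; `through_codepage` is a hypothetical helper and the code page names are Python’s.

```python
def through_codepage(text: str, codepage: str) -> str:
    """Round-trip text through a single-byte code page, replacing unmappable characters."""
    return text.encode(codepage, errors="replace").decode(codepage)

athens = "\u0391\u03b8\u03ae\u03bd\u03b1"    # the Greek word for Athens, 5 characters

print(through_codepage(athens, "cp1253"))    # Αθήνα
print(through_codepage(athens, "cp1252"))    # ?????
```

With the Greek code page the text survives; with the Western European one every character is replaced. The u prefix avoids depending on the code page at all.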

Migration from STRING to USTRING: Real Examples

Example 1: Buffer Sizing

! Before (ANSI STRING)
Name    STRING(50)           ! 50 bytes
Buffer  STRING(SIZE(Name))   ! 50 bytes

! After (USTRING)
Name    USTRING(50)          ! 100 bytes (50 × 2)
Buffer  STRING(SIZE(Name))   ! 100 bytes - SIZE() handles it correctly

Example 2: Loop Iterations

! Before (ANSI STRING)
Text  STRING(100)
I     LONG
CODE
  LOOP I = 1 TO LEN(Text)    ! Good - use LEN() not SIZE()
    ! Process Text[I]
  END

! After (USTRING)
Text  USTRING(100)
I     LONG
CODE
  LOOP I = 1 TO LEN(Text)    ! Same - LEN() still correct
    ! Process Text[I]
  END
  ! KEY: LEN() works the same way for both types!

Example 3: API Calls

! Before (ANSI STRING)
Buffer  STRING(1000)
Size    LONG
CODE
  Size = SIZE(Buffer)        ! 1000 bytes
  ! Pass Size to Windows API expecting byte count

! After (USTRING)
Buffer  USTRING(1000)
Size    LONG
CODE
  Size = SIZE(Buffer)        ! 2000 bytes (1000 × 2)
  ! SIZE() returns a byte count; many wide (W) APIs expect a character count, so check which one the call wants

Migration Considerations

When moving existing ANSI string code to USTRING:

  • Use LEN() for character-based logic, not SIZE()
  • Add U prefix to string literals containing Unicode characters
  • Test with international character sets if your application supports them
  • Be aware of the EVALUATE, MATCH, and STRPOS limitations
  • Review buffer size calculations—you may need double the byte count you used with ANSI

Looking Forward

The USTRING implementation provides a solid foundation for Unicode support while maintaining the Clarion language’s characteristic simplicity. The UTF-16 encoding, dual character set support, and clear LEN/SIZE distinction give you the tools needed for modern, international applications.

Key takeaways:

  • USTRING uses UTF-16 encoding (2 bytes per character in the Basic Multilingual Plane)
  • Automatic conversion between ANSI and Unicode character sets
  • LEN() returns character count; SIZE() returns byte count
  • USTRING(n) reserves n × 2 bytes for characters, plus a 2-byte null terminator
  • Code page awareness ensures proper locale handling
  • Runtime uses string stack for efficient expression evaluation

Thanks for being part of the Clarion community. If you try this out, let us know what you think — and stay tuned, there’s more to come.


Related: Clarion 12 Beta: USTRING Returns ANSI & Unicode