Specification¶

This document defines how this Python wcwidth library measures the printable width of the characters of a string. It is not meant to be an official standard, but a terse description of the lowest level API functions wcwidth.wcwidth() and wcwidth.wcswidth() and their relation to the higher level functions wcwidth.width() and wcwidth.iter_graphemes().

Scope¶

The lowest level functions wcwidth.wcwidth() and wcwidth.wcswidth() return -1 when any control codes are present. The higher level function wcwidth.width() never returns -1 when called with its default arguments (control_codes='parse'). Its behavior and options are described by its docstring and by the specifications of the related control codes, XTerm Control Sequences and Kitty Text Sizing Protocol.

Each string yielded by wcwidth.iter_graphemes() may be passed to wcwidth.wcswidth() to accurately measure the width of a grapheme. Although wcwidth.iter_graphemes() matches the behavior of Python 3.15 unicodedata.iter_graphemes(), it differs in its return value: wcwidth.iter_graphemes() yields only strings, while unicodedata.iter_graphemes() yields unicodedata.Segment class objects.

Width of -1¶

The following have a column width of -1 for function wcwidth.wcwidth():

C0 control characters (U+0001 through U+001F).
C1 control characters and DEL (U+007F through U+009F).

If any character in the sequence is a C0 or C1 control character, the final return value of wcwidth.wcswidth() is -1.
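The two rules above can be sketched with the standard library alone. This is a minimal illustration, not the library's implementation: the helper names is_control and contains_control are hypothetical, and the C1 range is assumed to end at U+009F, which is where the Unicode 'Cc' category ends.

```python
import unicodedata


def is_control(ch: str) -> bool:
    """True for C0 (U+0001..U+001F), DEL (U+007F) and C1 (U+0080..U+009F)."""
    cp = ord(ch)
    return 0x01 <= cp <= 0x1F or 0x7F <= cp <= 0x9F


def contains_control(text: str) -> bool:
    """True when a sequence would measure as -1 under the rule above."""
    return any(is_control(ch) for ch in text)


# Cross-check against the Unicode 'Cc' category. NUL is also 'Cc', but it is
# listed under Width of 0 rather than Width of -1, so it is excluded here.
for cp in range(0x100):
    ch = chr(cp)
    expected = unicodedata.category(ch) == "Cc" and cp != 0
    assert is_control(ch) == expected
```

A string containing an escape sequence, such as "abc\x1b[0m", is flagged by contains_control() because of the single ESC (U+001B) character.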

Width of 0¶

Any characters with the Default_Ignorable_Code_Point property in DerivedCoreProperties.txt files, 4,174 characters, excluding U+00AD SOFT HYPHEN (width 1) and U+115F HANGUL CHOSEONG FILLER (width 2).

Any characters defined by General Category codes in DerivedGeneralCategory.txt files:

‘Me’: Enclosing Mark, approx. 13 characters.
‘Mn’: Nonspacing Mark, approx. 1,839 characters.
‘Cf’: Format control characters excluding U+00AD SOFT HYPHEN and Prepended_Concatenation_Mark characters, approx. 147 characters.
‘Zl’: U+2028 LINE SEPARATOR only
‘Zp’: U+2029 PARAGRAPH SEPARATOR only
‘Sk’: Modifier Symbol, approx. 1 character with 'FULLWIDTH' in comment of UnicodeData.txt (see Width of 2). Emoji Modifier Fitzpatrick symbols (U+1F3FB through U+1F3FF) are zero-width only when following an emoji base character in sequence; see Width of 2 for standalone.
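The General Category part of this list can be approximated with the standard library unicodedata module. The sketch below is an assumption-laden simplification: zero_width_by_category is a hypothetical helper, it does not model the Prepended_Concatenation_Mark exclusion, the 'Sk' rules, or the Default_Ignorable_Code_Point property, and its results depend on the Unicode version bundled with the running Python.

```python
import unicodedata

SOFT_HYPHEN = "\u00AD"
# Categories that are zero-width without further conditions in this sketch.
ZERO_WIDTH_CATEGORIES = {"Me", "Mn", "Zl", "Zp"}


def zero_width_by_category(ch: str) -> bool:
    """Approximate the category-based zero-width rules for one character."""
    category = unicodedata.category(ch)
    if category in ZERO_WIDTH_CATEGORIES:
        return True
    # 'Cf' is zero-width except SOFT HYPHEN. The Prepended_Concatenation_Mark
    # exclusion is NOT modelled, because the stdlib does not expose it.
    return category == "Cf" and ch != SOFT_HYPHEN
```

For example, U+0301 COMBINING ACUTE ACCENT ('Mn') and U+200B ZERO WIDTH SPACE ('Cf') are reported as zero-width, while U+00AD SOFT HYPHEN is not.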

The NULL character (U+0000).

Any character following ZWJ (U+200D) when the ZWJ is itself preceded by an emoji (Extended_Pictographic property) or Regional Indicator, when measured in sequence by wcwidth.wcswidth(), following the grapheme cluster boundary rules of Unicode Standard Annex #29. When ZWJ follows a non-emoji character (including CJK), only the ZWJ itself is zero-width; the following character is measured normally.

The second Regional Indicator symbol (U+1F1E6 through U+1F1FF) in a consecutive pair, when measured in sequence by wcwidth.wcswidth() or wcwidth.width(). The first indicator of the pair is Width of 2.

Hangul Jamo Jungseong and “Extended-B” code blocks, U+1160 through U+11FF and U+D7B0 through U+D7FF.

Any characters of category Mc (Spacing Combining Mark), approx. 443 characters, for the single-character function wcwidth.wcwidth(). When measured in sequence by wcwidth.wcswidth(), see Width of 2.

Width of 1¶

Characters are measured as width 1 when they are not measured as Width of -1, Width of 0 or Width of 2.

Width of 2¶

Any character defined by East Asian (Unicode Standard Annex #11) Fullwidth (F) or Wide (W) properties in EastAsianWidth.txt files, except those that are defined by the Category code of Nonspacing Mark (Mn).
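As a rough illustration of this rule, the East Asian Width property and the General Category are both available from the standard library. The helper name is_east_asian_wide is hypothetical, and the result depends on the Unicode version bundled with Python rather than on the EastAsianWidth.txt file used by the library.

```python
import unicodedata


def is_east_asian_wide(ch: str) -> bool:
    """True for Fullwidth (F) or Wide (W) characters that are not Mn marks."""
    if unicodedata.category(ch) == "Mn":
        # e.g. U+3099 COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK is listed
        # as Wide, but as a Nonspacing Mark it is excluded from width 2.
        return False
    return unicodedata.east_asian_width(ch) in ("F", "W")
```

Under this sketch, U+30B3 KATAKANA LETTER KO and U+FF21 FULLWIDTH LATIN CAPITAL LETTER A qualify for width 2, while "a" and U+3099 do not.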

Regional Indicator symbols (U+1F1E6 through U+1F1FF). Though classified as Neutral in EastAsianWidth.txt, terminals universally render these as double-width. A consecutive pair of Regional Indicators forms a flag emoji and is measured as width 2 total (first indicator is 2, second is 0).
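The pairing rule can be sketched as a small state machine. This is an illustration under stated assumptions, not the library's code: regional_indicator_widths is a hypothetical helper that only accepts strings made entirely of Regional Indicators, and it assumes an unpaired trailing indicator is measured standalone as width 2.

```python
def is_regional_indicator(ch: str) -> bool:
    """True for U+1F1E6 through U+1F1FF."""
    return 0x1F1E6 <= ord(ch) <= 0x1F1FF


def regional_indicator_widths(text: str) -> list:
    """Per-character widths: first indicator of each pair is 2, second is 0."""
    widths = []
    pair_open = False  # True when the previous indicator started a pair
    for ch in text:
        if not is_regional_indicator(ch):
            raise ValueError("this sketch only handles Regional Indicators")
        widths.append(0 if pair_open else 2)
        pair_open = not pair_open
    return widths
```

For the two indicators U+1F1FA U+1F1F8 this yields [2, 0], a total of 2 cells for the flag. Three consecutive indicators yield [2, 0, 2].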

Emoji Modifier Fitzpatrick symbols (U+1F3FB through U+1F3FF) when measured standalone (not following an emoji base character). When following an emoji base, they combine with the base and add 0 to total width.

Any characters of Modifier Symbol category, 'Sk' where 'FULLWIDTH' is present in comment of UnicodeData.txt, approx. 3 characters.

Any character with U+FE0F (Variation Selector 16) defined as emoji style in emoji-variation-sequences.txt, per UTS #51 and Unicode Standard Section 23.4: VS16 adds 1 cell to the narrow character it directly follows, making the pair width 2. Wide characters are unchanged.

Any character with U+FE0E (Variation Selector 15) defined as text style in emoji-variation-sequences.txt, per UTS #51 and Unicode Standard Section 23.4: VS15 subtracts 1 cell from the wide character it directly follows, making the pair width 1. Narrow characters are unchanged.
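The arithmetic of the two variation selector rules can be written down as a small function. This is only a sketch: apply_variation_selector and its in_table parameter are hypothetical, with in_table standing in for the lookup in emoji-variation-sequences.txt, which is not reproduced here.

```python
VS15 = "\uFE0E"  # text style selector
VS16 = "\uFE0F"  # emoji style selector


def apply_variation_selector(base_width: int, selector: str,
                             in_table: bool = True) -> int:
    """Return the width of a base character plus selector pair.

    base_width is the width of the base character measured alone.
    in_table says whether the sequence is defined in
    emoji-variation-sequences.txt (assumed to be known by the caller).
    """
    if not in_table:
        return base_width
    if selector == VS16 and base_width == 1:
        return 2  # VS16 adds 1 cell to a narrow character
    if selector == VS15 and base_width == 2:
        return 1  # VS15 subtracts 1 cell from a wide character
    return base_width  # wide + VS16 and narrow + VS15 are unchanged
```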

Any character of non-zero width followed by an Mc (Spacing Combining Mark) character when measured in sequence by wcwidth.wcswidth() or wcwidth.width(). The Mc character caps the cluster width at 2, reflecting its positive advance width as defined in General Category (Table 4-4). Zero-width combining marks (Mn) between the base character and the Mc do not break the association. For example, a consonant followed by a Nukta (Mn) and then a vowel sign (Mc) is measured as a cluster of width 2.

Since 0.8.0 (PR #224), the width of any grapheme cluster is limited to 2 cells.
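The Mc rule and the 2-cell limit can be sketched for a single, already segmented grapheme cluster. The helper sketch_cluster_width is hypothetical and deliberately incomplete: it ignores viramas, ZWJ, variation selectors and emoji rules, and it relies on the Unicode data bundled with Python.

```python
import unicodedata


def sketch_cluster_width(cluster: str) -> int:
    """Approximate the width of one grapheme cluster (Mc rule and cap only)."""
    width = 0
    for ch in cluster:
        category = unicodedata.category(ch)
        if category in ("Mn", "Me"):
            continue  # zero-width marks do not break the association
        if category == "Mc":
            if width > 0:
                width = 2  # Mc after a non-zero-width base: cluster is 2
            continue  # an Mc with no base before it contributes 0
        wide = unicodedata.east_asian_width(ch) in ("F", "W")
        width += 2 if wide else 1
    return min(width, 2)  # no cluster is wider than 2 cells
```

With Devanagari KA (U+0915), Nukta (U+093C, Mn) and vowel sign AA (U+093E, Mc), both KA + AA and KA + Nukta + AA measure 2 in this sketch, while KA alone measures 1.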

Virama Conjunct Formation¶

In Brahmic scripts, IndicSyllabicCategory.txt defines two categories that trigger conjunct formation between consonants: Virama (“may act as a Pure_Killer or Invisible_Stacker depending on context”) and Invisible_Stacker (“not visible by itself; causes conjunct formation or consonant stacking”, the “only as consonant stackers” category described in the Virama section header).

A Virama contributes 0 width.
Most viramas have category Mn, but six have category Mc (Spacing Combining Mark): these are recognised as viramas first, not as Mc, so they begin a conjunct rather than capping the cluster.
A Consonant immediately following a Virama adds its width to the current grapheme cluster.
The cluster total is capped at 2 cells since 0.8.0, PR #224.
Mn marks do not break conjunct context within the same aksara.
ZWJ (U+200D) after a virama is consumed without breaking conjunct state, supporting explicit half-form requests (virama + ZWJ + consonant).
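The rules above can be followed through a single aksara with a short sketch. It is an illustration under loud assumptions, not the library's algorithm: sketch_conjunct_width is a hypothetical helper, VIRAMAS holds only the Devanagari virama (U+094D) instead of the full IndicSyllabicCategory.txt data, every consonant is assumed to be narrow (width 1), and Mc marks are not handled.

```python
import unicodedata

VIRAMAS = {"\u094D"}  # hypothetical subset: DEVANAGARI SIGN VIRAMA only
ZWJ = "\u200D"


def sketch_conjunct_width(aksara: str) -> int:
    """Approximate the width of one aksara built from virama conjuncts."""
    total = 0
    after_virama = False
    for ch in aksara:
        if ch in VIRAMAS:
            after_virama = True  # a virama contributes 0 width
        elif ch == ZWJ and after_virama:
            pass  # consumed without breaking conjunct state
        elif unicodedata.category(ch) == "Mn":
            pass  # Mn marks are zero-width and keep the conjunct context
        elif total == 0 or after_virama:
            total += 1  # base consonant, or consonant joined by a virama
            after_virama = False
        else:
            break  # a consonant without a preceding virama starts a new cluster
    return min(total, 2)  # the cluster total is capped at 2 cells
```

KA + virama + SSA (U+0915 U+094D U+0937) measures 2 here, and a three-consonant conjunct such as U+0915 U+094D U+0937 U+094D U+092E is capped at 2.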

See also: L2/2023/23107 “Proper Complex Script Support in Text Terminals”.