Specification¶
This document defines how this Python wcwidth library measures the printable width of the characters of
a string. It is not meant to be an official standard, but a terse description of the lowest-level
API functions wcwidth.wcwidth() and wcwidth.wcswidth() and their relation to the higher-level
functions wcwidth.width() and wcwidth.iter_graphemes().
Scope¶
The lowest-level functions wcwidth.wcwidth() and wcwidth.wcswidth() return -1 when any
control codes are present. The higher-level function wcwidth.width() never returns -1 when
called with its default arguments (control_codes='parse'). Its behavior and options are described by
its docstring and by the specifications of the related control codes, XTerm Control Sequences and Kitty
Text Sizing Protocol.
Each string yielded by wcwidth.iter_graphemes() may be passed to wcwidth.wcswidth() to
accurately measure the width of a grapheme. Although wcwidth.iter_graphemes() matches the
behavior of Python 3.15 unicodedata.iter_graphemes(), it differs in its return value:
wcwidth.iter_graphemes() yields only strings, while unicodedata.iter_graphemes() yields
unicodedata.Segment class objects.
Width of -1¶
C0 and C1 control characters have a column width of -1 for function wcwidth.wcwidth(), except the NULL character (see Width of 0).
If any character in a sequence is a C0 or C1 control character, the final
return value of wcwidth.wcswidth() is -1.
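The -1 rule can be illustrated with the standard library alone. The following is a minimal sketch, not the library's implementation: the names `sketch_is_control` and `sketch_wcswidth` are hypothetical, General Category `Cc` is used as a stand-in for "C0 or C1 control" (an assumption that also counts U+007F DELETE as a control), and every non-control character is counted as width 1 purely as a placeholder.

```python
import unicodedata


def sketch_is_control(ch):
    """True for a control character other than NULL.

    Assumption: General Category 'Cc' approximates 'C0 or C1 control'.
    It also covers U+007F DELETE. NULL is excluded because this
    specification lists it under Width of 0.
    """
    return ch != "\x00" and unicodedata.category(ch) == "Cc"


def sketch_wcswidth(text):
    """Model only the -1 rule of a wcswidth-style function.

    Placeholder: each non-control character counts as 1 and NULL as 0.
    The real width rules are far richer than this.
    """
    if any(sketch_is_control(ch) for ch in text):
        return -1
    return sum(0 if ch == "\x00" else 1 for ch in text)


print(sketch_wcswidth("abc"))      # 3
print(sketch_wcswidth("a\x1bb"))   # -1, ESC is a C0 control
print(sketch_wcswidth("a\x85"))    # -1, NEL is a C1 control
print(sketch_wcswidth("a\x00"))    # 1, NULL is width 0, not -1
```

A single control character anywhere in the string is enough to make the whole result -1, which is why callers that expect terminal escape sequences would normally prefer the higher-level function described under Scope.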
Width of 0¶
Any characters with the Default_Ignorable_Code_Point property in DerivedCoreProperties.txt files, 4,174 characters, excluding U+00AD SOFT HYPHEN (width 1) and U+115F HANGUL CHOSEONG FILLER (width 2).
Any characters defined by General Category codes in DerivedGeneralCategory.txt files:
‘Me’: Enclosing Mark, approx. 13 characters.
‘Mn’: Nonspacing Mark, approx. 1,839 characters.
‘Cf’: Format control characters excluding U+00AD SOFT HYPHEN and Prepended_Concatenation_Mark characters, approx. 147 characters.
‘Zl’: U+2028 LINE SEPARATOR only
‘Zp’: U+2029 PARAGRAPH SEPARATOR only
‘Sk’: Modifier Symbol, approx. 1 character with 'FULLWIDTH' in comment of UnicodeData.txt (see Width of 2). Emoji Modifier Fitzpatrick symbols (U+1F3FB through U+1F3FF) are zero-width only when following an emoji base character in sequence; see Width of 2 for standalone.
The NULL character (U+0000).
Any character following a ZWJ (U+200D) that is itself preceded by an emoji
(Extended_Pictographic property) or Regional Indicator, when measured in sequence by
function wcwidth.wcswidth(), following the grapheme cluster boundary rules
of Unicode Standard Annex #29. When ZWJ follows a non-emoji character
(including CJK), only the ZWJ itself is zero-width; the following character
is measured normally.
The second Regional Indicator symbol (U+1F1E6 through U+1F1FF) in a
consecutive pair, when measured in sequence by wcwidth.wcswidth() or
wcwidth.width(). The first indicator of the pair is Width of 2.
Hangul Jamo Jungseong and “Extended-B” code blocks, U+1160 through U+11FF and U+D7B0 through U+D7FF.
Any characters of category Mc (Spacing Combining Mark), approx. 443
characters, for the single-character function wcwidth.wcwidth().
When measured in sequence by wcwidth.wcswidth(), see Width of 2.
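Several of the zero-width rules above depend only on General Category and fixed code point ranges, so they can be approximated with the standard `unicodedata` module. This is a partial sketch under stated assumptions, not the library's lookup tables: the name `sketch_is_zero_width` is hypothetical, and the sketch deliberately omits the Default_Ignorable_Code_Point property, the Prepended_Concatenation_Mark exclusion, the Sk and Mc rules, and all sequence-dependent rules (ZWJ, Regional Indicator pairs), none of which `unicodedata` exposes directly.

```python
import unicodedata

# General Categories treated as zero-width in this simplified sketch.
ZERO_WIDTH_CATEGORIES = {"Me", "Mn", "Cf", "Zl", "Zp"}


def sketch_is_zero_width(ch):
    """Approximate the single-character Width of 0 rules.

    Assumption: only category- and range-based rules are modelled.
    """
    if ch == "\x00":
        return True   # the NULL character
    if ch == "\u00ad":
        return False  # SOFT HYPHEN is Cf but measured as width 1
    if "\u1160" <= ch <= "\u11ff" or "\ud7b0" <= ch <= "\ud7ff":
        return True   # Hangul Jamo ranges named in the list above
    return unicodedata.category(ch) in ZERO_WIDTH_CATEGORIES


print(sketch_is_zero_width("\u0301"))  # True, COMBINING ACUTE ACCENT (Mn)
print(sketch_is_zero_width("\u200b"))  # True, ZERO WIDTH SPACE (Cf)
print(sketch_is_zero_width("\u00ad"))  # False, SOFT HYPHEN
print(sketch_is_zero_width("a"))       # False
```

Because the result depends on the Unicode version bundled with the running Python interpreter, the character counts quoted in the list will not necessarily match what this sketch classifies.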
Width of 1¶
Characters are measured as width 1 when they are not measured as Width of 0 or Width of 2.
Width of 2¶
Any character defined by East Asian (Unicode Standard Annex #11) Fullwidth
(F) or Wide (W) properties in EastAsianWidth.txt files, except those
that are defined by the Category code of Nonspacing Mark (Mn).
Regional Indicator symbols (U+1F1E6 through U+1F1FF). Though classified as Neutral in EastAsianWidth.txt, terminals universally render these as double-width. A consecutive pair of Regional Indicators forms a flag emoji and is measured as width 2 total (first indicator is 2, second is 0).
Emoji Modifier Fitzpatrick symbols (U+1F3FB through U+1F3FF) when measured standalone (not following an emoji base character). When following an emoji base, they combine with the base and add 0 to total width.
Any characters of Modifier Symbol category, 'Sk', where 'FULLWIDTH' is
present in comment of UnicodeData.txt, approx. 3 characters.
Any character with U+FE0F (Variation Selector 16) defined as emoji style
in emoji-variation-sequences.txt, per UTS #51 and Unicode Standard
Section 23.4: VS16 adds 1 cell to the narrow character it directly follows,
making the pair width 2. Wide characters are unchanged.
Any character with U+FE0E (Variation Selector 15) defined as text style
in emoji-variation-sequences.txt, per UTS #51 and Unicode Standard
Section 23.4: VS15 subtracts 1 cell from the wide character it directly
follows, making the pair width 1. Narrow characters are unchanged.
Any character of non-zero width followed by an Mc (Spacing Combining Mark)
character when measured in sequence by wcwidth.wcswidth() or
wcwidth.width(). The Mc character caps the cluster width at 2,
reflecting its positive advance width as defined in General Category
(Table 4-4). Zero-width combining marks (Mn) between the base character
and the Mc do not break the association. For example, a consonant followed
by a Nukta (Mn) and then a vowel sign (Mc) is measured as a cluster of
width 2.
Any grapheme cluster width is limited to 2 cells since 0.8.0, PR #224.
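Two of the rules above lend themselves to a short standard-library illustration: the East Asian Fullwidth/Wide rule with its Nonspacing Mark exception, and the pairing of Regional Indicators. The sketch below is an assumption-laden simplification rather than the library's code: `sketch_is_wide`, `is_regional_indicator`, and `sketch_flag_width` are hypothetical names, and the variation selector, Fitzpatrick modifier, Sk, and Mc rules are not modelled.

```python
import unicodedata


def sketch_is_wide(ch):
    """East Asian Fullwidth (F) or Wide (W), unless Nonspacing Mark (Mn)."""
    return (
        unicodedata.east_asian_width(ch) in ("F", "W")
        and unicodedata.category(ch) != "Mn"
    )


def is_regional_indicator(ch):
    """True for U+1F1E6 through U+1F1FF."""
    return "\U0001F1E6" <= ch <= "\U0001F1FF"


def sketch_flag_width(text):
    """Sum only the cells contributed by Regional Indicators.

    The first indicator of a consecutive pair counts 2 and the
    second counts 0. All other characters are ignored here.
    """
    total = 0
    awaiting_second = False  # True right after an unpaired first indicator
    for ch in text:
        if not is_regional_indicator(ch):
            awaiting_second = False
        elif awaiting_second:
            awaiting_second = False  # second of the pair adds 0
        else:
            total += 2
            awaiting_second = True
    return total


print(sketch_is_wide("\u4e2d"))   # True, CJK ideograph
print(sketch_is_wide("\u302a"))   # False, Mn is excluded
print(sketch_flag_width("\U0001F1FA\U0001F1F8"))            # 2, one flag
print(sketch_flag_width("\U0001F1FA\U0001F1F8\U0001F1FA"))  # 4, flag plus a lone indicator
```

Tracking a single boolean is enough for the pairing rule because indicators pair strictly left to right: a third consecutive indicator starts a new pair rather than extending the previous one.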
Virama Conjunct Formation¶
In Brahmic scripts, IndicSyllabicCategory.txt defines two categories that trigger conjunct formation between consonants: Virama (“may act as a Pure_Killer or Invisible_Stacker depending on context”) and Invisible_Stacker (“not visible by itself; causes conjunct formation or consonant stacking”, the “only as consonant stackers” category described in the Virama section header).
A Virama contributes 0 width. Most viramas have category Mn, but six have category Mc (Spacing Combining Mark): these are recognised as viramas first, not as Mc, so they begin a conjunct rather than capping the cluster.
A Consonant immediately following a Virama adds its width to the current grapheme cluster. The cluster total is capped at 2 cells since 0.8.0, PR #224.
Mn marks do not break conjunct context within the same aksara.
ZWJ (U+200D) after a virama is consumed without breaking conjunct state, supporting explicit half-form requests (virama + ZWJ + consonant).
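The conjunct rules above amount to a small state machine, sketched here for illustration only. This is not the library's implementation: `sketch_conjunct_width` is a hypothetical name, `VIRAMAS` is a one-entry stand-in (U+094D DEVANAGARI SIGN VIRAMA) for data that would be derived from IndicSyllabicCategory.txt, and every character that is not a virama, ZWJ, or Mn mark is assumed to be a width-1 consonant.

```python
import unicodedata

# Hypothetical one-entry table; a real table would come from
# IndicSyllabicCategory.txt.
VIRAMAS = {"\u094d"}  # DEVANAGARI SIGN VIRAMA
ZWJ = "\u200d"
CLUSTER_CAP = 2


def sketch_conjunct_width(text):
    """Measure text using only the virama conjunct rules.

    Assumption: any character that is not a virama, ZWJ, or Mn mark
    is a consonant of width 1.
    """
    total = 0             # width of clusters already closed
    cluster = 0           # width of the cluster being built
    after_virama = False  # conjunct state
    for ch in text:
        if ch in VIRAMAS:
            after_virama = True  # contributes 0 width
        elif ch == ZWJ or unicodedata.category(ch) == "Mn":
            continue             # 0 width, conjunct state is kept
        elif after_virama:
            # The consonant joins the current cluster, capped at 2 cells.
            cluster = min(cluster + 1, CLUSTER_CAP)
            after_virama = False
        else:
            # No conjunct context, so this consonant starts a new cluster.
            total += cluster
            cluster = 1
    return total + cluster


print(sketch_conjunct_width("\u0915\u094d\u0937"))        # 2, KA + virama + SSA
print(sketch_conjunct_width("\u0915\u094d\u200d\u0937"))  # 2, with explicit ZWJ
print(sketch_conjunct_width("\u0915\u094d\u0937\u094d\u0915"))  # 2, capped
```

Checking for a virama before checking the general category mirrors the ordering described above, where a character is recognised as a virama first and only otherwise treated by its category.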
See also: L2/2023/23107 “Proper Complex Script Support in Text Terminals”.