.. _Specification:

=============
Specification
=============

This document defines how this Python wcwidth library measures the printable width of the characters
of a string. It is not meant to be an official standard, but a terse description of the lowest level
API functions :func:`wcwidth.wcwidth` and :func:`wcwidth.wcswidth` and their relation to the higher
level functions :func:`wcwidth.width` and :func:`wcwidth.iter_graphemes`.

Scope
-----

The lowest level functions :func:`wcwidth.wcwidth` and :func:`wcwidth.wcswidth` return -1 when any
control codes are present. The higher level function :func:`wcwidth.width` never returns -1. By
default it uses ``control_codes='parse'``; its behavior and options are described by its docstring
and by the specifications of the related control codes, `XTerm Control Sequences`_ and
`Kitty Text Sizing Protocol`_.

Each string yielded by :func:`wcwidth.iter_graphemes` may be passed to :func:`wcwidth.wcswidth` to
accurately measure the width of a grapheme. Although :func:`wcwidth.iter_graphemes` matches the
behavior of Python 3.15 `unicodedata.iter_graphemes() <https://docs.python.org/3.15/library/unicodedata.html#unicodedata.iter_graphemes>`_,
it differs in its return value: :func:`wcwidth.iter_graphemes` yields only strings, while
:func:`unicodedata.iter_graphemes` yields ``unicodedata.Segment`` class objects.

Width of -1
-----------

The following have a column width of -1 for function :func:`wcwidth.wcwidth`:

- ``C0`` control characters (`U+0001`_ through `U+001F`_).
- ``DEL`` and ``C1`` control characters (`U+007F`_ up to, but not including, `U+00A0`_).

If any character in a sequence is a ``C0`` or ``C1`` control character or ``DEL``, the final
return value of :func:`wcwidth.wcswidth` is -1.
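
The propagation of -1 can be sketched in plain Python. This is only an
illustration, not the library's implementation: both function names are
hypothetical, every non-control character is given a placeholder width of 1,
and the C1 range is assumed to end at U+009F, the last C1 control character.

.. code-block:: python

   def char_width_or_minus_one(ch: str) -> int:
       """Return -1 for C0, DEL and C1 controls, 0 for NULL, else a placeholder 1."""
       cp = ord(ch)
       if cp == 0:
           return 0  # NULL is zero-width, not -1
       # C0 controls U+0001..U+001F, then DEL and C1 controls U+007F..U+009F.
       if cp < 0x20 or 0x7F <= cp < 0xA0:
           return -1
       return 1  # placeholder: real measurement may also yield 0 or 2


   def string_width_or_minus_one(text: str) -> int:
       """Sum per-character widths; a single -1 makes the whole result -1."""
       total = 0
       for ch in text:
           width = char_width_or_minus_one(ch)
           if width < 0:
               return -1
           total += width
       return total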

Width of 0
----------

Any characters with the `Default_Ignorable_Code_Point`_ property in
`DerivedCoreProperties.txt`_ files, 4,174 characters, excluding `U+00AD`_ SOFT HYPHEN
(width 1) and `U+115F`_ HANGUL CHOSEONG FILLER (width 2).

Any characters defined by `General Category`_ codes in `DerivedGeneralCategory.txt`_ files:

- 'Me': `Enclosing Mark`_, approx. 13 characters.
- 'Mn': `Nonspacing Mark`_, approx. 1,839 characters.
- 'Cf': `Format`_ control characters excluding `U+00AD`_ SOFT HYPHEN and
  `Prepended_Concatenation_Mark`_ characters, approx. 147 characters.
- 'Zl': `U+2028`_ LINE SEPARATOR only
- 'Zp': `U+2029`_ PARAGRAPH SEPARATOR only
- 'Sk': `Modifier Symbol`_, approx. 1 character with ``'FULLWIDTH'`` in comment
  of `UnicodeData.txt`_ (see `Width of 2`_). `Emoji Modifier`_ Fitzpatrick
  symbols (`U+1F3FB`_ through `U+1F3FF`_) are zero-width only when following
  an emoji base character in sequence; see `Width of 2`_ for standalone.
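
The category-based part of this rule can be approximated with the standard
library's ``unicodedata`` module. The sketch below is a simplification under
stated assumptions: the helper name is hypothetical, it does not exclude
`Prepended_Concatenation_Mark`_ characters, it ignores the 'Sk' rules, and
its results depend on the Unicode version shipped with the Python
interpreter rather than on this library's tables.

.. code-block:: python

   import unicodedata

   # General Category codes listed above as zero-width.
   ZERO_WIDTH_CATEGORIES = {"Me", "Mn", "Cf", "Zl", "Zp"}
   SOFT_HYPHEN = "\u00ad"  # category Cf, but excluded (width 1)


   def is_zero_width_by_category(ch: str) -> bool:
       """Simplified check; Prepended_Concatenation_Mark and 'Sk' are not handled."""
       if ch == SOFT_HYPHEN:
           return False
       return unicodedata.category(ch) in ZERO_WIDTH_CATEGORIES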

The NULL character (`U+0000`_).

Any character following ZWJ (`U+200D`_) when that ZWJ is preceded by an emoji
(`Extended_Pictographic`_ property) or `Regional Indicator`_, when measured in
sequence by function :func:`wcwidth.wcswidth`, following the grapheme cluster
boundary rules of `Unicode Standard Annex #29`_. When ZWJ follows a non-emoji
character (including CJK), only the ZWJ itself is zero-width; the following
character is measured normally.

The second `Regional Indicator`_ symbol (`U+1F1E6`_ through `U+1F1FF`_) in a
consecutive pair, when measured in sequence by :func:`wcwidth.wcswidth` or
:func:`wcwidth.width`. The first indicator of the pair is `Width of 2`_.
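
A minimal sketch of the pairing rule, assuming an input made up only of
Regional Indicator symbols. The function name is hypothetical and this is
not the library's code; it only shows how consecutive indicators alternate
between width 2 and width 0.

.. code-block:: python

   def regional_indicator_widths(text: str) -> list[int]:
       """Per-character widths for a string of Regional Indicator symbols only."""
       widths = []
       pair_open = False  # True when the previous indicator opened a pair
       for ch in text:
           if not 0x1F1E6 <= ord(ch) <= 0x1F1FF:
               raise ValueError("expected only Regional Indicator symbols")
           if pair_open:
               widths.append(0)  # second indicator of a pair
               pair_open = False
           else:
               widths.append(2)  # first indicator of a pair, or unpaired
               pair_open = True
       return widths

A flag such as ``"\U0001F1FA\U0001F1F8"`` then sums to 2 cells, while a third
trailing indicator starts a new, unpaired cell of width 2.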

`Hangul Jamo`_ Jungseong and "Extended-B" code blocks, `U+1160`_ through
`U+11FF`_ and `U+D7B0`_ through `U+D7FF`_.
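
Expressed as a range check (a sketch with a hypothetical helper name, not
the library's table lookup):

.. code-block:: python

   def is_zero_width_hangul_jamo(ch: str) -> bool:
       """True for U+1160..U+11FF and U+D7B0..U+D7FF."""
       cp = ord(ch)
       return 0x1160 <= cp <= 0x11FF or 0xD7B0 <= cp <= 0xD7FF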

Any characters of category ``Mc`` (`Spacing Combining Mark`_), approx. 443
characters, for the single-character function :func:`wcwidth.wcwidth`.
When measured in sequence by :func:`wcwidth.wcswidth`, see `Width of 2`_.

Width of 1
----------

Characters are measured as width 1 when they are not
measured as `Width of 0`_ or `Width of 2`_.

Width of 2
----------

Any character defined by `East Asian`_ (`Unicode Standard Annex #11`_) Fullwidth
(``F``) or Wide (``W``) properties in `EastAsianWidth.txt`_ files, except those
that are defined by the Category code of `Nonspacing Mark`_ (``Mn``).
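
The East Asian Width rule can be approximated with the standard library. The
helper name below is hypothetical, and ``unicodedata`` reflects the Unicode
version of the running Python interpreter, which may differ from the tables
used by this library.

.. code-block:: python

   import unicodedata


   def is_east_asian_wide(ch: str) -> bool:
       """True for Fullwidth (F) or Wide (W) characters that are not Mn."""
       if unicodedata.category(ch) == "Mn":
           return False  # nonspacing marks stay zero-width
       return unicodedata.east_asian_width(ch) in ("F", "W")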

`Regional Indicator`_ symbols (`U+1F1E6`_ through `U+1F1FF`_). Though
classified as Neutral in `EastAsianWidth.txt`_, terminals universally render
these as double-width. A consecutive pair of Regional Indicators forms a flag
emoji and is measured as width 2 total (first indicator is 2, second is 0).

`Emoji Modifier`_ Fitzpatrick symbols (`U+1F3FB`_ through `U+1F3FF`_) when
measured standalone (not following an emoji base character). When following
an emoji base, they combine with the base and add 0 to total width.

Any characters of `Modifier Symbol`_ category, ``'Sk'`` where ``'FULLWIDTH'`` is
present in comment of `UnicodeData.txt`_, approx. 3 characters.

Any character with `U+FE0F`_ (Variation Selector 16) defined as ``emoji style``
in `emoji-variation-sequences.txt`_, per `UTS #51`_ and `Unicode Standard
Section 23.4`_: VS16 adds 1 cell to the narrow character it directly follows,
making the pair width 2. Wide characters are unchanged.

Any character with `U+FE0E`_ (Variation Selector 15) defined as ``text style``
in `emoji-variation-sequences.txt`_, per `UTS #51`_ and `Unicode Standard
Section 23.4`_: VS15 subtracts 1 cell from the wide character it directly
follows, making the pair width 1. Narrow characters are unchanged.
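
The two variation selector rules mirror each other, which a small sketch can
make explicit. The function name and the ``in_sequence_table`` flag are
hypothetical: the flag stands in for a lookup of the (base, selector) pair in
`emoji-variation-sequences.txt`_ and is not part of this library's API.

.. code-block:: python

   VS15 = "\ufe0e"  # text style
   VS16 = "\ufe0f"  # emoji style


   def adjust_for_variation_selector(base_width: int, selector: str,
                                     in_sequence_table: bool) -> int:
       """Return the total width of a base character plus variation selector."""
       if not in_sequence_table:
           return base_width  # pair is not a defined variation sequence
       if selector == VS16 and base_width == 1:
           return 2  # narrow base widened by emoji style
       if selector == VS15 and base_width == 2:
           return 1  # wide base narrowed by text style
       return base_width  # wide + VS16 and narrow + VS15 are unchanged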

Any character of non-zero width followed by an ``Mc`` (`Spacing Combining Mark`_)
character when measured in sequence by :func:`wcwidth.wcswidth` or
:func:`wcwidth.width`. The ``Mc`` character caps the cluster width at 2,
reflecting its *positive advance width* as defined in `General Category`_
(Table 4-4). Zero-width combining marks (``Mn``) between the base character
and the ``Mc`` do not break the association. For example, a consonant followed
by a Nukta (``Mn``) and then a vowel sign (``Mc``) is measured as a cluster of
width 2.

Any grapheme cluster width is limited to 2 cells since 0.8.0, `PR #224`_.

Virama Conjunct Formation
-------------------------

In `Brahmic scripts`_, `IndicSyllabicCategory.txt`_ defines two categories
that trigger `conjunct`_ formation between consonants: `Virama`_ ("may act
as a Pure_Killer or Invisible_Stacker depending on context") and
`Invisible_Stacker`_ ("not visible by itself; causes conjunct formation
or consonant stacking", the "only as consonant stackers" category
described in the Virama section header).

- A ``Virama`` contributes 0 width.
- Most viramas have category ``Mn``, but six have category ``Mc``
  (`Spacing Combining Mark`_): these are recognised as viramas first,
  not as ``Mc``, so they begin a conjunct rather than capping the cluster.
- A ``Consonant`` immediately following a ``Virama`` adds its width to the
  current grapheme cluster.
- The cluster total is capped at 2 cells since 0.8.0, `PR #224`_.
- ``Mn`` marks do not break conjunct context within the same `aksara`_.
- ZWJ (`U+200D`_) after a virama is consumed without breaking conjunct state,
  supporting explicit half-form requests (virama + ZWJ + consonant).
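
The rules above can be sketched for a single Devanagari cluster. This is an
illustration under loud assumptions, not the library's algorithm: the
function name is hypothetical, only the Devanagari virama (U+094D) is
recognised, every consonant is assumed to be one cell wide, and the input is
assumed to already be one grapheme cluster.

.. code-block:: python

   import unicodedata

   DEVANAGARI_VIRAMA = "\u094d"  # the only virama this sketch knows about
   ZWJ = "\u200d"


   def conjunct_cluster_width(cluster: str) -> int:
       """Width of one cluster of consonants joined by viramas."""
       total = 0
       after_virama = False
       for index, ch in enumerate(cluster):
           if ch == DEVANAGARI_VIRAMA:
               after_virama = True  # the virama itself contributes 0
           elif ch == ZWJ or unicodedata.category(ch) == "Mn":
               continue  # adds 0 and keeps the conjunct state
           elif index == 0 or after_virama:
               total += 1  # assumed narrow consonant
               after_virama = False
           else:
               raise ValueError("not a single conjunct cluster")
       return min(total, 2)  # cluster total is capped at 2 cells

With these assumptions, consonant + virama + consonant measures 2, and a
three-consonant conjunct is capped at 2 rather than measuring 3.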

See also: `L2/2023/23107`_ "Proper Complex Script Support in Text Terminals".

.. _`Hyperlinks in Terminal Emulators`: https://gist.github.com/egmontkob/eb114294efbcd5adb1944c9f3cb5feda
.. _`Kitty Text Sizing Protocol`: https://sw.kovidgoyal.net/kitty/text-sizing-protocol/
.. _`XTerm Control Sequences`: https://invisible-island.net/xterm/ctlseqs/ctlseqs.html
.. _`U+0000`: https://codepoints.net/U+0000
.. _`U+0001`: https://codepoints.net/U+0001
.. _`U+001F`: https://codepoints.net/U+001F
.. _`U+007F`: https://codepoints.net/U+007F
.. _`U+00A0`: https://codepoints.net/U+00A0
.. _`U+00AD`: https://codepoints.net/U+00AD
.. _`U+1160`: https://codepoints.net/U+1160
.. _`U+11FF`: https://codepoints.net/U+11FF
.. _`U+200D`: https://codepoints.net/U+200D
.. _`U+2028`: https://codepoints.net/U+2028
.. _`U+2029`: https://codepoints.net/U+2029
.. _`U+D7B0`: https://codepoints.net/U+D7B0
.. _`U+FE0F`: https://codepoints.net/U+FE0F
.. _`U+FE0E`: https://codepoints.net/U+FE0E
.. _`U+115F`: https://codepoints.net/U+115F
.. _`DerivedGeneralCategory.txt`: https://www.unicode.org/Public/UCD/latest/ucd/extracted/DerivedGeneralCategory.txt
.. _`DerivedCoreProperties.txt`: https://www.unicode.org/Public/UCD/latest/ucd/DerivedCoreProperties.txt
.. _`EastAsianWidth.txt`: https://www.unicode.org/Public/UCD/latest/ucd/EastAsianWidth.txt
.. _`emoji-variation-sequences.txt`: https://www.unicode.org/Public/UCD/latest/ucd/emoji/emoji-variation-sequences.txt
.. _`Prepended_Concatenation_Mark`: https://www.unicode.org/reports/tr44/#Prepended_Concatenation_Mark
.. _`Default_Ignorable_Code_Point`: https://www.unicode.org/reports/tr44/#Default_Ignorable_Code_Point
.. _`General Category`: https://www.unicode.org/reports/tr44/#General_Category
.. _`Spacing Combining Mark`: https://www.unicode.org/versions/latest/core-spec/chapter-4/#G134153
.. _`Enclosing Mark`: https://www.unicode.org/versions/latest/core-spec/chapter-4/#G134153
.. _`Format`: https://www.unicode.org/versions/latest/core-spec/chapter-4/#G134153
.. _`Modifier Symbol`: https://www.unicode.org/versions/latest/core-spec/chapter-4/#G134153
.. _`Hangul Jamo`: https://www.unicode.org/charts/PDF/U1100.pdf
.. _`U+D7FF`: https://codepoints.net/U+D7FF
.. _`UnicodeData.txt`: https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt
.. _`East Asian`: https://www.unicode.org/reports/tr11/
.. _`Unicode Standard Annex #11`: https://www.unicode.org/reports/tr11/
.. _`U+1F1E6`: https://codepoints.net/U+1F1E6
.. _`U+1F1FF`: https://codepoints.net/U+1F1FF
.. _`U+1F3FB`: https://codepoints.net/U+1F3FB
.. _`U+1F3FF`: https://codepoints.net/U+1F3FF
.. _`Regional Indicator`: https://www.unicode.org/charts/PDF/U1F100.pdf
.. _`Emoji Modifier`: https://unicode.org/reports/tr51/#Emoji_Modifiers
.. _`Extended_Pictographic`: https://www.unicode.org/reports/tr51/#def_extended_pictographic
.. _`UTS #51`: https://www.unicode.org/reports/tr51/
.. _`Nonspacing Mark`: https://www.unicode.org/versions/latest/core-spec/chapter-4/#G134153
.. _`IndicSyllabicCategory.txt`: https://www.unicode.org/Public/UCD/latest/ucd/IndicSyllabicCategory.txt
.. _`Indic_Syllabic_Category`: https://www.unicode.org/reports/tr44/#Indic_Syllabic_Category
.. _`Invisible_Stacker`: https://www.unicode.org/Public/UCD/latest/ucd/IndicSyllabicCategory.txt
.. _`Brahmic scripts`: https://en.wikipedia.org/wiki/Brahmic_scripts
.. _`Virama`: https://www.unicode.org/glossary/#virama
.. _`conjunct`: https://www.unicode.org/glossary/#consonant_conjunct
.. _`aksara`: https://www.unicode.org/glossary/#aksara
.. _`L2/2023/23107`: https://www.unicode.org/L2/L2023/23107-terminal-suppt.pdf
.. _`Unicode Standard Annex #29`: https://www.unicode.org/reports/tr29/
.. _`Unicode Standard Section 23.4`: https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-23/#G77993
.. _`uncodedata.iter_graphemes()`: https://docs.python.org/3.15/library/unicodedata.html#unicodedata.iter_graphemes
.. _`PR #224`: https://github.com/jquast/wcwidth/pull/224