PEP 540 – Add a new UTF-8 Mode
- PEP: 540
- Title: Add a new UTF-8 Mode
- Author: Victor Stinner <vstinner at python.org>
- BDFL-Delegate: INADA Naoki
- Status: Final
- Type: Standards Track
- Created: 05-Jan-2016
- Python-Version: 3.7
- Resolution: Python-Dev
Abstract
Add a new “UTF-8 Mode” to enhance Python’s use of UTF-8. When UTF-8 Mode is active, Python will:
- use the utf-8 encoding, regardless of the locale currently set by the current platform, and
- change the stdin and stdout error handlers to surrogateescape.
This mode is off by default, but is automatically activated when using the “POSIX” locale.
Add the -X utf8 command line option and PYTHONUTF8 environment variable to control UTF-8 Mode.
Rationale
Locale encoding and UTF-8
Python 3.6 uses the locale encoding for filenames, environment variables, standard streams, etc. The locale encoding is inherited from the locale; the encoding and the locale are tightly coupled.
Many users inherit the ASCII encoding from the POSIX locale, aka the “C” locale, but are unable to change the locale for various reasons. This encoding is very limited in terms of Unicode support: any non-ASCII character is likely to cause trouble.
It isn’t always easy to get an accurate locale. Locales don’t get the exact same name on different Linux distributions, FreeBSD, macOS, etc. And some locales, like the recent C.UTF-8 locale, are only supported by a few platforms. The current locale can even vary on the same platform depending on context; for example, an SSH connection can use a different encoding than the filesystem or local terminal encoding on the same machine.
On the flip side, Python 3.6 is already using UTF-8 by default on macOS, Android and Windows (PEP 529) for most functions, although open() is a notable exception here. UTF-8 is also the default encoding of Python scripts, XML and JSON file formats. The Go programming language uses UTF-8 for all strings.
UTF-8 support is nearly ubiquitous for data read and written by modern platforms. It also has excellent support in Python. The problem is simply that the locale is frequently misconfigured. An obvious solution suggests itself: ignore the locale encoding and use UTF-8.
Passthrough for undecodable bytes: surrogateescape
When decoding bytes from UTF-8 using the default strict error handler, Python 3 raises a UnicodeDecodeError on the first undecodable byte.
Unix command line tools like cat or grep and most Python 2 applications simply do not have this class of bugs: they don’t decode data, but process data as a raw bytes sequence.
Python 3 already has a solution to behave like Unix tools and Python 2: the surrogateescape error handler (PEP 383). It allows processing data as if it were bytes, but uses Unicode in practice; undecodable bytes are stored as surrogate characters.
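As a minimal sketch of the difference between the two error handlers, the snippet below decodes a byte string that is not valid UTF-8, first with strict and then with surrogateescape. The sample bytes and the decode_strict helper are illustrative assumptions, not part of the PEP.

```python
# Latin-1 encoded "café": the trailing byte 0xE9 is not valid UTF-8 here.
raw = b"caf\xe9"


def decode_strict(data: bytes):
    """Return the decoded text, or None if strict decoding fails."""
    try:
        return data.decode("utf-8")  # errors="strict" is the default
    except UnicodeDecodeError:
        return None


strict_result = decode_strict(raw)

# surrogateescape maps the undecodable byte 0xE9 to the lone surrogate U+DCE9.
escaped = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_trip = escaped.encode("utf-8", errors="surrogateescape")
```

Because the round trip is lossless, text decoded this way can be passed through a program and written back unchanged, even when the input was never valid UTF-8.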
UTF-8 Mode sets the surrogateescape error handler for stdin and stdout, since these streams are commonly associated with Unix command line tools.
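The behaviour this gives a cat-like tool can be approximated without touching the real standard streams. The sketch below wraps in-memory byte buffers in io.TextIOWrapper objects configured the way UTF-8 Mode configures stdin and stdout; the passthrough function is a hypothetical helper used only for illustration.

```python
import io


def passthrough(data: bytes) -> bytes:
    """Copy data line by line through text streams, like a minimal cat."""
    fake_stdin = io.TextIOWrapper(
        io.BytesIO(data), encoding="utf-8", errors="surrogateescape", newline=""
    )
    sink = io.BytesIO()
    fake_stdout = io.TextIOWrapper(
        sink, encoding="utf-8", errors="surrogateescape", newline=""
    )
    for line in fake_stdin:  # each line is a str, possibly with lone surrogates
        fake_stdout.write(line)
    fake_stdout.flush()
    return sink.getvalue()


# The second line contains bytes that are invalid in UTF-8.
sample = b"ok\n\xff\xfe bad\n"
copied = passthrough(sample)
```

Under these assumptions the invalid bytes come out exactly as they went in, which is what a user of a Unix pipeline would expect.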
However, users have a different expectation on files. Files are expected to be properly encoded, and Python is expected to fail early when open() is called with the wrong options, like opening a JPEG picture in text mode. The open() default error handler remains strict for these reasons.
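A small sketch of that failure mode, assuming a temporary file that holds only the first four bytes of a JPEG header (the file and the helper names are made up for the example). Note that in practice the error surfaces when the data is read, not at the open() call itself.

```python
import os
import tempfile

# Typical first bytes of a JPEG file; 0xFF never appears in valid UTF-8.
JPEG_MAGIC = b"\xff\xd8\xff\xe0"


def read_as_text(path: str) -> str:
    # The error handler defaults to "strict", in UTF-8 Mode as well.
    with open(path, encoding="utf-8") as f:
        return f.read()


def binary_file_is_rejected() -> bool:
    """Return True if reading binary data as text raised UnicodeDecodeError."""
    fd, path = tempfile.mkstemp(suffix=".jpg")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(JPEG_MAGIC)
        try:
            read_as_text(path)
        except UnicodeDecodeError:
            return True
        return False
    finally:
        os.remove(path)
```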
No change by default for best backward compatibility
While UTF-8 is perfect in most cases, sometimes the locale encoding is actually the best encoding.
This PEP changes the behaviour for the POSIX locale since this locale is usually equivalent to the ASCII encoding, whereas UTF-8 is a much better choice. It does not change the behaviour for other locales, to avoid any risk of regression.
Since users must explicitly enable the new UTF-8 Mode for these other locales, they are responsible for any potential mojibake issues caused by UTF-8 Mode.
Proposal
Add a new UTF-8 Mode to use the UTF-8 encoding, ignore the locale encoding, and change stdin and stdout error handlers to surrogateescape.
Add the new -X utf8 command line option and PYTHONUTF8 environment variable. Users can explicitly activate UTF-8 Mode with the command-line option -X utf8 or by setting the environment variable PYTHONUTF8=1.
This mode is disabled by default and enabled by the POSIX locale. Users can explicitly disable UTF-8 Mode with the command-line option -X utf8=0 or by setting the environment variable PYTHONUTF8=0.
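One way to observe these switches is to start a child interpreter and read sys.flags.utf8_mode, the flag that Python 3.7 and later expose for this mode. The sketch below assumes such an interpreter is available as sys.executable; utf8_mode_flag is a hypothetical helper, not an API defined by this PEP.

```python
import os
import subprocess
import sys


def utf8_mode_flag(extra_args=(), env_overrides=None) -> int:
    """Run a child interpreter and return its sys.flags.utf8_mode value."""
    env = dict(os.environ)
    env.pop("PYTHONUTF8", None)  # start from a known state
    env.update(env_overrides or {})
    cmd = [sys.executable, *extra_args, "-c",
           "import sys; print(sys.flags.utf8_mode)"]
    result = subprocess.run(cmd, env=env, capture_output=True,
                            text=True, check=True)
    return int(result.stdout.strip())


enabled_by_option = utf8_mode_flag(["-X", "utf8"])
disabled_by_option = utf8_mode_flag(["-X", "utf8=0"])
enabled_by_env = utf8_mode_flag(env_overrides={"PYTHONUTF8": "1"})
disabled_by_env = utf8_mode_flag(env_overrides={"PYTHONUTF8": "0"})
```

The helper removes PYTHONUTF8 from the inherited environment first so that each call tests only the switch passed to it.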
For standard streams, the PYTHONIOENCODING environment variable has priority over UTF-8 Mode.
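This priority rule can be checked with a child interpreter that is started with -X utf8 while PYTHONIOENCODING requests a different encoding. In the sketch below, child_stdout_encoding is a hypothetical helper and latin-1 is an arbitrary choice; the result is normalised with codecs.lookup() because the exact spelling reported by sys.stdout.encoding may vary between versions.

```python
import codecs
import os
import subprocess
import sys


def child_stdout_encoding(io_encoding: str) -> str:
    """Return the normalised stdout encoding of a child run with -X utf8."""
    env = dict(os.environ)
    env.pop("PYTHONUTF8", None)
    env["PYTHONIOENCODING"] = io_encoding
    cmd = [sys.executable, "-X", "utf8", "-c",
           "import sys; print(sys.stdout.encoding)"]
    result = subprocess.run(cmd, env=env, capture_output=True,
                            text=True, check=True)
    return codecs.lookup(result.stdout.strip()).name


reported = child_stdout_encoding("latin-1")
expected = codecs.lookup("latin-1").name
```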
On Windows, the PYTHONLEGACYWINDOWSFSENCODING environment variable (PEP 529) has priority over UTF-8 Mode.
Effects of UTF-8 Mode:
- sys.getfilesystemencoding() returns 'UTF-8'.
- locale.getpreferredencoding() returns 'UTF-8'; its do_setlocale argument, and the locale encoding, are ignored.
- sys.stdin and sys.stdout error handler is set to surrogateescape.
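The effects listed above can be inspected from a child interpreter started with -X utf8. The following is a rough sketch: probe_utf8_mode and the PROBE script are hypothetical names, the environment variables that could override the result are removed first, and encoding names should be compared case-insensitively because their spelling differs between Python versions.

```python
import json
import os
import subprocess
import sys

# Script executed by the child; it reports its settings as JSON.
PROBE = (
    "import json, locale, sys; "
    "print(json.dumps({"
    "'fs': sys.getfilesystemencoding(), "
    "'preferred': locale.getpreferredencoding(), "
    "'stdin_errors': sys.stdin.errors, "
    "'stdout_errors': sys.stdout.errors}))"
)


def probe_utf8_mode() -> dict:
    """Return encoding settings seen by a child running in UTF-8 Mode."""
    env = dict(os.environ)
    for name in ("PYTHONUTF8", "PYTHONIOENCODING",
                 "PYTHONLEGACYWINDOWSFSENCODING"):
        env.pop(name, None)
    cmd = [sys.executable, "-X", "utf8", "-c", PROBE]
    result = subprocess.run(cmd, env=env, stdin=subprocess.DEVNULL,
                            capture_output=True, text=True, check=True)
    return json.loads(result.stdout)


info = probe_utf8_mode()
```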
Side effects:
- open() uses the UTF-8 encoding by default. However, it still uses the strict error handler by default.
- os.fsdecode() and os.fsencode() use the UTF-8 encoding.
- Command line arguments, environment variables and filenames use the UTF-8 encoding.
Relationship with the locale coercion (PEP 538)
The POSIX locale enables the locale coercion (PEP 538) and the UTF-8 mode (PEP 540). When the locale coercion is enabled, enabling the UTF-8 mode has no additional effect.
The UTF-8 Mode has the same effect as locale coercion:
- sys.getfilesystemencoding() returns 'UTF-8',
- locale.getpreferredencoding() returns 'UTF-8', and
- the sys.stdin and sys.stdout error handlers are set to surrogateescape.
These changes only affect Python code. But the locale coercion has additional effects: the LC_CTYPE environment variable and the LC_CTYPE locale are set to a UTF-8 locale like C.UTF-8. One side effect is that non-Python code is also impacted by the locale coercion. The two PEPs are complementary.
On platforms like CentOS 7 where locale coercion is not supported, the POSIX locale only enables UTF-8 Mode. In this case, Python code uses the UTF-8 encoding and ignores the locale encoding, whereas non-Python code uses the locale encoding, which is usually ASCII for the POSIX locale.
While the UTF-8 Mode is supported on all platforms and can be enabled with any locale, the locale coercion is not supported by all platforms and is restricted to the POSIX locale.
The UTF-8 Mode only has an impact on Python child processes when the PYTHONUTF8 environment variable is set to 1, whereas the locale coercion sets the LC_CTYPE environment variable, which impacts all child processes.
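The inheritance through the environment can be sketched with two levels of child interpreters: the parent sets PYTHONUTF8=1, the first child starts a second one without any option, and the second one reports its flag. The helper and script names below are hypothetical, and sys.flags.utf8_mode assumes Python 3.7 or later.

```python
import os
import subprocess
import sys

# Script run by the innermost interpreter.
INNER = "import sys; print(sys.flags.utf8_mode)"

# Script run by the intermediate interpreter; it only spawns the inner one.
OUTER = (
    "import subprocess, sys; "
    "subprocess.run([sys.executable, '-c', {!r}], check=True)".format(INNER)
)


def grandchild_utf8_flag() -> int:
    """Return sys.flags.utf8_mode as seen two process levels down."""
    env = dict(os.environ)
    env["PYTHONUTF8"] = "1"  # inherited by every descendant process
    result = subprocess.run([sys.executable, "-c", OUTER], env=env,
                            capture_output=True, text=True, check=True)
    return int(result.stdout.strip())


inherited_flag = grandchild_utf8_flag()
```

A mode enabled only with -X utf8 is not carried by the environment, so a nested interpreter would fall back to its own defaults in that case.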
The benefit of the locale coercion approach is that it helps ensure that encoding handling in binary extension modules and child processes is consistent with Python’s encoding handling. The upside of the UTF-8 Mode approach is that it allows an embedding application to change the interpreter’s behaviour without having to change the process global locale settings.
Backward Compatibility
The only backward incompatible change is that the POSIX locale now enables the UTF-8 Mode by default: it will now use the UTF-8 encoding, ignore the locale encoding, and change stdin and stdout error handlers to surrogateescape.
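On a Unix-like system this change can be observed by forcing the POSIX locale on a child interpreter through LC_ALL=C and leaving PYTHONUTF8 unset. The sketch below makes that assumption; the helper name is hypothetical, and on Windows, which has no POSIX locale, the result depends on the interpreter's defaults instead.

```python
import os
import subprocess
import sys


def utf8_mode_in_posix_locale() -> int:
    """Return sys.flags.utf8_mode of a child started with LC_ALL=C."""
    env = dict(os.environ)
    env.pop("PYTHONUTF8", None)  # no explicit opt-in or opt-out
    env["LC_ALL"] = "C"          # force the POSIX ("C") locale
    cmd = [sys.executable, "-c", "import sys; print(sys.flags.utf8_mode)"]
    result = subprocess.run(cmd, env=env, capture_output=True,
                            text=True, check=True)
    return int(result.stdout.strip())


posix_flag = utf8_mode_in_posix_locale()
```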
Annex: Encodings And Error Handlers
UTF-8 Mode changes the default encoding and error handler used by open(), os.fsdecode(), os.fsencode(), sys.stdin, sys.stdout and sys.stderr.
Encoding and error handler
Function | Default | UTF-8 Mode or POSIX locale |
---|---|---|
open() | locale/strict | UTF-8/strict |
os.fsdecode(), os.fsencode() | locale/surrogateescape | UTF-8/surrogateescape |
sys.stdin, sys.stdout | locale/strict | UTF-8/surrogateescape |
sys.stderr | locale/backslashreplace | UTF-8/backslashreplace |
By comparison, Python 3.6 uses:
Function | Default | POSIX locale |
---|---|---|
open() | locale/strict | locale/strict |
os.fsdecode(), os.fsencode() | locale/surrogateescape | locale/surrogateescape |
sys.stdin, sys.stdout | locale/strict | locale/surrogateescape |
sys.stderr | locale/backslashreplace | locale/backslashreplace |
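To make the error handlers named in these tables concrete, the sketch below encodes the same string with strict, surrogateescape and backslashreplace. The sample string and the encode_or_none helper are illustrative assumptions; the string mixes an ordinary non-ASCII character with the lone surrogate that surrogateescape produces for the byte 0xFF.

```python
# "é" followed by U+DCFF, the surrogate that stands for the raw byte 0xFF.
text = "\u00e9\udcff"


def encode_or_none(value: str, errors: str):
    """Encode to UTF-8 with the given handler, or return None on failure."""
    try:
        return value.encode("utf-8", errors=errors)
    except UnicodeEncodeError:
        return None


# strict refuses the lone surrogate.
strict_bytes = encode_or_none(text, "strict")

# surrogateescape turns the surrogate back into the original byte.
escaped_bytes = encode_or_none(text, "surrogateescape")

# backslashreplace never fails; it writes a readable escape sequence instead.
replaced_bytes = encode_or_none(text, "backslashreplace")
```

The last case shows why backslashreplace suits sys.stderr: error messages can always be written, even when they contain characters that cannot be encoded.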
Encoding and error handler on Windows
On Windows, the encodings and error handlers are different:
Function | Default | Legacy Windows FS encoding | UTF-8 Mode |
---|---|---|---|
open() | mbcs/strict | mbcs/strict | UTF-8/strict |
os.fsdecode(), os.fsencode() | UTF-8/surrogatepass | mbcs/replace | UTF-8/surrogatepass |
sys.stdin, sys.stdout | UTF-8/surrogateescape | UTF-8/surrogateescape | UTF-8/surrogateescape |
sys.stderr | UTF-8/backslashreplace | UTF-8/backslashreplace | UTF-8/backslashreplace |
By comparison, Python 3.6 uses:
Function | Default | Legacy Windows FS encoding |
---|---|---|
open() | mbcs/strict | mbcs/strict |
os.fsdecode(), os.fsencode() | UTF-8/surrogatepass | mbcs/replace |
sys.stdin, sys.stdout | UTF-8/surrogateescape | UTF-8/surrogateescape |
sys.stderr | UTF-8/backslashreplace | UTF-8/backslashreplace |
The “Legacy Windows FS encoding” is enabled by the PYTHONLEGACYWINDOWSFSENCODING environment variable.
If stdin and/or stdout is redirected to a pipe, sys.stdin and/or sys.stdout uses the mbcs encoding by default rather than UTF-8. But in UTF-8 Mode, sys.stdin and sys.stdout always use the UTF-8 encoding.
Note
There is no POSIX locale on Windows. The ANSI code page is used as the locale encoding, and this code page never uses the ASCII encoding.
Links
- bpo-29240: Implementation of the PEP 540: Add a new UTF-8 Mode
- PEP 538: “Coercing the legacy C locale to C.UTF-8”
- PEP 529: “Change Windows filesystem encoding to UTF-8”
- PEP 528: “Change Windows console encoding to UTF-8”
- PEP 383: “Non-decodable Bytes in System Character Interfaces”
Post History
- 2017-12: [Python-Dev] PEP 540: Add a new UTF-8 Mode
- 2017-04: [Python-Dev] Proposed BDFL Delegate update for PEPs 538 & 540 (assuming UTF-8 for *nix system boundaries)
- 2017-01: [Python-ideas] PEP 540: Add a new UTF-8 Mode
- 2017-01: bpo-28180: Implementation of the PEP 538: coerce C locale to C.utf-8 (msg284764)
- 2016-08-17: bpo-27781: Change sys.getfilesystemencoding() on Windows to UTF-8 (msg272916): Victor proposed -X utf8 for the PEP 529 (Change Windows filesystem encoding to UTF-8)
Version History
- Version 4: locale.getpreferredencoding() now returns 'UTF-8' in the UTF-8 Mode.
- Version 3: The UTF-8 Mode does not change the open() default error handler (strict) anymore, and the Strict UTF-8 Mode has been removed.
- Version 2: Rewrite the PEP from scratch to make it much shorter and easier to understand.
- Version 1: First version posted to python-dev.
Copyright
This document has been placed in the public domain.
Source: https://github.com/python-discord/peps/blob/main/pep-0540.txt
Last modified: 2022-01-21 11:03:51 GMT