PEP 680 – tomllib: Support for Parsing TOML in the Standard Library
- PEP: 680
- Title: tomllib: Support for Parsing TOML in the Standard Library
- Author: Taneli Hukkinen, Shantanu Jain <hauntsaninja at gmail.com>
- Sponsor: Petr Viktorin <encukou at gmail.com>
- Discussions-To: https://discuss.python.org/t/13040
- Status: Accepted
- Type: Standards Track
- Created: 01-Jan-2022
- Python-Version: 3.11
- Post-History: 11-Jan-2022
- Resolution: Python-Dev
Contents
- Abstract
- Motivation
- Rationale
- Specification
- Maintenance Implications
- Backwards Compatibility
- Security Implications
- How to Teach This
- Reference Implementation
- Rejected Ideas
- Previous Discussion
- Appendix A: Differences between proposed API and toml
- Copyright
Abstract
This PEP proposes adding the tomllib module to the standard library for parsing TOML (Tom’s Obvious Minimal Language, https://toml.io).
Motivation
TOML is the format of choice for Python packaging, as evidenced by PEP 517, PEP 518 and PEP 621. This creates a bootstrapping problem for Python build tools, forcing them to vendor a TOML parsing package or employ other undesirable workarounds, and causes serious issues for repackagers and other downstream consumers. Including TOML support in the standard library would neatly solve all of these issues.
Further, many Python tools are now configurable via TOML, such as black, mypy, pytest, tox, pylint and isort. Many that are not, such as flake8, cite the lack of standard library support as a main reason why.
Given the special place TOML already has in the Python ecosystem, it makes sense
for it to be an included battery.
Finally, TOML as a format is increasingly popular (for the reasons outlined in PEP 518), with various Python TOML libraries having about 2000 reverse dependencies on PyPI (for comparison, requests has about 28000 reverse dependencies). Hence, this is likely to be a generally useful addition, even looking beyond the needs of Python packaging and related tools.
Rationale
This PEP proposes basing the standard library support for reading TOML on the third-party library tomli (github.com/hukkin/tomli).
Many projects have recently switched to using tomli, such as pip, build, pytest, mypy, black, flit, coverage, setuptools-scm and cibuildwheel.
tomli is actively maintained and well-tested. It is about 800 lines of code with 100% test coverage, and passes all tests in the proposed official TOML compliance test suite, as well as the more established BurntSushi/toml-test suite.
Specification
A new module tomllib will be added to the Python standard library, exposing the following public functions:

    def load(
        fp: SupportsRead[bytes],
        /,
        *,
        parse_float: Callable[[str], Any] = ...,
    ) -> dict[str, Any]: ...

    def loads(
        s: str,
        /,
        *,
        parse_float: Callable[[str], Any] = ...,
    ) -> dict[str, Any]: ...

tomllib.load deserializes a binary file-like object containing a TOML document to a Python dict. The fp argument must have a read() method with the same API as io.RawIOBase.read().
tomllib.loads deserializes a str instance containing a TOML document to a Python dict.
The parse_float argument is a callable object that takes as input the original string representation of a TOML float, and returns a corresponding Python object (similar to parse_float in json.load). For example, the user may pass a function returning a decimal.Decimal, for use cases where exact precision is important. By default, TOML floats are parsed as instances of the Python float type.
The returned object contains only basic Python objects (str, int, bool, float, datetime.{datetime,date,time}, list, dict with string keys), and the results of parse_float.
tomllib.TOMLDecodeError is raised in the case of invalid TOML.
Note that this PEP does not propose tomllib.dump or tomllib.dumps functions; see Including an API for writing TOML for details.
Maintenance Implications
Stability of TOML
The release of TOML 1.0.0 in January 2021 indicates the TOML format should now be officially considered stable. Empirically, TOML has proven to be a stable format even prior to the release of TOML 1.0.0. From the changelog, we can see that TOML has had no major changes since April 2020, and has had two releases in the past five years (2017-2021).
In the event of changes to the TOML specification, we can treat minor revisions as bug fixes and update the implementation in place. In the event of major breaking changes, we should preserve support for TOML 1.x.
Maintainability of proposed implementation
The proposed implementation (tomli) is pure Python, well tested and weighs in at under 1000 lines of code. It is minimalist, offering a smaller API surface area than other TOML implementations.
The author of tomli is willing to help integrate tomli into the standard library and help maintain it, as per this post. Furthermore, Python core developer Petr Viktorin has indicated a willingness to maintain a read API, as per this post.
Rewriting the parser in C is not deemed necessary at this time. It is rare for TOML parsing to be a bottleneck in applications, and users with higher performance needs can use a third-party library (as is already often the case with JSON, despite Python offering a standard library C-extension module).
TOML support as a slippery slope for other things
As discussed in the Motivation section, TOML holds a special place in the Python ecosystem, for reading PEP 518 pyproject.toml packaging and tool configuration files.
This chief reason to include TOML in the standard library does not apply to
other formats, such as YAML or MessagePack.
In addition, the simplicity of TOML distinguishes it from other formats like YAML, which are highly complicated to construct and parse.
An API for writing TOML may, however, be added in a future PEP.
Backwards Compatibility
This proposal has no backwards compatibility issues within the standard
library, as it describes a new module.
Any existing third-party module named tomllib will break, as import tomllib will import the standard library module. However, tomllib is not registered on PyPI, so it is unlikely that any module with this name is widely used.
Note that we avoid using the more straightforward name toml to avoid backwards compatibility implications for users who have pinned versions of the current toml PyPI package. For more details, see the Alternative names for the module section.
Security Implications
Errors in the implementation could cause potential security issues. However, the parser’s output is limited to simple data types; inability to load arbitrary classes avoids security issues common in more “powerful” formats like pickle and YAML. Also, the implementation will be in pure Python, which reduces security issues endemic to C, such as buffer overflows.
How to Teach This
The API of tomllib mimics that of other well-established file format libraries, such as json and pickle. The lack of a dump function will be explained in the documentation, with a link to relevant third-party libraries (e.g. tomlkit, tomli-w, pytomlpp).
Reference Implementation
The proposed implementation can be found at https://github.com/hukkin/tomli
Rejected Ideas
Basing on another TOML implementation
Several potential alternative implementations exist:
- tomlkit is well established, actively maintained and supports TOML 1.0.0. An important difference is that tomlkit supports style roundtripping. As a result, it has a more complex API and implementation (about 5x as much code as tomli). Its author does not believe that tomlkit is a good choice for the standard library.
- toml is a very widely used library. However, it is not actively maintained, does not support TOML 1.0.0 and has a number of known bugs. Its API is more complex than that of tomli. It allows customising output style through a complicated encoder API, and some very limited and mostly unused functionality to preserve input style through an undocumented decoder API. For more details on its API differences from this PEP, refer to Appendix A.
- pytomlpp is a Python wrapper for the C++ project toml++. Pure Python libraries are easier to maintain than extension modules.
- rtoml is a Python wrapper for the Rust project toml-rs and hence has similar shortcomings to pytomlpp. In addition, it does not support TOML 1.0.0.
- Writing an implementation from scratch. It’s unclear what we would get from this; tomli meets our needs and the author is willing to help with its inclusion in the standard library.
Including an API for writing TOML
There are several reasons to not include an API for writing TOML.
The ability to write TOML is not needed for the use cases that motivate this PEP: core Python packaging tools, and projects that need to read TOML configuration files.
Use cases that involve editing an existing TOML file (as opposed to writing a
brand new one) are better served by a style preserving library. TOML is
intended as a human-readable and -editable configuration format, so it’s
important to preserve comments, formatting and other markup. This requires
a parser whose output includes style-related metadata, making it impractical
to output plain Python types like str and dict. Furthermore, it substantially complicates the design of the API.
Even without considering style preservation, there are too many degrees of freedom in how to design a write API. For example, what default style (indentation, vertical and horizontal spacing, quotes, etc) should the library use for the output, and how much control should users be given over it? How should the library handle input and output validation? Should it support serialization of custom types, and if so, how? While there are reasonable options for resolving these issues, the nature of the standard library is such that we only get “one chance to get it right”.
Currently, no CPython core developers have expressed willingness to maintain a write API, or sponsor a PEP that includes one. Since it is hard to change or remove something in the standard library, it is safer to err on the side of exclusion for now, and potentially revisit this later.
Therefore, writing TOML is left to third-party libraries. If a good API and relevant use cases for it are found later, write support can be added in a future PEP.
Assorted API details
Types accepted as the first argument of tomllib.load
The toml library on PyPI allows passing paths (and lists of path-like objects, ignoring missing files and merging the documents into a single object) to its load function. However, allowing this here would be inconsistent with the behavior of json.load, pickle.load and other standard library functions. If we agree that consistency here is desirable, allowing paths is out of scope for this PEP. This can easily and explicitly be worked around in user code, or by using a third-party library.
The proposed API takes a binary file, while toml.load takes a text file and json.load takes either. Using a binary file allows us to ensure UTF-8 is the encoding used (ensuring correct parsing on platforms with other default encodings, such as Windows), and avoid incorrectly parsing files containing single carriage returns as valid TOML due to universal newlines in text mode.
Type accepted as the first argument of tomllib.loads
While tomllib.load takes a binary file, tomllib.loads takes a text string. This may seem inconsistent at first.
Quoting the TOML v1.0.0 specification:
A TOML file must be a valid UTF-8 encoded Unicode document.
tomllib.loads does not intend to load a TOML file, but rather the document that the file stores. The most natural representation of a Unicode document in Python is str, not bytes.
It is possible to add bytes support in the future if needed, but we are not aware of any use cases for it.
Controlling the type of mappings returned by tomllib.load[s]
The toml library on PyPI accepts a _dict argument in its load[s] functions, which works similarly to the object_hook argument in json.load[s]. There are several uses of _dict found on https://grep.app; however, almost all of them are passing _dict=OrderedDict, which should be unnecessary as of Python 3.7.
We found two instances of relevant use: in one case, a custom class was passed
for friendlier KeyErrors; in the other, the custom class had several
additional lookup and mutation methods (e.g. to help resolve dotted keys).
Such a parameter is not necessary for the core use cases outlined in the Motivation section. The absence of this can be pretty easily worked around using a wrapper class, transformer function, or a third-party library. Finally, support could be added later in a backward-compatible way.
Removing support for parse_float in tomllib.load[s]
This option is not strictly necessary, since TOML floats should be implemented as “IEEE 754 binary64 values”, which is equivalent to a Python float on most architectures.
The TOML specification uses the word “SHOULD”, however, implying a recommendation that can be ignored for valid reasons. Parsing floats differently, such as to decimal.Decimal, allows users extra precision beyond that promised by the TOML format. In the experience of tomli’s author, this is particularly useful in scientific and financial applications. This is also useful for other cases that need greater precision, or where end-users include non-developers who may not be aware of the limits of binary64 floats.
There are also niche architectures where the Python float is not an IEEE 754 binary64 value. The parse_float argument allows users to achieve correct TOML semantics even on such architectures.
Alternative names for the module
Ideally, we would be able to use the toml module name.
However, the toml package on PyPI is widely used, so there are backward compatibility concerns. Since the standard library takes precedence over third party packages, libraries and applications that currently depend on the toml package would likely break when upgrading Python versions due to the many API incompatibilities listed in Appendix A, even if they pin their dependency versions.
To further clarify, applications with pinned dependencies are of greatest concern here. Even if we were able to obtain control of the toml PyPI package name and repurpose it for a backport of the proposed new module, we would still break users on new Python versions that included it in the standard library, regardless of whether they have pinned an older version of the existing toml package. This is unfortunate, since pinning would likely be a common response to breaking changes introduced by repurposing the toml package as a backport (that is incompatible with today’s toml).
Finally, the toml package on PyPI is not actively maintained, but as of yet, efforts to request that the author add other maintainers have been unsuccessful, so action here would likely have to be taken without the author’s consent.
Instead, this PEP proposes the name tomllib. This mirrors plistlib and xdrlib, two other file format modules in the standard library, as well as other modules, such as pathlib, contextlib and graphlib.
Other names considered but rejected include:
- tomlparser. This mirrors configparser, but is perhaps somewhat less appropriate if we include a write API in the future.
- tomli. This assumes we use tomli as the basis for implementation.
- toml under some namespace, such as parser.toml. However, this is awkward, especially so since existing parsing libraries like json, pickle, xml, html etc. would not be included in the namespace.
Previous Discussion
Appendix A: Differences between proposed API and toml
This appendix covers the differences between the API proposed in this PEP and that of the third-party package toml. These differences are relevant to understanding the amount of breakage we could expect if we used the toml name for the standard library module, as well as to better understand the design space. Note that this list might not be exhaustive.
- No proposed inclusion of a write API (no toml.dump[s])
  This PEP currently proposes not including a write API; that is, there will be no equivalent of toml.dump or toml.dumps, as discussed at Including an API for writing TOML.
  If we included a write API, it would be relatively straightforward to convert most code that uses toml to the new standard library module (acknowledging that this is very different from a compatible API, as it would still require code changes).
  A significant fraction of toml users rely on this, based on comparing occurrences of “toml.load” to occurrences of “toml.dump”.
- Different first argument of toml.load
  toml.load has the following signature:

      def load(
          f: Union[SupportsRead[str], str, bytes, list[PathLike | str | bytes]],
          _dict: Type[MutableMapping[str, Any]] = ...,
          decoder: TomlDecoder = ...,
      ) -> MutableMapping[str, Any]: ...

  This is quite different from the first argument proposed in this PEP: SupportsRead[bytes].
  Recapping the reasons for this, previously mentioned at Types accepted as the first argument of tomllib.load:
  - Allowing paths (and even lists of paths) as arguments is inconsistent with other similar functions in the standard library.
  - Using SupportsRead[bytes] allows us to ensure UTF-8 is the encoding used, and avoid incorrectly parsing single carriage returns as valid TOML.
  A significant fraction of toml users rely on this, based on manual inspection of occurrences of “toml.load”.
- Errors
  toml raises TomlDecodeError, vs. the proposed PEP 8-compliant TOMLDecodeError.
  A significant fraction of toml users rely on this, based on occurrences of “TomlDecodeError”.
- toml.load[s] accepts a _dict argument
  Discussed at Controlling the type of mappings returned by tomllib.load[s].
  As mentioned there, almost all usage consists of _dict=OrderedDict, which is not necessary in Python 3.7 and later.
- toml.load[s] support an undocumented decoder argument
  It seems the intended use case is for an implementation of comment preservation. The information recorded is not sufficient to roundtrip the TOML document preserving style, the implementation has known bugs, the feature is undocumented and we could only find one instance of its use on https://grep.app.
  The toml.TomlDecoder interface exposed is far from simple, containing nine methods.
  Users are likely better served by a more complete implementation of style-preserving parsing and writing.
- toml.dump[s] support an encoder argument
  Note that we currently propose to not include a write API; however, if that were to change, these differences would likely become relevant.
  The encoder argument enables two use cases:
  - control over how custom types should be serialized, and
  - control over how output should be formatted.
  The first is reasonable; however, we could only find two instances of this on https://grep.app. One of these two used this ability to add support for dumping decimal.Decimal, which a potential standard library implementation would support out of the box. If needed for other types, this use case could be well served by the equivalent of the default argument in json.dump.
  The second use case is enabled by allowing users to specify subclasses of toml.TomlEncoder and overriding methods to specify parts of the TOML writing process. The API consists of five methods and exposes substantial implementation detail.
  There is some usage of the encoder API on https://grep.app; however, it appears to account for a tiny fraction of the overall usage of toml.
- Timezones
  toml uses and exposes custom toml.tz.TomlTz timezone objects. The proposed implementation uses datetime.timezone objects from the standard library.
Copyright
This document is placed in the public domain or under the CC0-1.0-Universal license, whichever is more permissive.
Source: https://github.com/python-discord/peps/blob/main/pep-0680.rst
Last modified: 2022-02-22 14:41:53 GMT