PEP 552 – Deterministic pycs
- PEP
- 552
- Title
- Deterministic pycs
- Author
- Benjamin Peterson <benjamin at python.org>
- Status
- Final
- Type
- Standards Track
- Created
- 04-Sep-2017
- Python-Version
- 3.7
- Post-History
- 07-Sep-2017
- Resolution
- Python-Dev
Contents
Abstract
This PEP proposes an extension to the pyc format to make it more deterministic.
Rationale
A reproducible build is one where the same byte-for-byte output is generated every time the same sources are built—even across different machines (naturally subject to the requirement that they have rather similar environments set up). Reproducibility is important for security. It is also a key concept in content-based build systems such as Bazel, which are most effective when the output files’ contents are a deterministic function of the input files’ contents.
The current Python pyc format is the marshaled code object of the module prefixed by a magic number, the source timestamp, and the source file size. The presence of a source timestamp means that a pyc is not a deterministic function of the input file’s contents—it also depends on volatile metadata, the mtime of the source. Thus, pycs are a barrier to proper reproducibility.
Distributors of Python code are currently stuck with the options of
- not distributing pycs and losing the caching advantages
- distributing pycs and losing reproducibility
- carefully giving all Python source files a deterministic timestamp (see, for example, https://github.com/python/cpython/pull/296)
- doing a complicated mixture of 1. and 2. like generating pycs at installation time
None of these options are very attractive. This PEP proposes allowing the timestamp to be replaced with a deterministic hash. The current timestamp invalidation method will remain the default, though. Despite its nondeterminism, timestamp invalidation works well for many workflows and usecases. The hash-based pyc format can impose the cost of reading and hashing every source file, which is more expensive than simply checking timestamps. Thus, for now, we expect it to be used mainly by distributors and power use cases.
(Note there are other problems [1] [2] we do not address here that can make pycs non-deterministic.)
Specification
The pyc header currently consists of 3 32-bit words. We will expand it to 4. The first word will continue to be the magic number, versioning the bytecode and pyc format. The second word, conceptually the new word, will be a bit field. The interpretation of the rest of the header and invalidation behavior of the pyc depends on the contents of the bit field.
If the bit field is 0, the pyc is a traditional timestamp-based pyc. I.e., the third and forth words will be the timestamp and file size respectively, and invalidation will be done by comparing the metadata of the source file with that in the header.
If the lowest bit of the bit field is set, the pyc is a hash-based pyc. We call
the second lowest bit the check_source
flag. Following the bit field is a
64-bit hash of the source file. We will use a SipHash with a hardcoded key of
the contents of the source file. Another fast hash like MD5 or BLAKE2 would
also work. We choose SipHash because Python already has a builtin implementation
of it from PEP 456, although an interface that allows picking the SipHash key
must be exposed to Python. Security of the hash is not a concern, though we pass
over completely-broken hashes like MD5 to ease auditing of Python in controlled
environments.
When Python encounters a hash-based pyc, its behavior depends on the setting of
the check_source
flag. If the check_source
flag is set, Python will
determine the validity of the pyc by hashing the source file and comparing the
hash with the expected hash in the pyc. If the pyc needs to be regenerated, it
will be regenerated as a hash-based pyc again with the check_source
flag
set.
For hash-based pycs with the check_source
unset, Python will simply load the
pyc without checking the hash of the source file. The expectation in this case
is that some external system (e.g., the local Linux distribution’s package
manager) is responsible for keeping pycs up to date, so Python itself doesn’t
have to check. Even when validation is disabled, the hash field should be set
correctly, so out-of-band consistency checkers can verify the up-to-dateness of
the pyc. Note also that the PEP 3147 edict that pycs without corresponding
source files not be loaded will still be enforced for hash-based pycs.
The programmatic APIs of py_compile
and compileall
will support
generation of hash-based pycs. Principally, py_compile
will define a new
enumeration corresponding to all the available pyc invalidation modules:
class PycInvalidationMode(Enum):
TIMESTAMP
CHECKED_HASH
UNCHECKED_HASH
py_compile.compile
, compileall.compile_dir
, and
compileall.compile_file
will all gain an invalidation_mode
parameter,
which accepts a value of the PycInvalidationMode
enumeration.
The compileall
tool will be extended with a command new option,
--invalidation-mode
to generate hash-based pycs with and without the
check_source
bit set. --invalidation-mode
will be a tristate option
taking values timestamp
(the default), checked-hash
, and
unchecked-hash
corresponding to the values of PycInvalidationMode
.
importlib.util
will be extended with a source_hash(source)
function that
computes the hash used by the pyc writing code for a bytestring source.
Runtime configuration of hash-based pyc invalidation will be facilitated by a
new --check-hash-based-pycs
interpreter option. This is a tristate option,
which may take 3 values: default
, always
, and never
. The default
value, default
, means the check_source
flag in hash-based pycs
determines invalidation as described above. always
causes the interpreter to
hash the source file for invalidation regardless of value of check_source
bit. never
causes the interpreter to always assume hash-based pycs are
valid. When --check-hash-based-pycs=never
is in effect, unchecked hash-based
pycs will be regenerated as unchecked hash-based pycs. Timestamp-based pycs are
unaffected by --check-hash-based-pycs
.
References
- [1]
- http://benno.id.au/blog/2013/01/15/python-determinism
- [2]
- http://bugzilla.opensuse.org/show_bug.cgi?id=1049186
Credits
The author would like to thank Gregory P. Smith, Christian Heimes, and Steve Dower for useful conversations on the topic of this PEP.
Copyright
This document has been placed in the public domain.
Source: https://github.com/python-discord/peps/blob/main/pep-0552.rst
Last modified: 2022-03-09 16:04:44 GMT