PEP 488 – Elimination of PYO files
- PEP
- 488
- Title
- Elimination of PYO files
- Author
- Brett Cannon <brett at python.org>
- Status
- Final
- Type
- Standards Track
- Created
- 20-Feb-2015
- Python-Version
- 3.5
- Post-History
- 06-Mar-2015, 13-Mar-2015, 20-Mar-2015
Abstract
This PEP proposes eliminating the concept of PYO files from Python. To continue the support of the separation of bytecode files based on their optimization level, this PEP proposes extending the PYC file name to include the optimization level in the bytecode repository directory when there are optimizations applied.
Rationale
As of today, bytecode files come in two flavours: PYC and PYO. A PYC
file is the bytecode file generated and read from when no
optimization level is specified at interpreter startup (i.e., -O
is not specified). A PYO file represents the bytecode file that is
read/written when any optimization level is specified (i.e., when
-O
or -OO
is specified). This means that while PYC
files clearly delineate the optimization level used when they were
generated – namely no optimizations beyond the peepholer – the same
is not true for PYO files. To put this in terms of optimization
levels and the file extension:
- 0:
.pyc
- 1 (
-O
):.pyo
- 2 (
-OO
):.pyo
The reuse of the .pyo
file extension for both level 1 and 2
optimizations means that there is no clear way to tell what
optimization level was used to generate the bytecode file. In terms
of reading PYO files, this can lead to an interpreter using a mixture
of optimization levels with its code if the user was not careful to
make sure all PYO files were generated using the same optimization
level (typically done by blindly deleting all PYO files and then
using the compileall module to compile all-new PYO files [1]).
This issue is only compounded when people optimize Python code beyond
what the interpreter natively supports, e.g., using the astoptimizer
project [2].
In terms of writing PYO files, the need to delete all PYO files
every time one either changes the optimization level they want to use
or are unsure of what optimization was used the last time PYO files
were generated leads to unnecessary file churn. The change proposed
by this PEP also allows for all optimization levels to be
pre-compiled for bytecode files ahead of time, something that is
currently impossible thanks to the reuse of the .pyo
file
extension for multiple optimization levels.
As for distributing bytecode-only modules, having to distribute both
.pyc
and .pyo
files is unnecessary for the common use-case
of code obfuscation and smaller file deployments. This means that
bytecode-only modules will only load from their non-optimized
.pyc
file name.
Proposal
To eliminate the ambiguity that PYO files present, this PEP proposes
eliminating the concept of PYO files and their accompanying .pyo
file extension. To allow for the optimization level to be unambiguous
as well as to avoid having to regenerate optimized bytecode files
needlessly in the __pycache__ directory, the optimization level
used to generate the bytecode file will be incorporated into the
bytecode file name. When no optimization level is specified, the
pre-PEP .pyc
file name will be used (i.e., no optimization level
will be specified in the file name). For example, a source file named
foo.py
in CPython 3.5 could have the following bytecode files
based on the interpreter’s optimization level (none, -O
, and
-OO
):
- 0:
foo.cpython-35.pyc
(i.e., no change) - 1:
foo.cpython-35.opt-1.pyc
- 2:
foo.cpython-35.opt-2.pyc
Currently bytecode file names are created by
importlib.util.cache_from_source()
, approximately using the
following expression defined by PEP 3147 [3], [4]:
'{name}.{cache_tag}.pyc'.format(name=module_name,
cache_tag=sys.implementation.cache_tag)
This PEP proposes to change the expression when an optimization level is specified to:
'{name}.{cache_tag}.opt-{optimization}.pyc'.format(
name=module_name,
cache_tag=sys.implementation.cache_tag,
optimization=str(sys.flags.optimize))
The “opt-” prefix was chosen so as to provide a visual separator from the cache tag. The placement of the optimization level after the cache tag was chosen to preserve lexicographic sort order of bytecode file names based on module name and cache tag which will not vary for a single interpreter. The “opt-” prefix was chosen over “o” so as to be somewhat self-documenting. The “opt-” prefix was chosen over “O” so as to not have any confusion in case “0” was the leading prefix of the optimization level.
A period was chosen over a hyphen as a separator so as to distinguish clearly that the optimization level is not part of the interpreter version as specified by the cache tag. It also lends to the use of the period in the file name to delineate semantically different concepts.
For example, if -OO
had been passed to the interpreter then
instead of importlib.cpython-35.pyo
the file name would be
importlib.cpython-35.opt-2.pyc
.
Leaving out the new opt-
tag when no optimization level is
applied should increase backwards-compatibility. This is also more
understanding of Python implementations which have no use for
optimization levels (e.g., PyPy[10]_).
It should be noted that this change in no way affects the performance
of import. Since the import system looks for a single bytecode file
based on the optimization level of the interpreter already and
generates a new bytecode file if it doesn’t exist, the introduction
of potentially more bytecode files in the __pycache__
directory
has no effect in terms of stat calls. The interpreter will continue
to look for only a single bytecode file based on the optimization
level and thus no increase in stat calls will occur.
The only potentially negative result of this PEP is the probable
increase in the number of .pyc
files and thus increase in storage
use. But for platforms where this is an issue,
sys.dont_write_bytecode
exists to turn off bytecode generation so
that it can be controlled offline.
Implementation
An implementation of this PEP is available [11].
importlib
As importlib.util.cache_from_source()
is the API that exposes
bytecode file paths as well as being directly used by importlib, it
requires the most critical change. As of Python 3.4, the function’s
signature is:
importlib.util.cache_from_source(path, debug_override=None)
This PEP proposes changing the signature in Python 3.5 to:
importlib.util.cache_from_source(path, debug_override=None, *, optimization=None)
The introduced optimization
keyword-only parameter will control
what optimization level is specified in the file name. If the
argument is None
then the current optimization level of the
interpreter will be assumed (including no optimization). Any argument
given for optimization
will be passed to str()
and must have
str.isalnum()
be true, else ValueError
will be raised (this
prevents invalid characters being used in the file name). If the
empty string is passed in for optimization
then the addition of
the optimization will be suppressed, reverting to the file name
format which predates this PEP.
It is expected that beyond Python’s own two optimization levels,
third-party code will use a hash of optimization names to specify the
optimization level, e.g.
hashlib.sha256(','.join(['no dead code', 'const folding'])).hexdigest()
.
While this might lead to long file names, it is assumed that most
users never look at the contents of the __pycache__ directory and so
this won’t be an issue.
The debug_override
parameter will be deprecated. A False
value will be equivalent to optimization=1
while a True
value will represent optimization=''
(a None
argument will
continue to mean the same as for optimization
). A
deprecation warning will be raised when debug_override
is given a
value other than None
, but there are no plans for the complete
removal of the parameter at this time (but removal will be no later
than Python 4).
The various module attributes for importlib.machinery which relate to
bytecode file suffixes will be updated [7]. The
DEBUG_BYTECODE_SUFFIXES
and OPTIMIZED_BYTECODE_SUFFIXES
will
both be documented as deprecated and set to the same value as
BYTECODE_SUFFIXES
(removal of DEBUG_BYTECODE_SUFFIXES
and
OPTIMIZED_BYTECODE_SUFFIXES
is not currently planned, but will be
not later than Python 4).
All various finders and loaders will also be updated as necessary, but updating the previous mentioned parts of importlib should be all that is required.
Rest of the standard library
The various functions exposed by the py_compile
and
compileall
functions will be updated as necessary to make sure
they follow the new bytecode file name semantics [6], [1]. The CLI
for the compileall
module will not be directly affected (the
-b
flag will be implicit as it will no longer generate .pyo
files when -O
is specified).
Compatibility Considerations
Any code directly manipulating bytecode files from Python 3.2 on
will need to consider the impact of this change on their code (prior
to Python 3.2 – including all of Python 2 – there was no
__pycache__ which already necessitates bifurcating bytecode file
handling support). If code was setting the debug_override
argument to importlib.util.cache_from_source()
then care will be
needed if they want the path to a bytecode file with an optimization
level of 2. Otherwise only code not using
importlib.util.cache_from_source()
will need updating.
As for people who distribute bytecode-only modules (i.e., use a
bytecode file instead of a source file), they will have to choose
which optimization level they want their bytecode files to be since
distributing a .pyo
file with a .pyc
file will no longer be
of any use. Since people typically only distribute bytecode files for
code obfuscation purposes or smaller distribution size then only
having to distribute a single .pyc
should actually be beneficial
to these use-cases. And since the magic number for bytecode files
changed in Python 3.5 to support PEP 465 there is no need to support
pre-existing .pyo
files [8].
Rejected Ideas
Completely dropping optimization levels from CPython
Some have suggested that instead of accommodating the various optimization levels in CPython, we should instead drop them entirely. The argument is that significant performance gains would occur from runtime optimizations through something like a JIT and not through pre-execution bytecode optimizations.
This idea is rejected for this PEP as that ignores the fact that there are people who do find the pre-existing optimization levels for CPython useful. It also assumes that no other Python interpreter would find what this PEP proposes useful.
Alternative formatting of the optimization level in the file name
Using the “opt-” prefix and placing the optimization level between the cache tag and file extension is not critical. All options which have been considered are:
importlib.cpython-35.opt-1.pyc
importlib.cpython-35.opt1.pyc
importlib.cpython-35.o1.pyc
importlib.cpython-35.O1.pyc
importlib.cpython-35.1.pyc
importlib.cpython-35-O1.pyc
importlib.O1.cpython-35.pyc
importlib.o1.cpython-35.pyc
importlib.1.cpython-35.pyc
These were initially rejected either because they would change the sort order of bytecode files, possible ambiguity with the cache tag, or were not self-documenting enough. An informal poll was taken and people clearly preferred the formatting proposed by the PEP [9]. Since this topic is non-technical and of personal choice, the issue is considered solved.
Embedding the optimization level in the bytecode metadata
Some have suggested that rather than embedding the optimization level of bytecode in the file name that it be included in the file’s metadata instead. This would mean every interpreter had a single copy of bytecode at any time. Changing the optimization level would thus require rewriting the bytecode, but there would also only be a single file to care about.
This has been rejected due to the fact that Python is often installed as a root-level application and thus modifying the bytecode file for modules in the standard library are always possible. In this situation integrators would need to guess at what a reasonable optimization level was for users for any/all situations. By allowing multiple optimization levels to co-exist simultaneously it frees integrators from having to guess what users want and allows users to utilize the optimization level they want.
References
- [1] (1, 2)
- The compileall module (https://docs.python.org/3/library/compileall.html#module-compileall)
- [2]
- The astoptimizer project (https://pypi.python.org/pypi/astoptimizer)
- [3]
importlib.util.cache_from_source()
(https://docs.python.org/3.5/library/importlib.html#importlib.util.cache_from_source)- [4]
- Implementation of
importlib.util.cache_from_source()
from CPython 3.4.3rc1 (https://hg.python.org/cpython/file/038297948389/Lib/importlib/_bootstrap.py#l437) - [6]
- The py_compile module (https://docs.python.org/3/library/compileall.html#module-compileall)
- [7]
- The importlib.machinery module (https://docs.python.org/3/library/importlib.html#module-importlib.machinery)
- [8]
importlib.util.MAGIC_NUMBER
(https://docs.python.org/3/library/importlib.html#importlib.util.MAGIC_NUMBER)- [9]
- Informal poll of file name format options on Google+ (https://plus.google.com/u/0/+BrettCannon/posts/fZynLNwHWGm)
- [10]
- The PyPy Project (http://pypy.org/)
- [11]
- Implementation of PEP 488 (http://bugs.python.org/issue23731)
Copyright
This document has been placed in the public domain.
Source: https://github.com/python-discord/peps/blob/main/pep-0488.txt
Last modified: 2022-03-09 16:04:44 GMT