PEP 584 – Add Union Operators To dict
PEP: 584
Title: Add Union Operators To dict
Author: Steven D’Aprano <steve at pearwood.info>, Brandt Bucher <brandt at python.org>
BDFL-Delegate: Guido van Rossum <guido at python.org>
Status: Final
Type: Standards Track
Created: 01-Mar-2019
Python-Version: 3.9
Post-History: 01-Mar-2019, 16-Oct-2019, 02-Dec-2019, 04-Feb-2020, 17-Feb-2020
Resolution: Python-Dev
Contents
- Abstract
- Motivation
- Rationale
- Specification
- Reference Implementation
- Major Objections
- Rejected Ideas
- Examples
- IPython/zmq/ipkernel.py
- IPython/zmq/kernelapp.py
- matplotlib/backends/backend_svg.py
- matplotlib/delaunay/triangulate.py
- matplotlib/legend.py
- numpy/ma/core.py
- praw/internal.py
- pygments/lexer.py
- requests/sessions.py
- sphinx/domains/__init__.py
- sphinx/ext/doctest.py
- sphinx/ext/inheritance_diagram.py
- sphinx/highlighting.py
- sphinx/quickstart.py
- sympy/abc.py
- sympy/parsing/maxima.py
- sympy/printing/ccode.py and sympy/printing/fcode.py
- sympy/utilities/runtests.py
- Related Discussions
- Copyright
Abstract
This PEP proposes adding merge (|) and update (|=) operators to the built-in dict class.
Note
After this PEP was accepted, the decision was made to also implement the new operators for several other standard library mappings.
Motivation
The current ways to merge two dicts have several disadvantages:
dict.update
d1.update(d2) modifies d1 in-place.
e = d1.copy(); e.update(d2) is not an expression and needs a temporary variable.
{**d1, **d2}
Dict unpacking looks ugly and is not easily discoverable. Few people would be able to guess what it means the first time they see it, or think of it as the “obvious way” to merge two dicts.
I’m sorry for PEP 448, but even if you know about **d in simpler contexts, if you were to ask a typical Python user how to combine two dicts into a new one, I doubt many people would think of {**d1, **d2}. I know I myself had forgotten about it when this thread started!
{**d1, **d2} ignores the types of the mappings and always returns a dict. type(d1)({**d1, **d2}) fails for dict subclasses such as defaultdict that have an incompatible __init__ method.
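As a quick illustration of the two limitations just described, the sketch below merges a defaultdict with a plain dict. The variable names and sample data are assumptions made for this example, not part of the PEP.

```python
from collections import defaultdict

# Illustrative data; the names below are assumptions for this sketch.
d1 = defaultdict(list, {"spam": [1]})
d2 = {"eggs": [2]}

# Dict unpacking always builds a plain dict, so the defaultdict type
# (and its default factory) is lost.
merged = {**d1, **d2}

# Passing the merged dict to type(d1) fails, because defaultdict expects
# its first positional argument to be a callable factory or None.
try:
    type(d1)(merged)
    rebuilt_ok = True
except TypeError:
    rebuilt_ok = False
```

Here merged is a plain dict, and rebuilding the original type raises TypeError.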
collections.ChainMap
ChainMap is unfortunately poorly-known and doesn’t qualify as “obvious”. It also resolves duplicate keys in the opposite order to that expected (“first seen wins” instead of “last seen wins”). Like dict unpacking, it is tricky to get it to honor the desired subclass. For the same reason, type(d1)(ChainMap(d2, d1)) fails for some subclasses of dict.
Further, ChainMaps wrap their underlying dicts, so writes to the ChainMap will modify the original dict:
>>> from collections import ChainMap
>>> d1 = {'spam': 1}
>>> d2 = {'eggs': 2}
>>> merged = ChainMap(d2, d1)
>>> merged['eggs'] = 999
>>> d2
{'eggs': 999}
dict(d1, **d2)
This “neat trick” is not well-known, and only works when d2 is entirely string-keyed:
>>> d1 = {"spam": 1}
>>> d2 = {3665: 2}
>>> dict(d1, **d2)
Traceback (most recent call last):
...
TypeError: keywords must be strings
Rationale
The new operators will have the same relationship to the dict.update method as the list concatenate (+) and extend (+=) operators have to list.extend. Note that this is somewhat different from the relationship that |/|= have with set.update; the authors have determined that allowing the in-place operator to accept a wider range of types (as list does) is a more useful design, and that restricting the types of the binary operator’s operands (again, as list does) will help avoid silent errors caused by complicated implicit type casting on both sides.
Key conflicts will be resolved by keeping the rightmost value. This matches the existing behavior of similar dict operations, where the last seen value always wins:
{'a': 1, 'a': 2}
{**d, **e}
d.update(e)
d[k] = v
{k: v for x in (d, e) for (k, v) in x.items()}
All of the above follow the same rule. This PEP takes the position that this behavior is simple, obvious, usually the behavior we want, and should be the default behavior for dicts. This means that dict union is not commutative; in general d | e != e | d.
Similarly, the iteration order of the key-value pairs in the dictionary will follow the same semantics as the examples above, with each newly added key (and its value) being appended to the current sequence.
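The rule can be checked directly. The following sketch (with made-up sample dicts) compares the existing spellings with the new operator, looking at both the values and the iteration order; it assumes Python 3.9 or later for the | line.

```python
# Sample data chosen for illustration only.
d = {"a": 1, "b": 2}
e = {"b": 20, "c": 30}

unpacked = {**d, **e}
updated = dict(d)
updated.update(e)
comprehension = {k: v for x in (d, e) for (k, v) in x.items()}
union = d | e  # requires Python 3.9 or later

# "b" keeps its original position but takes the right-hand value;
# "c" is new, so it is appended at the end.
expected_items = [("a", 1), ("b", 20), ("c", 30)]
all_agree = all(
    list(result.items()) == expected_items
    for result in (unpacked, updated, comprehension, union)
)

# Swapping the operands changes which value of "b" survives.
commutes = (d | e) == (e | d)
```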
Specification
Dict union will return a new dict consisting of the left operand merged with the right operand, each of which must be a dict (or an instance of a dict subclass). If a key appears in both operands, the last-seen value (i.e. that from the right-hand operand) wins:
>>> d = {'spam': 1, 'eggs': 2, 'cheese': 3}
>>> e = {'cheese': 'cheddar', 'aardvark': 'Ethel'}
>>> d | e
{'spam': 1, 'eggs': 2, 'cheese': 'cheddar', 'aardvark': 'Ethel'}
>>> e | d
{'cheese': 3, 'aardvark': 'Ethel', 'spam': 1, 'eggs': 2}
The augmented assignment version operates in-place:
>>> d |= e
>>> d
{'spam': 1, 'eggs': 2, 'cheese': 'cheddar', 'aardvark': 'Ethel'}
Augmented assignment behaves identically to the update method called with a single positional argument, so it also accepts anything implementing the Mapping protocol (more specifically, anything with the keys and __getitem__ methods) or iterables of key-value pairs. This is analogous to list += and list.extend, which accept any iterable, not just lists. Continued from above:
>>> d | [('spam', 999)]
Traceback (most recent call last):
...
TypeError: can only merge dict (not "list") to dict
>>> d |= [('spam', 999)]
>>> d
{'spam': 999, 'eggs': 2, 'cheese': 'cheddar', 'aardvark': 'Ethel'}
When new keys are added, their order matches their order within the right-hand mapping, if any exists for its type.
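To make the "keys and __getitem__" wording concrete, here is a small sketch using a hypothetical Overrides class that is not a dict and offers only those two methods. The class name and data are assumptions for illustration; the example needs Python 3.9 or later.

```python
class Overrides:
    """Hypothetical non-dict object exposing only keys() and __getitem__."""

    def __init__(self, **values):
        self._values = values

    def keys(self):
        return self._values.keys()

    def __getitem__(self, key):
        return self._values[key]


settings = {"debug": False, "level": 1}

# In-place update accepts the mapping-like object, as dict.update would.
settings |= Overrides(level=2, color="red")

# The binary operator is stricter: both operands must be dicts.
try:
    settings | Overrides(level=3)
    binary_accepted = True
except TypeError:
    binary_accepted = False
```

The existing key "level" keeps its position while the new key "color" is appended, following the order reported by the right-hand object's keys().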
Reference Implementation
One of the authors has written a C implementation.
An approximate pure-Python implementation is:
def __or__(self, other):
    if not isinstance(other, dict):
        return NotImplemented
    new = dict(self)
    new.update(other)
    return new
def __ror__(self, other):
    if not isinstance(other, dict):
        return NotImplemented
    new = dict(other)
    new.update(self)
    return new
def __ior__(self, other):
    dict.update(self, other)
    return self
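One way to experiment with the approximate implementation is to place the three methods on a dict subclass. The class name PyUnionDict below is hypothetical, and the sketch only mirrors the pure-Python approximation; it says nothing about how the C implementation is structured.

```python
class PyUnionDict(dict):
    """Hypothetical dict subclass carrying the pure-Python operators."""

    def __or__(self, other):
        if not isinstance(other, dict):
            return NotImplemented
        new = dict(self)
        new.update(other)
        return new

    def __ror__(self, other):
        if not isinstance(other, dict):
            return NotImplemented
        new = dict(other)
        new.update(self)
        return new

    def __ior__(self, other):
        dict.update(self, other)
        return self


left = PyUnionDict(spam=1, eggs=2)

# Binary merge: right-hand values win and a new plain dict is returned.
merged = left | {"eggs": 20, "ham": 3}

# Reflected merge: the subclass instance is the right operand here.
reflected = {"eggs": 0} | left

# Returning NotImplemented lets Python raise TypeError for non-dicts.
try:
    left | [("spam", 0)]
    list_accepted = True
except TypeError:
    list_accepted = False

# In-place update accepts an iterable of key-value pairs.
left |= [("spam", 999)]
```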
Major Objections
Dict Union Is Not Commutative
Union is commutative, but dict union will not be (d | e != e | d).
Response
There is precedent for non-commutative unions in Python:
>>> {0} | {False}
{0}
>>> {False} | {0}
{False}
While the results may be equal, they are distinctly different. In general, a | b is not the same operation as b | a.
Dict Union Will Be Inefficient
Giving a pipe operator to mappings is an invitation to writing code that doesn’t scale well. Repeated dict union is inefficient: d | e | f | g | h creates and destroys three temporary mappings.
Response
The same argument applies to sequence concatenation.
Sequence concatenation grows with the total number of items in the sequences, leading to O(N**2) (quadratic) performance. Dict union is likely to involve duplicate keys, so the temporary mappings will not grow as fast.
Just as it is rare for people to concatenate large numbers of lists or tuples, the authors of this PEP believe that it will be rare for people to merge large numbers of dicts. collections.Counter is a dict subclass that supports many operators, and there are no known examples of people having performance issues due to combining large numbers of Counters. Further, a survey of the standard library by the authors found no examples of merging more than two dicts, so this is unlikely to be a performance problem in practice… “Everything is fast for small enough N”.
If one expects to be merging a large number of dicts where performance is an issue, it may be better to use an explicit loop and in-place merging:
new = {}
for d in many_dicts:
    new |= d
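A runnable version of this loop, wrapped in a hypothetical helper named merge_all, is sketched below alongside the chained form it replaces. The helper name and sample data are assumptions for illustration; both forms need Python 3.9 or later.

```python
from functools import reduce
from operator import or_


def merge_all(many_dicts):
    """Merge dicts left to right, reusing a single accumulator dict."""
    new = {}
    for d in many_dicts:
        new |= d
    return new


layers = [{"a": 1}, {"b": 2}, {"a": 3, "c": 4}]

# In-place merging: one accumulator, no intermediate dicts.
looped = merge_all(layers)

# Equivalent chained union: builds a fresh dict at every step.
chained = reduce(or_, layers, {})
```

Both approaches give the same result; they differ only in how many temporary dicts are created along the way.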
Dict Union Is Lossy
Dict union can lose data (values may disappear); no other form of union is lossy.
Response
It isn’t clear why the first part of this argument is a problem. dict.update() may throw away values, but not keys; that is expected behavior, and will remain expected behavior regardless of whether it is spelled as update() or |.
Other types of union are also lossy, in the sense of not being reversible; you cannot get back the two operands given only the union. a | b == 365… what are a and b?
Only One Way To Do It
Dict union will violate the Only One Way koan from the Zen.
Response
There is no such koan. “Only One Way” is a calumny about Python originating long ago from the Perl community.
More Than One Way To Do It
Okay, the Zen doesn’t say that there should be Only One Way To Do It. But it does have a prohibition against allowing “more than one way to do it”.
Response
There is no such prohibition. The “Zen of Python” merely expresses a preference for “only one obvious way”:
There should be one-- and preferably only one --obvious way to do it.
The emphasis here is that there should be an obvious way to do “it”. In the case of dict update operations, there are at least two different operations that we might wish to do:
- Update a dict in place: The Obvious Way is to use the update() method. If this proposal is accepted, the |= augmented assignment operator will also work, but that is a side-effect of how augmented assignments are defined. Which you choose is a matter of taste.
- Merge two existing dicts into a third, new dict: This PEP proposes that the Obvious Way is to use the | merge operator.
In practice, this preference for “only one way” is frequently violated in Python. For example, every for loop could be re-written as a while loop; every if block could be written as an if/else block. List, set and dict comprehensions could all be replaced by generator expressions. Lists offer no fewer than five ways to implement concatenation:
- Concatenation operator: a + b
- In-place concatenation operator: a += b
- Slice assignment: a[len(a):] = b
- Sequence unpacking: [*a, *b]
- Extend method: a.extend(b)
We should not be too strict about rejecting useful functionality because it violates “only one way”.
Dict Union Makes Code Harder To Understand
Dict union makes it harder to tell what code means. To paraphrase the objection rather than quote anyone in specific: “If I see spam | eggs, I can’t tell what it does unless I know what spam and eggs are”.
Response
This is very true. But it is equally true today, where the use of the | operator could mean any of:
- int/bool bitwise-or
- set/frozenset union
- any other overloaded operation
Adding dict union to the set of possibilities doesn’t seem to make it harder to understand the code. No more work is required to determine that spam and eggs are mappings than it would take to determine that they are sets, or integers. And good naming conventions will help:
flags |= WRITEABLE # Probably numeric bitwise-or.
DO_NOT_RUN = WEEKENDS | HOLIDAYS # Probably set union.
settings = DEFAULT_SETTINGS | user_settings | workspace_settings # Probably dict union.
What About The Full set API?
dicts are “set like”, and should support the full collection of set operators: |, &, ^, and -.
Response
This PEP does not take a position on whether dicts should support the full collection of set operators, and would prefer to leave that for a later PEP (one of the authors is interested in drafting such a PEP). For the benefit of any later PEP, a brief summary follows.
Set symmetric difference (^) is obvious and natural. For example, given two dicts:
d1 = {"spam": 1, "eggs": 2}
d2 = {"ham": 3, "eggs": 4}
the symmetric difference d1 ^ d2 would be {"spam": 1, "ham": 3}.
Set difference (-) is also obvious and natural, and an earlier version of this PEP included it in the proposal. Given the dicts above, we would have d1 - d2 be {"spam": 1} and d2 - d1 be {"ham": 3}.
Set intersection (&) is a bit more problematic. While it is easy to determine the intersection of keys in two dicts, it is not clear what to do with the values. Given the two dicts above, it is obvious that the only key of d1 & d2 must be "eggs", but not whether its value should be 2 or 4. “Last seen wins” (giving 4) has the advantage of consistency with other dict operations (and the proposed union operators).
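Since this PEP does not add these operators, the behaviors summarized above can only be emulated. The sketch below is one possible emulation using key views and comprehensions; the variable names are illustrative, and the "last seen wins" choice for intersection is the option discussed above, not a settled design.

```python
d1 = {"spam": 1, "eggs": 2}
d2 = {"ham": 3, "eggs": 4}

# Symmetric difference: keys that appear in exactly one of the dicts.
symmetric = {k: (d1 if k in d1 else d2)[k] for k in d1.keys() ^ d2.keys()}

# Difference: items of d1 whose keys are absent from d2.
difference = {k: v for k, v in d1.items() if k not in d2}

# Intersection with "last seen wins": shared keys, values taken from d2.
intersection = {k: d2[k] for k in d1.keys() & d2.keys()}
```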
What About Mapping And MutableMapping?
collections.abc.Mapping and collections.abc.MutableMapping should define | and |=, so subclasses could just inherit the new operators instead of having to define them.
Response
There are two primary reasons why adding the new operators to these classes would be problematic:
- Currently, neither defines a copy method, which would be necessary for | to create a new instance.
- Adding |= to MutableMapping (or a copy method to Mapping) would create compatibility issues for virtual subclasses.
Rejected Ideas
Rejected Semantics
There were at least four other proposed solutions for handling conflicting keys. These alternatives are left to subclasses of dict.
Raise
It isn’t clear that this behavior has many use-cases or will often be useful, but it will likely be annoying as any use of the dict union operator would have to be guarded with a try/except clause.
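Because such alternatives are left to subclasses, a project that does want the raising behavior could define it itself. The following is a minimal sketch with a hypothetical StrictDict class; the class name, the choice of KeyError, and the message format are all assumptions for illustration.

```python
class StrictDict(dict):
    """Hypothetical subclass whose | refuses to overwrite existing keys."""

    def __or__(self, other):
        if not isinstance(other, dict):
            return NotImplemented
        clashes = self.keys() & other.keys()
        if clashes:
            raise KeyError(f"conflicting keys: {sorted(clashes, key=repr)}")
        return StrictDict({**self, **other})


config = StrictDict(host="localhost")

# Disjoint keys merge normally.
combined = config | {"port": 8080}

# A shared key triggers the error instead of silently overwriting.
try:
    config | {"host": "example.org"}
    conflict_raised = False
except KeyError:
    conflict_raised = True
```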
Add The Values (As Counter Does, with +)
Too specialised to be used as the default behavior.
Leftmost Value (First-Seen) Wins
It isn’t clear that this behavior has many use-cases. In fact, one can simply reverse the order of the arguments:
d2 | d1 # d1 merged with d2, keeping existing values in d1
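A short check of the reversed-operand idiom, with sample data chosen for illustration (Python 3.9 or later). Note that reversing the operands also changes the iteration order of the result, which follows the left operand first.

```python
d1 = {"a": 1, "b": 2}
d2 = {"b": 20, "c": 30}

# d1 is on the right, so its values win for the shared key "b".
first_wins = d2 | d1

# Keys from d2 come first; "a" is new and is appended at the end.
key_order = list(first_wins)
```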
Concatenate Values In A List
This is likely to be too specialised to be the default. It is not clear what to do if the values are already lists:
{'a': [1, 2]} | {'a': [3, 4]}
Should this give {'a': [1, 2, 3, 4]} or {'a': [[1, 2], [3, 4]]}?
Rejected Alternatives
Use The Addition Operator
This PEP originally started life as a proposal for dict addition, using the + and += operators. That choice proved to be exceedingly controversial, with many people having serious objections to the choice of operator. For details, see previous versions of the PEP and the mailing list discussions.
Use The Left Shift Operator
The << operator didn’t seem to get much support on Python-Ideas, but no major objections either. Perhaps the strongest objection was Chris Angelico’s comment:
The “cuteness” value of abusing the operator to indicate information flow got old shortly after C++ did it.
Use A New Left Arrow Operator
Another suggestion was to create a new operator <-. Unfortunately this would be ambiguous: d <- e could mean d merge e or d less-than minus e.
Use A Method
A dict.merged() method would avoid the need for an operator at all. One subtlety is that it would likely need slightly different implementations when called as an unbound method versus as a bound method.
As an unbound method, the behavior could be similar to:
def merged(cls, *mappings, **kw):
    new = cls()  # Will this work for defaultdict?
    for m in mappings:
        new.update(m)
    new.update(kw)
    return new
As a bound method, the behavior could be similar to:
def merged(self, *mappings, **kw):
    new = self.copy()
    for m in mappings:
        new.update(m)
    new.update(kw)
    return new
Advantages
- Arguably, methods are more discoverable than operators.
- The method could accept any number of positional and keyword arguments, avoiding the inefficiency of creating temporary dicts.
- Accepts sequences of (key, value) pairs like the update method.
- Being a method, it is easy to override in a subclass if you need alternative behaviors such as “first wins”, “unique keys”, etc.
Disadvantages
- Would likely require a new kind of method decorator which combined the behavior of regular instance methods and classmethod. It would need to be public (but not necessarily a builtin) for those needing to override the method. There is a proof of concept.
- It isn’t an operator. Guido discusses why operators are useful. For another viewpoint, see Nick Coghlan’s blog post.
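To give a feel for what such a combined decorator might involve, here is a rough sketch. It is not the proof of concept referred to above; the descriptor name instance_or_class_method and the MergeDict class are hypothetical, written only to show the binding behavior.

```python
import types


class instance_or_class_method:
    """Hypothetical descriptor: binds to the instance when accessed on one,
    and to the class otherwise."""

    def __init__(self, func):
        self.func = func

    def __get__(self, obj, cls=None):
        target = cls if obj is None else obj
        return types.MethodType(self.func, target)


class MergeDict(dict):
    @instance_or_class_method
    def merged(this, *mappings, **kw):
        # Start empty when called on the class, from a copy otherwise.
        new = this() if isinstance(this, type) else type(this)(this)
        for m in mappings:
            new.update(m)
        new.update(kw)
        return new


# Called on the class: behaves like the unbound variant.
from_class = MergeDict.merged({"a": 1}, b=2)

# Called on an instance: behaves like the bound variant.
base = MergeDict(a=0)
from_instance = base.merged({"a": 1}, c=3)
```

The instance call leaves base untouched, since the merge works on a copy.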
Use a Function
Instead of a method, use a new built-in function merged(). One possible implementation could be something like this:
def merged(*mappings, **kw):
    if mappings and isinstance(mappings[0], dict):
        # If the first argument is a dict, use its type.
        new = mappings[0].copy()
        mappings = mappings[1:]
    else:
        # No positional arguments, or the first argument is a
        # sequence of (key, value) pairs.
        new = dict()
    for m in mappings:
        new.update(m)
    new.update(kw)
    return new
An alternative might be to forgo the arbitrary keywords, and take a single keyword parameter that specifies the behavior on collisions:
def merged(*mappings, on_collision=lambda k, v1, v2: v2):
    # implementation left as an exercise to the reader
    ...
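For readers who want something concrete to try, one possible way to fill in that signature is sketched below. This is only an illustration under the assumption that on_collision receives the key, the existing value, and the incoming value; the PEP does not specify an implementation.

```python
def merged(*mappings, on_collision=lambda k, v1, v2: v2):
    """Merge mappings left to right, resolving duplicate keys with a callback.

    Illustrative sketch only; the PEP leaves the implementation unspecified.
    """
    new = {}
    for m in mappings:
        # dict(m) accepts both mappings and iterables of (key, value) pairs.
        for key, value in dict(m).items():
            if key in new:
                new[key] = on_collision(key, new[key], value)
            else:
                new[key] = value
    return new


# Default callback: the last-seen value wins.
last_wins = merged({"a": 1}, {"a": 2, "b": 3})

# Keep the first-seen value instead.
first_wins = merged({"a": 1}, {"a": 2, "b": 3},
                    on_collision=lambda k, v1, v2: v1)

# Add colliding values, similar in spirit to Counter addition.
summed = merged({"a": 1}, [("a", 2)], on_collision=lambda k, v1, v2: v1 + v2)
```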
Advantages
- Most of the same advantages of the method solutions above.
- Doesn’t require a subclass to implement alternative behavior on collisions, just a function.
Disadvantages
- May not be important enough to be a builtin.
- Hard to override behavior if you need something like “first wins”, without losing the ability to process arbitrary keyword arguments.
Examples
The authors of this PEP did a survey of third party libraries for dictionary merging which might be candidates for dict union.
This is a cursory list based on a subset of whatever arbitrary third-party packages happened to be installed on one of the authors’ computers, and may not reflect the current state of any package. Also note that, while further (unrelated) refactoring may be possible, the rewritten version only adds usage of the new operators for an apples-to-apples comparison. It also reduces the result to an expression when it is efficient to do so.
IPython/zmq/ipkernel.py
Before:
aliases = dict(kernel_aliases)
aliases.update(shell_aliases)
After:
aliases = kernel_aliases | shell_aliases
IPython/zmq/kernelapp.py
Before:
kernel_aliases = dict(base_aliases)
kernel_aliases.update({
    'ip' : 'KernelApp.ip',
    'hb' : 'KernelApp.hb_port',
    'shell' : 'KernelApp.shell_port',
    'iopub' : 'KernelApp.iopub_port',
    'stdin' : 'KernelApp.stdin_port',
    'parent': 'KernelApp.parent',
})
if sys.platform.startswith('win'):
    kernel_aliases['interrupt'] = 'KernelApp.interrupt'
kernel_flags = dict(base_flags)
kernel_flags.update({
    'no-stdout' : (
        {'KernelApp' : {'no_stdout' : True}},
        "redirect stdout to the null device"),
    'no-stderr' : (
        {'KernelApp' : {'no_stderr' : True}},
        "redirect stderr to the null device"),
})
After:
kernel_aliases = base_aliases | {
    'ip' : 'KernelApp.ip',
    'hb' : 'KernelApp.hb_port',
    'shell' : 'KernelApp.shell_port',
    'iopub' : 'KernelApp.iopub_port',
    'stdin' : 'KernelApp.stdin_port',
    'parent': 'KernelApp.parent',
}
if sys.platform.startswith('win'):
    kernel_aliases['interrupt'] = 'KernelApp.interrupt'
kernel_flags = base_flags | {
    'no-stdout' : (
        {'KernelApp' : {'no_stdout' : True}},
        "redirect stdout to the null device"),
    'no-stderr' : (
        {'KernelApp' : {'no_stderr' : True}},
        "redirect stderr to the null device"),
}
matplotlib/backends/backend_svg.py
Before:
attrib = attrib.copy()
attrib.update(extra)
attrib = attrib.items()
After:
attrib = (attrib | extra).items()
matplotlib/delaunay/triangulate.py
Before:
edges = {}
edges.update(dict(zip(self.triangle_nodes[border[:,0]][:,1],
                      self.triangle_nodes[border[:,0]][:,2])))
edges.update(dict(zip(self.triangle_nodes[border[:,1]][:,2],
                      self.triangle_nodes[border[:,1]][:,0])))
edges.update(dict(zip(self.triangle_nodes[border[:,2]][:,0],
                      self.triangle_nodes[border[:,2]][:,1])))
Rewrite as:
edges = {}
edges |= zip(self.triangle_nodes[border[:,0]][:,1],
             self.triangle_nodes[border[:,0]][:,2])
edges |= zip(self.triangle_nodes[border[:,1]][:,2],
             self.triangle_nodes[border[:,1]][:,0])
edges |= zip(self.triangle_nodes[border[:,2]][:,0],
             self.triangle_nodes[border[:,2]][:,1])
matplotlib/legend.py
Before:
hm = default_handler_map.copy()
hm.update(self._handler_map)
return hm
After:
return default_handler_map | self._handler_map
numpy/ma/core.py
Before:
_optinfo = {}
_optinfo.update(getattr(obj, '_optinfo', {}))
_optinfo.update(getattr(obj, '_basedict', {}))
if not isinstance(obj, MaskedArray):
    _optinfo.update(getattr(obj, '__dict__', {}))
After:
_optinfo = {}
_optinfo |= getattr(obj, '_optinfo', {})
_optinfo |= getattr(obj, '_basedict', {})
if not isinstance(obj, MaskedArray):
    _optinfo |= getattr(obj, '__dict__', {})
praw/internal.py
Before:
data = {'name': six.text_type(user), 'type': relationship}
data.update(kwargs)
After:
data = {'name': six.text_type(user), 'type': relationship} | kwargs
pygments/lexer.py
Before:
kwargs.update(lexer.options)
lx = lexer.__class__(**kwargs)
After:
lx = lexer.__class__(**(kwargs | lexer.options))
requests/sessions.py
Before:
merged_setting = dict_class(to_key_val_list(session_setting))
merged_setting.update(to_key_val_list(request_setting))
After:
merged_setting = dict_class(to_key_val_list(session_setting))
merged_setting |= to_key_val_list(request_setting)
sphinx/domains/__init__.py
Before:
self.attrs = self.known_attrs.copy()
self.attrs.update(attrs)
After:
self.attrs = self.known_attrs | attrs
sphinx/ext/doctest.py
Before:
new_opt = code[0].options.copy()
new_opt.update(example.options)
example.options = new_opt
After:
example.options = code[0].options | example.options
sphinx/ext/inheritance_diagram.py
Before:
n_attrs = self.default_node_attrs.copy()
e_attrs = self.default_edge_attrs.copy()
g_attrs.update(graph_attrs)
n_attrs.update(node_attrs)
e_attrs.update(edge_attrs)
After:
g_attrs |= graph_attrs
n_attrs = self.default_node_attrs | node_attrs
e_attrs = self.default_edge_attrs | edge_attrs
sphinx/highlighting.py
Before:
kwargs.update(self.formatter_args)
return self.formatter(**kwargs)
After:
return self.formatter(**(kwargs | self.formatter_args))
sphinx/quickstart.py
Before:
d2 = DEFAULT_VALUE.copy()
d2.update(dict(("ext_"+ext, False) for ext in EXTENSIONS))
d2.update(d)
d = d2
After:
d = DEFAULT_VALUE | dict(("ext_"+ext, False) for ext in EXTENSIONS) | d
sympy/abc.py
Before:
clash = {}
clash.update(clash1)
clash.update(clash2)
return clash1, clash2, clash
After:
return clash1, clash2, clash1 | clash2
sympy/parsing/maxima.py
Before:
dct = MaximaHelpers.__dict__.copy()
dct.update(name_dict)
obj = sympify(str, locals=dct)
After:
obj = sympify(str, locals=MaximaHelpers.__dict__|name_dict)
sympy/printing/ccode.py and sympy/printing/fcode.py
Before:
self.known_functions = dict(known_functions)
userfuncs = settings.get('user_functions', {})
self.known_functions.update(userfuncs)
After:
self.known_functions = known_functions | settings.get('user_functions', {})
sympy/utilities/runtests.py
Before:
globs = globs.copy()
if extraglobs is not None:
    globs.update(extraglobs)
After:
globs = globs | (extraglobs if extraglobs is not None else {})
The above examples show that sometimes the | operator leads to a clear increase in readability, reducing the number of lines of code and improving clarity. However, other examples using the | operator lead to long, complex single expressions, possibly well over the PEP 8 maximum line length of 79 characters. As with any other language feature, the programmer should use their own judgement about whether | improves their code.
Copyright
This document is placed in the public domain or under the CC0-1.0-Universal license, whichever is more permissive.
Source: https://github.com/python-discord/peps/blob/main/pep-0584.rst
Last modified: 2022-01-21 11:03:51 GMT