Feature or enhancement
Proposal:
Not sure if this qualifies as a feature or not; it's somewhere between "feature" and "proactive differential reduction work" 🙂
TL;DR: tarfile supports pax archives as a style of tar. pax has two kinds of extension block: "local" (typeflag x) extensions apply to the next member (file, directory, etc.), while "global" (typeflag g) extensions apply to every subsequent member unless overridden by a subsequent local extension (or another global extension). I propose tightening tarfile's acceptance of pax streams containing global extensions when those extensions affect local member state in a way that's incoherent. In particular, that means rejecting any global extension that contains a path, linkpath, or size record.
Problem statement
In practice local pax extensions are widely used, while global extensions are rare. This is because the interaction between the two is confusing and not well modeled by various tar parsers, meaning that mixing the two in the same stream can produce inconsistent results across implementations.
This is worsened by the fact that "global" extensions can modify state that they have no business modifying. For example, a global extension can set a path= or linkpath= record, which affects all subsequent members that don't set their own path= or linkpath= in a local extension. In effect this means that every such member gets extracted with the same filename (or link target), which is not a coherent thing to encode in an archive.
Global extensions can also mess with tar framing itself: a global extension can set a size= record, which (per the spec) has similar overriding behavior. This in effect means that every subsequent ustar header's size gets ignored, which is similarly not coherent.
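To make this concrete, here is a minimal sketch that writes a pax archive whose global extension carries a path record, then reads back what tarfile collected. It uses the documented pax_headers argument to tarfile.open() and the TarFile.pax_headers attribute; the helper name build_archive_with_global_path and the file names are made up for illustration. The member names seen on read may depend on the Python version, so the sketch prints them rather than asserting a particular result.

```python
import io
import tarfile


def build_archive_with_global_path() -> bytes:
    """Write a two-member pax archive preceded by a global `path` record."""
    buf = io.BytesIO()
    with tarfile.open(
        fileobj=buf,
        mode="w",
        format=tarfile.PAX_FORMAT,
        pax_headers={"path": "same-name.txt"},  # written as a typeflag-g block
    ) as tf:
        for name in ("a.txt", "b.txt"):
            data = name.encode()
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tf.addfile(info, io.BytesIO(data))
    return buf.getvalue()


archive = build_archive_with_global_path()
with tarfile.open(fileobj=io.BytesIO(archive)) as tf:
    member_names = [m.name for m in tf.getmembers()]
    # TarFile.pax_headers holds the global records seen so far.
    global_records = dict(tf.pax_headers)

print("global records:", global_records)
print("member names as read:", member_names)
```

If the member names printed here all match the global path value, that is the "everything gets the same filename" outcome described above.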
Current behavior
Right now tarfile handles both local and global extensions, and attempts to apply pax's last-wins record policy across them. This is correct per the pax standard, but is IMO overly permissive when it comes to records like path, linkpath, and size (where a global record for any of those is unlikely to do what a user actually wants).
This is in contrast to other global records like uid and mtime -- setting these globally is a reasonable size optimization when they don't vary across members, and there's no significant differential risk with them (since they don't affect placement on disk or the parser's perceived size of each member).
Proposed behavior
I propose adding a hard error state whenever tarfile encounters a global pax extension that contains one of the risky records mentioned above (path, linkpath, or size).
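As a rough illustration of the check itself (not of how tarfile would implement it), here is a sketch of a pre-flight scan over the raw bytes of an uncompressed tar stream. Everything named here is hypothetical: RiskyGlobalPaxRecord, parse_pax_keys, and reject_risky_global_records do not exist in tarfile. The scan is deliberately simplified: it trusts each ustar header's octal size field, does not handle base-256 sizes, and does not honor size overrides from local extensions.

```python
BLOCK = 512
RISKY_GLOBAL_KEYS = frozenset({"path", "linkpath", "size"})


class RiskyGlobalPaxRecord(ValueError):
    """Hypothetical error type for this sketch; not part of tarfile."""


def parse_pax_keys(payload: bytes) -> list[str]:
    """Return the keys of the '<length> <key>=<value>' records in a pax payload."""
    keys = []
    pos = 0
    while pos < len(payload):
        space = payload.index(b" ", pos)
        # The length counts the whole record: digits, space, body, newline.
        length = int(payload[pos:space])
        if length <= space - pos:
            raise ValueError("malformed pax record length")
        record = payload[space + 1 : pos + length]
        keys.append(record.partition(b"=")[0].decode("utf-8"))
        pos += length
    return keys


def reject_risky_global_records(data: bytes) -> None:
    """Raise if any typeflag-g block sets path, linkpath, or size."""
    offset = 0
    while offset + BLOCK <= len(data):
        header = data[offset : offset + BLOCK]
        if header == bytes(BLOCK):
            break  # end-of-archive marker
        # Size field: octal ASCII in bytes 124..135 of the header.
        size = int(header[124:136].strip(b"\0 ") or b"0", 8)
        typeflag = header[156:157]
        body = offset + BLOCK
        if typeflag == b"g":
            keys = parse_pax_keys(data[body : body + size])
            risky = RISKY_GLOBAL_KEYS.intersection(keys)
            if risky:
                raise RiskyGlobalPaxRecord(
                    f"global pax extension at offset {offset} sets {sorted(risky)}"
                )
        # Skip the payload, rounded up to a whole number of blocks.
        offset = body + -(-size // BLOCK) * BLOCK
```

Benign global records such as comment or mtime pass through this scan untouched; only the three framing- or placement-affecting keys trigger the error.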
I suspect this could be made the default without significant breakage risk, since these records are already virtually unheard of in global extensions (due to their incoherent/undesirable semantics).
I did a review of the sdists of the top 10,000 packages on PyPI, and found virtually no global extensions at all among those tars. Those that do have global extensions do not use the offending record types, so there would be no breakage risk (to them) with this change.
However, because this would be a restriction of existing behavior, it's difficult to say with absolute certainty that there's no breakage risk. So, we could instead make this opt-in, either as a new kind of policy on the extraction APIs or as an addition to one or all of the current filters.
CCing some people who I suspect might be interested in this: @emmatyping @sethmlarson @thatch
Has this already been discussed elsewhere?
This is a minor feature, which does not need previous discussion elsewhere
Links to previous discussion of this feature:
I couldn't find any direct previous discussion of this. Some previous discussions of pax/global pax extensions include #149578, #136602, #136601, and #83869, but none are directly about rejecting tar streams that contain specific global pax records.
Apologies if I've missed another discussion, however!