PyPi Transparency
I’ve been noodling around with another Trillian personality.
It’s another in a theme that interests me: providing tamperproof logs for the packages in the popular package management registries.
The Golang team recently announced Go Module Mirror, which is built atop Trillian. It seems to me that all the package registries (Go Modules, npm, Maven, NuGet, etc.) would benefit from tamperproof logs hosted by a trusted third party.
As you may have guessed, PyPi Transparency is a log for PyPi packages. PyPi is comprehensive, definitive and trusted but, as with Go Module Mirror, it doesn’t hurt to provide a backup of some of its data. In the case of this solution, Trillian hosts a log of self-calculated SHA-256 hashes for Python packages that are added to it.
The current (!) flow for the tool is that the client takes a variant of a requirements.txt file that includes the package’s filename and, optionally, a path to pip download’ed packages. The client calculates SHA-256 hashes of the packages that are in its cache and submits each hash with the package’s name, version and filename to the server.
The server uses the package identifiers to obtain its own copy of the package (PyPi, GCS, file system are supported). If configured, the server will cache a copy of the package too (GCS, file system are supported), compute its own SHA-256 hash for the package and corroborate this value with the client.
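The client-side hashing step can be sketched roughly as follows. This is only an illustration in Python, not the tool’s actual code: the function names `sha256_file` and `build_submission`, and the dictionary keys, are assumptions made up for the example.

```python
import hashlib
from pathlib import Path


def sha256_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large wheels are never fully in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_submission(name, version, filename, cache_dir):
    """Assemble what the client sends: identifiers plus its own hash.

    The field names here are hypothetical, not the real message schema.
    """
    return {
        "name": name,
        "version": version,
        "filename": filename,
        "sha256": sha256_file(Path(cache_dir) / filename),
    }
```

The server would run the same calculation against its own copy of the package and compare the two digests.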
For the following requirements.txt:
google-api-core==1.14.2
grpcio==1.23.0
opencensus==0.7.2
prometheus-client==0.7.1
The client, configured with a local /pypi-cache, iterates over the 4 packages, calculating each package’s hash and shipping the combination to the server. It first tries to add the package to the server’s repository, then it gets the package from the server. Lastly (but not yet implemented) it has the server provide an inclusion proof for the package:
[main:loop] google-api-core [1.14.2]
[File:Open] Package: google-api-core [1.14.2]
[file:Filename] /pypi-cache/google_api_core-1.14.2-py2.py3-none-any.whl
[hash:SHA256] Warning: calculation exhausts reader
[addPackage] Latency: 9.134141
[main] ok:true
[getInclusionProof] Latency: 0.933272
[main]
[getPackage] Latency: 5.468117
[main] ok:true
[main:loop] grpcio [1.23.0]
[File:Open] Package: grpcio [1.23.0]
[file:Filename] /pypi-cache/grpcio-1.23.0-cp37-cp37m-manylinux1_x86_64.whl
[hash:SHA256] Warning: calculation exhausts reader
[addPackage] Latency: 27.740469
[main] ok:true
[getInclusionProof] Latency: 1.076748
[main]
[getPackage] Latency: 3.609743
[main] ok:true
[main:loop] opencensus [0.7.2]
[File:Open] Package: opencensus [0.7.2]
[file:Filename] /pypi-cache/opencensus-0.7.2-py2.py3-none-any.whl
[hash:SHA256] Warning: calculation exhausts reader
[addPackage] Latency: 19.833829
[main] ok:true
[getInclusionProof] Latency: 0.814223
[main]
[getPackage] Latency: 3.593434
[main] ok:true
[main:loop] prometheus_client [0.7.1]
[File:Open] Package: prometheus_client [0.7.1]
[file:Filename] /pypi-cache/prometheus_client-0.7.1.tar.gz
[hash:SHA256] Warning: calculation exhausts reader
[addPackage] Latency: 14.878428
[main] ok:true
[getInclusionProof] Latency: 1.662481
[main]
[getPackage] Latency: 3.828621
[main] ok:true
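Since the inclusion proof isn’t implemented yet, here is a rough sketch of what verifying one might involve. It assumes RFC 6962-style Merkle hashing (the scheme Trillian logs use) and follows the verification steps described in RFC 9162. The function names and the leaf contents are hypothetical; a real personality would ask Trillian for the proof rather than build the tree itself.

```python
import hashlib


def leaf_hash(data: bytes) -> bytes:
    # RFC 6962: leaves are hashed with a 0x00 prefix.
    return hashlib.sha256(b"\x00" + data).digest()


def node_hash(left: bytes, right: bytes) -> bytes:
    # RFC 6962: interior nodes are hashed with a 0x01 prefix.
    return hashlib.sha256(b"\x01" + left + right).digest()


def _split(n: int) -> int:
    """Largest power of two strictly less than n (for n > 1)."""
    k = 1
    while k * 2 < n:
        k *= 2
    return k


def root_hash(leaves) -> bytes:
    if not leaves:
        raise ValueError("this sketch does not handle the empty tree")
    if len(leaves) == 1:
        return leaf_hash(leaves[0])
    k = _split(len(leaves))
    return node_hash(root_hash(leaves[:k]), root_hash(leaves[k:]))


def inclusion_proof(index: int, leaves) -> list:
    """Sibling hashes from the leaf at `index` up to the root."""
    if len(leaves) == 1:
        return []
    k = _split(len(leaves))
    if index < k:
        return inclusion_proof(index, leaves[:k]) + [root_hash(leaves[k:])]
    return inclusion_proof(index - k, leaves[k:]) + [root_hash(leaves[:k])]


def verify_inclusion(index, size, leaf, proof, root) -> bool:
    """Recompute the root from a leaf and its proof, then compare."""
    if index >= size:
        return False
    fn, sn = index, size - 1
    r = leaf_hash(leaf)
    for p in proof:
        if sn == 0:
            return False
        if fn % 2 == 1 or fn == sn:
            r = node_hash(p, r)
            # Skip levels where this node has no right sibling.
            while fn % 2 == 0 and fn != 0:
                fn >>= 1
                sn >>= 1
        else:
            r = node_hash(r, p)
        fn >>= 1
        sn >>= 1
    return sn == 0 and r == root
```

With a proof like this, the client would not need to trust the server’s say-so; it can recompute the root for itself and compare it with a root it obtained earlier.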
The server is configured using its own /server-cache, but it will pull un-cached packages directly from PyPi. The server responds to client requests to add packages by verifying the client’s package’s name, version and SHA-256. The server does not receive packages from the client; it pulls packages for itself to verify them. The server responds to client requests to get packages by furnishing the client with the SHA-256 hash of packages requested:
[server:AddPackage] Started
[RepositoryClient:Open] Proxy configured
[File:Open] Package: google-api-core [1.14.2]
[file:Filename] /server-cache/google_api_core-1.14.2-py2.py3-none-any.whl
[RepositoryClient:Open] Proxy contains package
[hash:SHA256] Warning: calculation exhausts reader
[server:add] Started
[server:add] OK
[server:GetInclusionProof] Started
[prove] Not yet implemented.
[server:GetPackage] Started
[server:get] Started
[server:get] Leaf hash: f0ca7dd23aa9d956eba4168ab1b10d45e97f6127115f2e9b673146dd622b9661
[server:get] ok:true
[server:AddPackage] Started
[RepositoryClient:Open] Proxy configured
[File:Open] Package: grpcio [1.23.0]
[file:Filename] /server-cache/grpcio-1.23.0-cp37-cp37m-manylinux1_x86_64.whl
[RepositoryClient:Open] Proxy contains package
[hash:SHA256] Warning: calculation exhausts reader
[server:add] Started
[server:add] OK
[server:GetInclusionProof] Started
[server:GetPackage] Started
[prove] Not yet implemented.
[server:get] Started
[server:get] Leaf hash: 39ed65593730f8cd30420a06b183c0b7096cb03158b389bc2caadd43b9977000
[server:get] ok:true
[server:AddPackage] Started
[RepositoryClient:Open] Proxy configured
[File:Open] Package: opencensus [0.7.2]
[file:Filename] /server-cache/opencensus-0.7.2-py2.py3-none-any.whl
[RepositoryClient:Open] Proxy contains package
[hash:SHA256] Warning: calculation exhausts reader
[server:add] Started
[server:add] OK
[server:GetPackage] Started
[server:get] Started
[server:get] Leaf hash: 582e6a4c2559c965aebc66eb83d567c41a1914898e17b9f2398e4c965924af5c
[server:GetInclusionProof] Started
[prove] Not yet implemented.
[server:get] ok:true
[server:AddPackage] Started
[RepositoryClient:Open] Proxy configured
[File:Open] Package: prometheus_client [0.7.1]
[file:Filename] /server-cache/prometheus_client-0.7.1.tar.gz
[RepositoryClient:Open] Proxy contains package
[hash:SHA256] Warning: calculation exhausts reader
[server:add] Started
[server:add] OK
[server:GetPackage] Started
[server:get] Started
[server:get] Leaf hash: 0ee9c6d94668eaf344010bd74d5c762a30f73e24165433552a278821c22e2918
[server:GetInclusionProof] Started
[prove] Not yet implemented.
[server:get] ok:true
Not that it’s good practice to define APIs on-the-fly, but I realize now that my API is incorrect. I should make the SHA-256 hash an optional property. I think the hash is an intrinsic property of a package; a specific package’s hash will never change. But this would be clearer if defined in the proto.
More importantly, I think I need only a “check” (currently called “get”) method. The “add” method is redundant since the server will always need to have its own (trusted) replica of the package. When the client gets a package, the server will, if necessary, pull the package from its configured repo (or proxy) and return its computation of the package’s hash. If the client includes its calculation of the hash, the server can verify whether it’s correct.
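A single “check” method with an optional hash might look something like this. It is a minimal sketch in Python rather than the proto definition itself; `Package`, `CheckResult`, `check` and the `fetch` callback are all hypothetical names, with `fetch` standing in for however the server obtains its own copy (cache, proxy or PyPi).

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Package:
    name: str
    version: str
    filename: str
    sha256: Optional[str] = None  # optional: the client's own calculation


@dataclass(frozen=True)
class CheckResult:
    sha256: str              # the server's computation
    matches: Optional[bool]  # None when the client supplied no hash


def check(pkg: Package, fetch: Callable[[Package], bytes]) -> CheckResult:
    """Hash the server's own copy and, if given, compare the client's hash."""
    data = fetch(pkg)  # server pulls the package for itself
    server_hash = hashlib.sha256(data).hexdigest()
    if pkg.sha256 is None:
        return CheckResult(server_hash, None)
    return CheckResult(server_hash, pkg.sha256.lower() == server_hash)
```

Because the server always hashes its own replica, a caller that omits the hash still gets a trusted value back, and one that includes it also learns whether the two agree.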
Thinking.