lordofgibbons a day ago

This seems like a solution looking for a problem. Can't you just share your model's hash when releasing it? This is exactly what happens when someone like Mistral shares a magnet link to their model. It's just a hash.

  • vlovich123 a day ago

    That’s exactly what this is:

    > Finally, the statement itself contains subjects, which are a list of (file path, digest) pairs, a predicate type set to https://model_signing/signature/v1.0, and a dictionary of predicates. The idea is to use the predicates to store (and therefore sign) model card information in the future. The verification part reads the sigstore bundle file and firstly verifies that the signature is valid, and secondly computes the model's file hashes again to compare against the signed ones.

    It’s important to remember that these models tend to be released as multiple files so a single hash is insufficient (unless you do a hash of hashes).
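
    As a rough illustration of the hash-of-hashes idea (a sketch using hashlib, not the actual model_signing implementation), you digest each file and then digest the sorted (path, digest) pairs:

        import hashlib
        from pathlib import Path

        def file_digest(path: Path) -> str:
            # Stream the file so large model shards don't land in memory.
            h = hashlib.sha256()
            with path.open("rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            return h.hexdigest()

        def model_digest(model_dir: Path) -> str:
            # Hash of hashes: digest the sorted (relative path, digest)
            # pairs so the result is deterministic across filesystems.
            h = hashlib.sha256()
            for p in sorted(model_dir.rglob("*")):
                if p.is_file():
                    rel = p.relative_to(model_dir).as_posix()
                    h.update(f"{rel}:{file_digest(p)}\n".encode())
            return h.hexdigest()

    Sorting the paths matters: without it, two machines walking the same directory in different orders would produce different top-level digests.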

    • selfhoster11 5 hours ago

      There is established precedent among open source projects that need to authenticate multiple bundled files within a release: a xxxxSUMS file followed by a detached GPG signature file.

      For example, Ubuntu does it like this:

      - The SHA256SUMS file, which lists the hashes of each ISO, manifest, netboot, etc. file: https://releases.ubuntu.com/24.04.2/SHA256SUMS. This can be verified on any Linux system with the standard hashing utilities installed, but by itself it is not sufficient to protect file integrity.

      - The SHA256SUMS.gpg file, which contains a detached GPG/PGP signature of the SHA256SUMS file: https://releases.ubuntu.com/24.04.2/SHA256SUMS.gpg. The signature is tied to a particular GPG key ID (in this case, 843938DF228D22F7B3742BC0D94AA3F0EFE21092). If the SHA256SUMS file's detached signature is valid and comes from the correct key ID, you've verified that the files weren't modified in transit or by a mirror.

      This scheme only covers a single, flat directory (which is enough for many open-source projects). If you have nested directories, it's time to distribute the model as an archive (in which case you just sign the archive).
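
      For completeness, here's a minimal sketch of that two-step check in Python (assuming gpg is on PATH and the release's signing key is already imported into the local keyring):

          import hashlib
          import subprocess
          from pathlib import Path

          def verify_release(directory: Path) -> None:
              sums = directory / "SHA256SUMS"
              # Step 1: check the detached signature over SHA256SUMS itself.
              subprocess.run(
                  ["gpg", "--verify", f"{sums}.gpg", str(sums)], check=True
              )
              # Step 2: recompute each listed hash and compare.
              for line in sums.read_text().splitlines():
                  expected, name = line.split(maxsplit=1)
                  h = hashlib.sha256()
                  # A leading '*' in the name just marks binary mode.
                  with (directory / name.lstrip("*")).open("rb") as f:
                      for chunk in iter(lambda: f.read(1 << 20), b""):
                          h.update(chunk)
                  if h.hexdigest() != expected:
                      raise ValueError(f"hash mismatch for {name}")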

jasonmorton a day ago

This lets you verify the signature on the model. It won’t help you show that a given decision actually came from that model. If you want to verify the inferences a model makes, check out https://github.com/zkonduit/ezkl (our project).

sbszllr 17 hours ago

Disclosure: I have a relationship with OpenSSF but am not directly involved with this work. I'm involved in a "competing" standard.

As other commenters pointed out, this is "just" a signature. However, in the absence of standardised checks, it is a useful intermediate way of addressing the integrity issue in the ML supply chain as it stands today.

Eventually, you want to move to more complete solutions with more elaborate checks, e.g. provenance of the data that went into the model, or attested training. C2PA is trying to cover this.

Inference-time attestation (which some other commenters are pointing out) -- how can I verify that a response Y actually came from model F on my data X, i.e. Y = F(X) -- is a related but distinct problem.
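
To make the shape of that problem concrete, here is a toy sketch (my own illustration, not any existing standard, using the 'cryptography' package): an attested serving endpoint signs the tuple (model digest, input digest, output digest), so a client can at least check that a response is bound to a specific model. The hard part -- trusting the signing key -- is exactly what enclave attestation or ZK proofs are for.

    import hashlib
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

    def receipt(model_digest: str, x: bytes, y: bytes, key: Ed25519PrivateKey):
        # Bind the response to the model and the input it was computed on.
        # In a real deployment the key would live inside an attested enclave.
        msg = b"|".join([
            model_digest.encode(),
            hashlib.sha256(x).hexdigest().encode(),
            hashlib.sha256(y).hexdigest().encode(),
        ])
        return msg, key.sign(msg)

    # Client side: verify against the server's public key and the model
    # digest published (and sigstore-signed) by the model author.
    key = Ed25519PrivateKey.generate()
    msg, sig = receipt("abc123", b"input", b"output", key)
    key.public_key().verify(sig, msg)  # raises InvalidSignature on tamper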

lrvick 8 hours ago

Signing models is a start, but not enough.

We need remote models hosted in enclaves with remote attestation and end-to-end cryptography for inference. Then you can prove client-side that an output from a model was private and delivered directly, without tampering by advertisers, censors, or propagandists.

clacker-o-matic a day ago

Personally (I know practically nothing about signing lol) I’m wondering how many actual users are going to use this. I kind of wonder if it’s gonna end up being kind of like a hash. Or is this going to be integrated into model software?

vishnudeva 21 hours ago

This is amazing to see from Sigstore! Looking forward to more ML-specific features in the coming months!

Also looking forward to reading through the SLSA for ML PoC and seeing how it evolves. I was planning to use Witness for model training but wasn't sure how it would work for such a long and intensive process.

mountainriver a day ago

Is this a problem today?

  • anshumankmr 18 hours ago

    Could be. Let's say you deploy a version of a model that was trained by bad actors to give wrong outputs; without the hashing technique, you'd have no way to verify that you're running the model you think you are.