I don't find the cons all that compelling to be honest, or at least I think they warrant further discussion to see if there are workarounds (e.g. a choice of compression scheme for a library like typescript, if they would prefer faster publishes).
It would have been interesting to see what eventually played out if the author hadn't closed the RFC themselves. It could have been the sort of thing that eventually happens after 2 years, but then quietly makes everybody's lives better.
"I don't find the cons all that compelling to be honest"
This is a solid example of how things change at scale. Concerns I wouldn't even think about for my personal website become things I need to think about for a download site being hit by 50,000 of my customers, and become big deals when operating at the scale of npm.
You'll find those arguments to be the pointless nitpicking of entrenched interests who just don't want to make any changes, until you experience your very own "oh man, I really thought this change was perfectly safe and now my entire customer base is trashed" moment. Then suddenly things like "hey, we need to consider how this affects old signatures and the speed of decompression, and just generally whether this is worth the non-zero risks for what are in the end not really that substantial benefits" stop sounding like nitpicking.
I do not say this as the wise Zen guru sitting cross-legged and meditating from a position of being above it all; I say it looking at my own battle scars from the Perfectly Safe things I've pushed out to my customer base, only to discover some tiny little nit caused me trouble. Fortunately I haven't caused any true catastrophes, but that's as much luck as skill.
Attaining the proper balance between moving forward even though it incurs risk and just not changing things that are working is the hardest part of being a software maintainer, because both extremes are definitely bad. Everyone tends to start out in the former situation, but then when they are inevitably bitten it is important not to overcorrect into terrified fear of ever changing anything.
Yes and no. If I'm paying $5 a month for storage, I probably don't care about saving 5% of my storage costs. If I'm paying $50,000/month in storage costs, 5% savings is a lot more worthwhile to pursue
Doesn't npm belong to Microsoft? It must be hosted in Azure which they own so they must be paying a rock bottom rate for storage, bandwidth, everything.
Maybe, maybe not. If you are on a bandwidth-limited connection and you have a bunch of NPM packages to install, 5% of an hour is a few minutes saved. It's likely more than that, because long transfers often need to be restarted.
Those lunches could add up to something significant over time. If you're paying $10 per lunch for 10 years, that's $36,500 which is pretty comparable to the cost of a car.
- Doing 1 hour of effort to save 5% on your $20 lunch is foolhardy for most people. $1/hr is well below US minimum wage.
- Doing 1 hour of effort to save 5% on your $50k car is wise. $2500/hr is well above what most people are making at work.
It's not about whether the $2500 affects my ability to buy the car. It's about whether the time it takes me to save that 5% ends up being worthwhile to me given the actual amount saved.
The question is really "given the person-hours it takes to apply the savings, and the real value of the savings, is the savings worth the person-hours spent?"
If you can get the exact same result for less cost (time and money), why not? Things like enjoyment don't factor in since they can't be directly converted into money.
Why do so many people take illustrative examples literally?
I'm sure you can use your imagination to substitute "lunch" and "car" with other examples where the absolute change makes a difference despite the percent change being the same.
Even taking it literally... The 5% might not tip the scale of whether or not I can purchase the car, but I'll spend a few hours of my time comparing prices at different dealers to save $2500. Most people would consider it dumb if you didn't shop around when making a large purchase.
On the other hand, I'm not going to spend a few hours of my time at lunch so that I can save an extra $1 on a meal.
You'd keep 5c. A significant number of people who find sums up around $2500 give it back unconditionally, with no expectation of reward. Whoever lost $2500 is having a really bad day.
In a great example of the Pareto Principle (80/20), or actually even more extreme, let's only apply this Zopfli optimization if the package download total is equal or more than 1GiB (from the Weekly Traffic in GiB column of the Top 5000 Weekly by Traffic tab of the Google Sheets file from the reddit post).
For reference, total bandwidth used by all 5000 packages is 4_752_397 GiB.
Packages >= 1GiB bandwidth/week - That turns out to be 437 packages (there's a header row, so it's rows 2-438) which uses 4_205_510 GiB.
So 88% of the top 5000 bandwidth is consumed by downloading the top 8.7% (437) packages.
5% is about 210 TiB.
Limiting to the top 100 packages by bandwidth results in 3_217_584 GiB, which is 68% of total bandwidth used by 2% of the total packages.
How often are individuals publishing to NPM? Once a day at most, more typically once a week or month? A few dozen seconds of one person's day every month isn't a terrible trade-off.
Even that's addressable though if there's motivation, since something like transcoding server side during publication just for popular packages would probably get 80% of the benefit with no client-side increase in publication time.
In some scenarios the equation flips, and the enterprise is looking for _more_ scale.
The more bandwidth that Cloudflare needs, the more leverage they have at the peering table. As GitHub's largest repo (the @types / DefinitelyTyped repo owned by Microsoft) gets larger, the more experience the owner of GitHub (also Microsoft) gets in hosting the world's largest git repos.
I would say this qualifies as one of those cases, as npmjs is hosted on Azure. The more resources that NPM needs, the more Microsoft can build towards parity with AWS's footprint.
I'm saying you probably don't find them compelling because from your point of view, the problems don't look important. They don't look important from my point of view either. But my point of view is the wrong point of view. From their point of view, this would be plenty to make me think twice, and several more times past that, before changing something so deeply fundamental to the system for a benefit that nobody who is actually paying the price for the package size seems to be particularly enthusiastic about. If the people paying the bandwidth bill aren't even that excited about a 5% reduction, then the cost/benefit analysis tips over into essentially "zero benefit, non-zero cost", and that's not very compelling.
Or you're not understanding how he meant it: there are countless ways to roll out such changes, and a hard, immediate change is likely a very bad idea, as you've correctly pointed out.
But it is possible to do it more gradually, i.e. by sneaking it in with a new API that's used by newer npm versions, or similar.
But it was his choice to make, and it's fine that he didn't see enough value in pursuing such a tiny file size change.
I agree, going from 1 second to 2.5 minutes is a huge negative change, in my opinion. I know publishing a package isn't something you do 10x a day but it's probably a big enough change that, were I doing it, I'd think the publish process is hanging and keep retrying it.
Since it's backwards compatible, individual maintainers could enable it in their own pipeline if they don't have issues with the slowdown. It sounds like it could be a single flag in the publish command.
Probably not worth the added complexity, but in theory, the package could be published immediately with the existing compression and then in the background, replaced with the Zopfli-compressed version.
> Probably not worth the added complexity, but in theory, the package could be published immediately with the existing compression and then in the background, replaced with the Zopfli-compressed version.
Checksum matters aside, wouldn't that turn the 5% bandwidth savings into an almost double bandwidth increase though?
IMHO, considering the complexity to even make it a build time option, the author made the right call.
I don't think that's actually a problem, but it would require continuing to host both versions (at distinct URLs) for any users who may have installed the package before the Zopfli-compressed version completed. Although I think you could also get around this by tracking whether the newly-released package was ever served by the API. If not, which is probably the common case, the old gzip-compressed version could be deleted.
The pros aren't all that compelling either. The npm repo is the only group that this would really be remotely significant for, and there seemed to be no interest. So it doesn't take much of a con to nix a solution to a non-problem.
Every single download, until the end of time, is affected: it speeds up the servers, speeds up the updates, saves disk space on the update servers, and saves on bandwidth costs and usage.
Everyone benefits. The only cost is an ultra-microscopic amount of time on the publishing end and a tiny cost on the client end, and for a very significant number of users, time and money saved. The examples of compression here...
Plus a few years of a compression expert writing a JS implementation of what was likely some very cursed C. And someone auditing its security. And someone maintaining it.
I felt the same. The proposal wasn't rejected! Also, performance gains go beyond user stories - e.g. they reduce infra costs and environmental impact - so I think the main concerns of the maintainers could have been addressed.
They soft-rejected by requiring more validation than was reasonable. I see this all the time. "But did you consider <extremely unlikely issue>? Please go and run more tests."
It's pretty clear that the people making the decision didn't actually care about the bandwidth savings, otherwise they would have put the work in themselves to do this, e.g. by requiring Zopfli for popular packages. I doubt Microsoft cares if it takes an extra 2 minutes to publish Typescript.
Kind of a wild decision considering NPM uses 4.5 PB of traffic per week. 5% of that is 225 TB/week, which according to my brief checks costs around $10k/week!
I guess this is a "not my money" problem fundamentally.
This doesn't seem quite correct to me. They weren't asking for "more validation than was reasonable". They were asking for literally any proof that users would benefit from the proposal. That seems like an entirely reasonable thing to ask before changing the way every single NPM package gets published, ever.
I do agree that 10k/week is non-negligible. Perhaps that means the people responsible for the 10k weren't in the room?
Massively increase the open source GitHub Actions bill for runners running longer (compute is generally more expensive) to publish, for a small decrease in network traffic (bandwidth is cheap at scale)?
> I don't find the cons all that compelling to be honest
I found it reasonable.
The 5% improvement was balanced against the cons of increased CLI complexity, lack of a native JS Zopfli implementation, and slower compression... and 5% just wasn't worth it at the moment - and I agree.
>or at least I think they warrant further discussion
Yes, but there’s a difference between “this warrants further discussion” and “this warrants further discussion and I’m closing the RFC”. The latter all but guarantees that no further discussion will take place.
> I don't find the cons all that compelling to be honest, or at least I think they warrant further discussion
It needs a novel JS port of a C compression library, which will be wired into a heavily-used and public-facing toolchain, and is something that will ruin a significant number of peoples' days if it breaks.
For me, that kind of ask needs a compelling use case from the start.
We wouldn't have to worry about over-the-wire package size if the modern DevOps approach wasn't "nuke everything, download from the Internet" every build.
Back in my Java days, most even small-time dev shops had a local Maven registry that would pass through and cache the big ones. A CI job, even if the "container" was nuked before each build, would create maybe a few kilobytes of Internet traffic, possibly none at all.
Now your average CI job spins up a fresh VM or container, pulls a Docker base image, apt installs a bunch of system dependencies, pip/npm/... installs a bunch of project dependencies, packages things up and pushes the image to the Docker registry. No Docker layer caching because it's fresh VM, no package manager caching because it's a fresh container, no object caching because...you get the idea....
Even if we accept that the benefits of the "clean slate every time" approach outweigh the gross inefficiency, why aren't we at least doing basic HTTP caching? I guess ingress is cheap and the egress on the other side is "someone else's money".
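Even without a full pull-through registry, a persistent cache directory mounted into the runner gets you part of the way. A rough sketch of what I mean, assuming `/ci-cache` is a volume that survives between builds (the paths and registry name here are made up):

    # assuming /ci-cache is a volume that outlives the build container
    npm ci --cache /ci-cache/npm --prefer-offline
    pip install --cache-dir /ci-cache/pip -r requirements.txt
    docker build --cache-from registry.example.com/app:latest -t registry.example.com/app:latest .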
After reading the article, this comment and the comment thread further down on pnpm[1], it feels to me like the NPM team are doing everyone a disservice by ignoring the inefficiencies in the packaging system. It may not be deliberate or malicious but they could easily have provided better solutions than the one proposed in the article which, in my opinion is a band-aid solution at best. The real fix would be to implement what you mention here: local registry and caching, and/or symlinking a la pnpm.
Lots of places use a cache like Artifactory so they don't get slammed with costs, and are resilient to network outages and dependency builds vanishing.
Last I checked npm packages were full of garbage including non-source code. There's no reason for node_modules to be as big as it usually is, text compresses extremely well. It's just general sloppiness endemic to the JavaScript ecosystem.
I don't know why, but clipboard libraries tend to be really poorly implemented, especially in scripting languages.
I just checked out clipboardy and all they do is dispatch binaries from the path and hope it's the right one (or if it's even there at all). I think I had a similar experience with Python and Lua scripts. There's an unfunny amount of poorly-written one-off clipboard scripts out there just waiting to be exploited.
I'm only glad that the go-to clipboard library in Rust (arboard) seems solid.
That's on the package publishers, not NPM. They give you an `.npmignore` that's trivially filled out to ensure your package isn't full of garbage, so if someone doesn't bother using that: that's on them, not NPM.
(And it's also a little on the folks who install dependencies: if the cruft in a specific library bothers you, hit up the repo and file an issue (or even MR/PR) to get that .npmignore file filled out. I've helped folks reduce their packages by 50+MB in some cases, it's worth your own time as much as it is theirs)
It's much better to allowlist the files meant to be published using `files` in package.json because you never know what garbage the user has in their folder at the time of publish.
On a typical project with a build step, only a `dist` folder would be published.
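For anyone who hasn't used it, a minimal sketch of that allowlist, assuming the build output lands in `dist`:

    {
      "name": "my-lib",
      "version": "1.0.0",
      "main": "dist/index.js",
      "files": ["dist"]
    }

npm always includes package.json, the README, and the LICENSE regardless of `files`, so those don't need to be listed.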
Not a fan of that one myself (it's far easier to tell what doesn't belong in a package vs. what does belong in a package) but that option does exist, so as a maintainer you really have no excuse, and as a user you have multiple MR/PRs that you can file to help them fix their cruft.
> On a typical project with a build step, only a `dist` folder would be published.
Sort of, but always include your docs (readme, changelog, license, and whatever true docs dir you have, if you have one). No one should need a connection for those.
Yep, I wrote a script that starts at a root `node_modules` folder and iterates through to remove anything not required (dotfiles, Dockerfile, .md files, etc) - in one of our smaller apps this removes about 25MB of fluff; some packages have up to 60-70MB of crap removed.
One of the things I like about node_modules is that it's not purely source code and it's not purely build artifacts.
You can read the code and you can usually read the actual README/docs/tests of the package instead of having to find it online. And you can usually edit library code for debugging purposes.
If node_modules is taking up a lot of space across a bunch of old projects, just write the `find` script that recursively deletes them all; You can always run `npm install` in the future when you need to work on that project again.
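Something along these lines does it (the usual `find -prune` idiom; run the `du` version first if you want to see how much you're reclaiming):

    # see how much space old node_modules dirs are taking
    find . -name node_modules -type d -prune -exec du -sh {} +
    # then nuke them; `npm install` recreates them when needed
    find . -name node_modules -type d -prune -exec rm -rf {} +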
As someone who mostly works in Java it continues to floor me that this isn’t the default. Why does every project I work on need an identical copy of possibly hundreds of packages if they’re the same version?
I also like Yarn PnP's model of leaving node_modules as zip files. CPUs are way faster than storage; they can decompress on the fly. Less disk space at rest, less disk slack, less filesystem bookkeeping.
Every single filesystem is way faster at dealing with one file than dozens/hundreds. Now multiply that by the hundreds of packages a project depends on, and it adds up.
You'll also see the benefit when `rm -rf`ing a `node_modules` and re-installing, as pnpm still has a local copy that it can re-link after validating its integrity.
Props to anyone who tries to make the world a better place.
It's not always obvious who has the most important use cases. In the case of NPM they are prioritizing the user experience of module authors. I totally see how this change would be great for module consumers, yet create potentially massive inconvenience for module authors.
I think "massive" is overstating it. I don't think deploying a new version of a package is something that happens many times a day, so it wouldn't be a constant pain point.
Also, since this is a case of having something compressed once and decompressed potentially thousands of times, it seems like the perfect tool for the job.
Module authors generally have fairly large test suites which are run often - sometimes on each file save. If you have a 1 or 2 second build script it's not a huge deal. If that script starts taking 30-60 seconds, you have just hosed productivity. Also you have massively increased the load on your CI server - possibly bumping you out of a free tier.
The fix would then have to be some variation of:
a) Stop testing (so often)
b) Stop bundling before testing
c) Publish to a different package manager
- all of which would affect the overall quality and quantity of npm modules.
You have to test that your bundle _actually works_, especially if you are using non-standard compression.
But yes- you could bundle less, but that would be a disadvantage, particularly if a bundle suddenly fails and you don’t know which change caused it. But maybe that’s not a big deal for your use case.
A few people have mentioned the environmental angle, but I'd care more about if/how much this slows down decompression on the client. Compressing React 20x slower once is one thing, but 50 million decompressions being even 1% slower is likely net more energy intensive, even accounting for the saved energy transmitting 4-5% fewer bits on the wire.
It's very likely zero or positive impact on the decompression side of things.
Starting with smaller data means everything ends up smaller. It's the same decompression algorithm in all cases, so it's not some special / unoptimized branch of code. It's yielding the same data in the end, so writes equal out plus or minus disk queue fullness and power cycles. It's _maybe_ better for RAM and CPU because more data fits in cache, so less memory is used and the compute is idle less often.
It's relatively easy to test decompression efficiency if you think CPU time is a good proxy for energy usage: go find something like React and test the decomp time of gzip -9 vs zopfli. Or even better, find something similar but much bigger so you can see the delta and it's not lost in rounding errors.
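Since Zopfli emits ordinary gzip streams, the comparison is easy to run yourself; roughly this, assuming the `zopfli` CLI is installed and `react.tar` is whatever tarball you want to test:

    gzip -9 -c react.tar > react-gzip.tar.gz
    zopfli --i15 -c react.tar > react-zopfli.tar.gz
    ls -l react-*.tar.gz                          # compare compressed sizes
    time gunzip -c react-gzip.tar.gz   > /dev/null
    time gunzip -c react-zopfli.tar.gz > /dev/null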
I can speak to this - there is no meaningful decompression effect across an insane set of tested data at Google and elsewhere. Zopfli was invented prior to Brotli.
Zopfli is easiest to think of as something that just tries harder than gzip to find matches and better encodings.
Much harder.
decompression speed is linear either way.
It's easiest to think of decompression as a linear time vm executor[1], where the bytecoded instructions are basically
go back <distance> bytes, output the next <length> bytes you see, then output character <c>
(outputting literal data is the instruction <0,0,{character to output}>)
Assuming you did not output a file larger than the original uncompressed file (why would you bother?), you will, worst case, process N bytes during decompression, where N is the size of the original input file.
The practical decompression speed is driven by cache behavior, but it thrashes the cache no matter what.
In practice, reduction of size vs gzip occurs by either finding larger runs, or encodings that are smaller than the existing ones.
After all, if you want the compressed file to shrink, you need to output fewer instructions somehow, or make more of the instructions identical (so they can be represented in fewer bits by later Huffman coding).
In practice, this has almost exclusively positive effects on decompression speed - either the VM has fewer things to process (which is faster), or more of the things it processes look the same (which has better cache behavior).
[1] this is one way archive formats will sometimes choose to deal with multiple compression method support - encode them all to the same kind of bytecode (usually some form of copy + literal instruction set), and then decoding is the same for all of them. ~all compression algorithms output some bytecode like the above on their own already, so it's not a lot of work.
This doesn't help you support other archive formats, but if you want to have a bunch of per-file compression options that you pick from based on what works best, this enables you to still only have to have one decoder.
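A toy version of that "VM" view in JavaScript, just to make the model concrete - this is the conceptual <distance, length, literal> form described above, not the real deflate bit format:

    // Each instruction: [distance, length, literal].
    // distance=0, length=0 means "just emit the literal".
    function decode(instructions) {
      const out = [];
      for (const [distance, length, literal] of instructions) {
        // copy <length> bytes starting <distance> bytes back (copies may overlap)
        for (let i = 0; i < length; i++) {
          out.push(out[out.length - distance]);
        }
        if (literal !== null) out.push(literal);
      }
      return out;
    }

    // "abcabcabcd": emit "abc" literally, then copy 6 bytes from 3 back, then "d"
    const program = [[0, 0, 'a'], [0, 0, 'b'], [0, 0, 'c'], [3, 6, 'd']];
    console.log(decode(program).join('')); // abcabcabcd

Total work is bounded by the size of the decoded output, which is why a smaller compressed file doesn't make decompression slower.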
For formats like deflate, decompression time doesn't generally depend on compressed size. (zstd is similar, though memory use can depend on the compression level used).
This means an optimization like this is virtually guaranteed to be a net positive on the receiving end, since you always save a bit of time/energy when downloading a smaller compressed file.
This seems like a place where the more ambitious version that switches to zstd might have better tradeoffs. You would get similar or better compression, with faster decompression and recompression than zopfli. It would lose backward compatibility though...
Not necessarily - you could retain backward compat by publishing both gzip and zstd variants and having downloaders with newer npm versions prefer the zstd one. Over time, you could require that packages only upload zstd going forward, and either generate zstd versions of the backlog of unmaintained packages or at least of those that see some amount of traffic over some time period, if you're willing to drop very old packages. The ability to install arbitrary versions of packages probably means you're better off reprocessing the backlog, although that may cost more than doing nothing.
The package lock checksum is probably a more solvable issue with some coordination.
The benefit of doing this though is less immediate - it will take a few years to show payoff and these kinds of payoffs are not typically made by the kind of committee decisions process described (for better or worse).
Brotli and lzo1b have good compression ratios and pretty fast decompression speeds. Compression speed should not matter that much, since you only do it once.
That's a much higher hurdle to jump. I don't blame the author for trying this first.
If accepted, it might have been a good stepping stone too. A chance to get to know everyone and their concerns and how they think.
So if you wanted to see how this works (proposal + in prod) and then come back later proposing something bigger by switching off zip that would make sense to me as a possible follow up.
Years back I came to the conclusion that conda using bzip2 for compression was a big mistake.
Back then if you wanted to use a particular neural network it was meant for a certain version of Tensorflow which expected you to have a certain version of the CUDA libs.
If you had to work with multiple models the "normal" way to do things was use the developer unfriendly [1][2] installers from NVIDIA to install a single version of the libs at a time.
Turned out you could have many versions of CUDA installed as long as you kept them in different directories and set the library path accordingly, it made sense to pack them up for conda and install them together with everything else.
But oh boy was it slow to unpack those bzip2 packages! Since conda had good caching, if you build environments often at all you could be paying more in decompress time than you pay in compression time.
If you were building a new system today you'd probably use zstd since it beats gzip on both speed and compression.
[1] click... click... click...
[2] like they're really going to do something useful with my email address
>But oh boy was it slow to unpack those bzip2 packages! Since conda had good caching, if you build environments often at all you could be paying more in decompress time than you pay in compression time.
For Paper, I'm planning to cache both the wheel archives (so that they're available without recompressing on demand) and unpacked versions (installing into new environments will generally use hard links to the unpacked cache, where possible).
> If you were building a new system today you'd probably use zstd since it beats gzip on both speed and compression.
FWIW, in my testing LZMA is a big win (and I'm sure zstd would be as well, but LZMA has standard library support already). But there are serious roadblocks to adopting a change like that in the Python ecosystem. This sort of idea puts them several layers deep in meta-discussion - see for example https://discuss.python.org/t/pep-777-how-to-re-invent-the-wh... . In general, progress on Python packaging gets stuck in a double-bind: try to change too little and you won't get any buy-in that it's worthwhile, but try to change too much and everyone will freak out about backwards compatibility.
I designed a system which was a lot like uv but written in Python and when I looked at the politics I decided not to go forward with it. (My system also had the problem that it had to be isolated from other Pythons so it would not get its environment trashed, with the ability for software developers to trash their environment I wasn't sure it was a problem that could be 100% solved. uv solved it by not being written in Python. Genius!)
Yes, well - if I still had reason to care about the politics I'd be in much the same position, I'm sure. As is, I'm going to just make the thing, write about it, and see who likes it.
Comparing to gzip isn't really worth it. Combine pigz (threaded) with zlib-ng (simd) and you get decent performance. pigz is used in `docker push`.
For example, gzipping llvm.tar (624MB) takes less than a second for me:
$ time /home/harmen/spack/opt/spack/linux-ubuntu24.04-zen2/gcc-13.2.0/pigz-2.8-5ptdjrmudifhjvhb757ym2bzvgtcsoqc/bin/pigz -k hello.tar
real 0m0.779s
user 0m11.126s
sys 0m0.460s
At the same time, zopfli compiled with -O3 -march=native takes 35 minutes. No wonder it's not popular.
It is almost 2700x slower than the state of the art for just 6.8% bytes saved.
In my opinion even the 28x decrease in performance mentioned would be a no-go. Sure the package saves a few bytes but I don't need my entire pc to grind to a halt every time I publish a package.
Besides, storage is cheap but CPU power draw is not. Imagine the additional CO2 that would have to be produced if this RFC was merged.
> 2 gigabytes of bandwidth per year across all installations
This must be a really rough estimate and I am curious how it was calculated. In any case 2 gigabytes over a year is absolutely nothing. Just my home network can produce a terabyte a day.
Because the author's mentioned package, Helmet[1], is 103KB uncompressed and has had 132 versions in 13 years. Meaning downloading every Helmet version uncompressed would result in 132 * 103KB ≈ 13.6MB.
I feel like I must be missing something really obvious.
Congrats on a great write-up. Sometimes trying to ship something at that sorta scale turns out to just not really make sense in a way that is hard to see at the beginning.
Another personal win is that you got a very thorough understanding of the people involved and how the outreach parts of the RFC process works. I've also had a few fail, but I've also had a few pass! Always easier to do the next time
Pulling on this thread, there are a few people who have looked at the ways zopfli is inefficient. Including this guy who forked it, and tried to contribute a couple improvements back to master:
These days if you’re going to iterate on a solution you’d better make it multithreaded. We have laptops where sequential code uses 8% of the available cpu.
> These days if you’re going to iterate on a solution you’d better make it multithreaded.
Repetition eliminating compression tends to be inherently sequential. You'd probably need to change the file format to support chunks (or multiple streams) to do so.
Because of LZ back references, you can't LZ compress different chunks separately on different cores and have only one compression stream.
Statistics acquisition (histograms) and entropy coding could be parallel I guess.
(Not a compression guru, so take above with a pinch of salt.)
There are gzip variants that break the file into blocks and run in parallel. They lose a couple of % by truncating the available history.
But zopfli appears to do a lot of backtracking to find the best permutations for matching runs that have several different solutions. There's a couple of ways you could run those in parallel - some with a lot of coordination overhead, others with a lot of redundant calculation.
I wonder if it would make more sense to pursue Brotli at this point, Node has had it built-in since 10.x so it should be pretty ubiquitous by now. It would require an update to NPM itself though.
This reminds me of a time I lost an argument with John-David Dalton about cleaning up/minifying lodash as an npm dependency, because when including the readme and license for every sub-library, a lodash import came to ~2.5MB at the time. This also took a lot of seeking time for disks because there were so many individual files.
The conversation started and ended at the word cache.
> This also took a lot of seeking time for disks because there were so many individual files.
The fact NPM keeps things in node_modules unzipped seems wild to me. Filesystems are not great at hundreds of thousands of little files. Some are bad, others are terrible.
Zip files are easier to store, take up less space, and CPUs are faster than disks so the decompression in memory is probably faster reading the unzipped files.
That was one of my favorite features of Yarn when I tried it - PnP mode. But since it's not what NPM does, it requires a shim that doesn't work with all packages. Or at least didn't a few years ago.
I'd love to see an effort like like this succeed in the Python ecosystem. Right now, PyPI is dependent upon Fastly to serve files, on the order of >1 petabyte per day. That's a truly massive in-kind donation, compared to the PSF's operating budget (only a few million dollars per year - far smaller than Linux or Mozilla).
I don't see why it wouldn't be possible to hide behind a flag once Node.js supports zopfli natively.
In case of CI/CD, it's totally feasible to just add a --strong-compression flag. In that case, the user expects it to take its time.
TS releases a non-preview version every few months, so using 2.5 minutes for compression would work.
Think of the complexity involved, think of having to fix bugs because something went wrong.
Such effort would be better spent preparing npm/node for a future where packages with a lower-bound npm version constraint can be compressed with zstd or similar.
Every feature you add to a program is complexity, you need to really decide if you want it.
What about a different approach - an optional npm proxy that recompresses popular packages with 7z/etc in the background?
Could verify package integrity by hashing contents rather than archives, plus digital signatures for recompressed versions. Only kicks in for frequently downloaded packages once compression is ready.
Benefits: No npm changes needed, opt-in only, potential for big bandwidth savings on popular packages. Main tradeoff is additional verification steps, but they could be optional given a digital signature approach.
Curious if others see major security holes in this approach?
This felt like the obvious way to do things to me: hash a .tar file, not a .tar.gz file. Use Accept-Encoding to negotiate the compression scheme for transfers. CDN can compress on the fly or optionally cache precompressed files. i.e. just use standard off-the-shelf HTTP features. These days I prefer uncompressed .tar files anyway because ZFS has transparent zstd, so decompressed archive files are generally smaller than a .gz.
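For what it's worth, that's exactly what stock HTTP tooling already does; a sketch with a hypothetical registry URL:

    # hypothetical URL; the .tar is what gets hashed, the wire encoding is negotiated per request
    curl --compressed -o react.tar https://registry.example.com/react/-/react-18.2.0.tar
    sha512sum react.tar   # integrity check runs against the uncompressed tarball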
For security reasons, it's usually better to hash the compressed file, since it reduces the attack surface: the decompressor is not exposed to unverified data. There have already been vulnerabilities in decompressor implementations which can be exploited through malformed compressed data (and this includes IIRC at least one vulnerability in zlib, which is the standard decompressor for .gz).
This suggests one should just upload a tar rather than a compressed file. Makes sense because one can scan the contents for malicious files without risking a decompressor bug.
BTW npm decompresses all packages anyhow because it lets you view the contents these days on its website.
You are correct. They should be uploading and downloading dumb tar files and let the HTTP connection negotiate the compression method. All hashes should be based on the uncompressed raw tar dump. This would be proper separation of concerns.
But npm already decompresses every package because it shows the contents on its website. So yeah it can be malicious but it already has dealt with that risk.
I once created a Maven plugin to recompress Java artifacts with Zopfli. I rewrote Zopfli in Java so it runs entirely in the JVM. This means the speed is worse and it may contain bugs:
I've not done it, but have you considered using `pnpm` and volume-mounting a shared persistent `pnpm-store` into the containers? It seems like you'd get near-instant npm installs that way.
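I haven't benchmarked it, but the rough shape would be something like this (paths are made up):

    # share one pnpm store across throwaway CI containers
    docker run --rm -v /srv/ci/pnpm-store:/pnpm-store -v "$PWD":/app -w /app node:20 \
      sh -c "npm install -g pnpm && pnpm install --frozen-lockfile --store-dir /pnpm-store"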
The only time npm install was on the critical path was hotfixes. It’s definitely worth considering. But I was already deep into doing people giant favors that they didn’t even notice, so I was juggling many other goals. I think the only thank you I got was from the UI lead, who had some soda straw internet connection and this and another thing I did saved him a bunch of hard to recover timeouts.
If we consider overall Internet traffic, which is dominated by video and images, the size of javascript transmitted is negligible.
Savings of memory and CPU for the end user's browser? Maybe, but our software at all layers is so bloated, without need, just carelessly, that I'm not sure working on such javascript savings is useful - any resources saved will be eaten by something else immediately.
For an application developer or operator, the 50% savings of javascript size are probably not worth dealing with some tool that dynamically prunes the app code, raising questions about privacy and security. I thought through the security and privacy questions, but who would want to even spend attention on these considerations?
As I mention in the README at github, a Microsoft researcher investigated the same approach earlier, but Microsoft haven't taken it anywhere.
There was a commercial company offering this approach of javascript minification as a product, complete product.
As well as other optimizations, like image size, etc. Their proxy embedded special "agent" code into the app which inspected the device and reported to server what image sizes are optimal for user, what js functions are invoked. And the server prepared optimized versions of app for various devices. Javascript was "streamed" in batches of only functions needed by the app.
Now the company is dissolved - that's why the website is unavailable. Wikipedia says they were bought out by Akamai - https://en.wikipedia.org/wiki/Instart. But I don't see any traces of this approach in the today's Akamai offerings.
I contacted several companies, in the CDN business and others, trying to interest them in the idea and get very modest funding for a couple more months of my time to work on this (I was working on this at the end of a long break from paid work and was running out of savings). Didn't find anyone ready to take part.
This all may be signs that possibility of such an optimization is not valuable enough for users.
50% size savings isn't important to the people who pay for it. Even a 100% saving (that is, somehow shipping all the functionality in zero bytes) would be worth at most pennies to those paying the bills.
Size savings translates to latency improvements which directly affects conversion rates. Smaller size isn’t about reducing costs but increased revenue. People care.
Note that this proof-of-concept implementation saves latency on first load, but may add latency at surprising points while using the website. Any user invoking a rarely-used function would see a delay before the javascript executes, without the traditional UI affordances (spinners etc) to indicate that the application was waiting on the network. Further, these secretly-slow paths may change from visit to visit. Many users know how to "wait for the app to be ready," but the traditional expectation is that once it's loaded, the page/app will work, and any further delays will be signposted.
I'm sure it works great when you've got high-speed internet, but might break things unacceptably for users on mobile or satellite connections.
Only the first user who hits a rarely used execution point may experience the noticeable latency, if he also has slow internet, etc.
As soon as a user executes a rarely used function, information about this fact is sent to the server, and the server includes this function into the active set to be sent to future users.
In the video I manually initiate re-generation of the "active" and the "rest" scripts, but the most primitive MVP was supposed to schedule re-generation of the scripts when receiving info that some previously unseen functions are executed in browser.
Obviously, if the idea is developed further, the first user's experience may also be improved - a spinner shown. Pre-loading the inactive set of functions in the background may also be considered (pros: it avoids latency for users who invoke rare functionality; cons: we lose the savings in traffic and in browser memory and CPU for compiling the likely unneeded code).
(BTW, further development of the idea includes splitting the code at a finer granularity than just "active" / "inactive". E.g. active in the first 5 seconds after opening the page and loaded immediately, likely to be active soon, active but rarely called - the latter two parts definitely need to be loaded in the background)
> without the traditional UI affordances (spinners etc) to indicate that the application was waiting on the network.
This part is obviously trivially solvable. I think the same basic idea is going to at some point make it but it’ll have to be through explicit annotations first and then there will be tooling to automatically do this for your code based upon historical visits where you get to tune the % of visitors that get additional fetches. Also, you could probably fetch the split off script in the background anyway as a prefetch + download everything rather than just 1 function at a time (or even downloading related groups of functions together)
The idea has lots of merit and you just have to execute it right.
This would cause the bundler to inject a split point & know how to hide that + know what needs bundling and what doesn’t. GWT pioneered almost 20 years ago although not a fan of the syntax they invented to keep everything running within stock Java syntax: https://www.gwtproject.org/doc/latest/DevGuideCodeSplitting....
The tooling producing such annotations based on historical visits would, I suspect, constantly change the annotated set of functions as app versions evolve.
Note, the inactive code parts are very often in 3rd party libraries.
Some are also in your own libraries shared between your different applications. So the same shared library would need one set of split-annotated functions for one app, and another set for another app. So these conflicting sets of annotations cannot live in the library source code simultaneously.
For example the automated system could take your non-split codebase & inject splits at any async function & self-optimize as needed. The manual annotation support would be to get the ecosystem going as an initial step because I suspect targeting auto-optimizing right from the get go may be too big a pill to swallow.
Who? Anyone who is on a slow internet connection or who has a slow device can tell you they don't care. Or maybe they do, but features are far more important. I guess if things are slow on a top of the line device with a fast connection they would care.
A pro could have been an extra narrative about carbon footprint savings.
I'm surprised it hasn't been raised when talking about saving 2TB/year only for React. It represents costs, which don't seem to be an issue, but also computing power and storage. (Even with a higher/longer compute cost due to slower compression, it's done once per version, which isn't really comparable to the number of downloads anyway)
Hard to calculate the exact saving, but it would represent a smaller CO2 footprint.
It only doesn't apply to existing versions of existing packages. Newer releases would apply Zopfli, so over time likely the majority of actively used/maintained packages would be recompressed.
These days technology moves so fast it's hard to keep up. The slowest link in the system is the human being.
That's a strong argument that 'if it isn't broke, don't fix it."
Lots of numbers being thrown around; if you add up tiny things enough times you can get a big number. But is npm package download the thing that's tanking the internet? No? Then this is a second- or third-order optimization.
> Integrating Zopfli into the npm CLI would be difficult.
Is it possible to modify "gzip -9" or zlib to invoke zopfli? This way everyone who wants to compress better will get the extra compression automatically, in addition to npm.
There will be an increase in compression time, but since "gzip -9" is not the default, people preferring compression speed might not be affected.
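Patching gzip/zlib themselves is a harder sell, but since the zopfli CLI already emits standard gzip streams, anything that can shell out can swap it in today; for example:

    zopfli --i15 -c package.tar > package.tar.gz   # zopfli writes a normal gzip stream
    gzip -t package.tar.gz && echo "readable by any existing gzip/zlib"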
I used zopflipng in the past to optimize PNG images. It made sense since there was no better alternative to store lossless image data than the PNG format at the given time in the given environment. Zopfli is awesome when you are locked in on deflate compression. I feel like if the npm folks wanted to optimize for smaller package size, a better strategy would be switching to some more effective compression (e.g. bzip2, xz). That would result in a larger file size reduction than 5% for a smaller CPU time increase compared to Zopfli. You would need to come up with some migration strategy though, as this change isn't per se backwards compatible, but that seems manageable if you are in control of the tooling and infrastructure.
I'll give you an even better idea, but it'll need at least 100 volunteers, maybe 1000. Take each package and rewrite it without external dependencies. That will cut tech debt for that package significantly. Just like how we have @types/xyz, where the DefinitelyTyped folks are busy making typescript packages for everything, let's make a namespace like @efficient/cors, @efficient/jsdom, @efficient/jest etc and eliminate all external dependencies completely for every library on npm.
If OP wanted to shrink npm packages, then npm could introduce two types of npm package - a build package and a source one. This way a lot of packages would be smaller, because source code could be safely distributed through a separate package and not kept in the build package. There are a lot of npm packages that explicitly include the whole git repo in the package, and they do so because there's only one type of package they can use.
> Zopfli is written in C, which presents challenges. Unless it was added to Node core, the CLI would need to (1) rewrite Zopfli in JS, possibly impacting performance (2) rely on a native module, impacting reliability (3) rely on a WebAssembly module. All of these options add complexity.
Wow! Who's going to tell them that V8 is written in C++? :)
It's not about C per-se, as much as each native compiled dependency creates additional maintenance concerns. Changes to hardware/OS can require a recompile or even fixes. NPM build system already requires a JavaScript runtime, so is already handled as part of existing maintenance. The point is that Zopfli either needs to be rewritten for a platform-agnostic abstraction they already support, or else Zopfli will be added to a list of native modules to maintain.
> It's not about C per-se, as much as each native compiled dependency creates additional maintenance concerns. Changes to hardware/OS can require a recompile or even fixes.
This is a canard. zopfli is written in portable C and is far more portable than the nodejs runtime. On any hardware/OS combo that one can build the nodejs runtime, they certainly can also build and run zopfli.
Yes, but it was expected. It's like prioritising code readability over performance everywhere but the hot path.
Earlier in my career, I managed to use Zopfli once to compress gigabytes of PNG assets into a fast in-memory database supporting a 50K+ RPS web page. We wanted to keep it simple and avoid the complexity of horizontal scaling, and it was OK to drop some rarely used images. So the more images we could pack into a single server, the more coverage we had. In that sense Zopfli was beneficial.
I wonder what the average tarball size difference would be if you, for example, downloaded everything in one tarball (the full package list) instead of 1-by-1, as the gzip compression would work way better in that case.
Also for bigger companies this is not really a "big" problem as they usually have in-house proxies (as you cannot rely on a 3rd party repository in CI/CD for multiple reasons (security, audit, speed, etc)).
You might save a little bit by putting similar very small files next to each other in the same tarball, but in general I would not expect significant improvements. Gzip can only compress repetitions that are within 32KB of each other.
The OS should do file deduplication, compression, and decompression faster than npm. But I guess the issue is npm cannot give the OS hints about whether to compress a folder or not?
Even if the file system does it that’s still per-file. You still need all the bookkeeping. And you’re not gonna get great compression if a lot of the files are tiny.
The tar contains a whole bunch of related files that probably compress a lot better together, and it’s only one file on disk for the file system to keep track of. It’s gonna be a lot more efficient.
When compiling you’re probably gonna touch all the source files anyway right? So if you load the zip into memory once and decompress it there it’s probably faster than loading each individual file and decompressing each of them, or just loading each individual file raw.
> And you’re not gonna get great compression if a lot of the files are tiny.
The future of compression is likely shared dictionaries, basically recognize the file type and then use a dictionary that is optimized for that file type. E.g. JavaScript/TypeScript or HTML or Rust or C++, etc. That can offset the problems with small files to a large degree.
But again, this is a complex problem and it should be dealt with orthogonally to npm in my opinion.
I also checked and it should be possible for npm to enable file system compression at the OS level on both Windows and MacOS.
Also if it was dealt with orthogonally to npm, then it could be used by pip, and other package systems.
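zstd already ships the tooling for this kind of shared dictionary; a rough sketch of training and using one on a corpus of JS files (file names are made up):

    # train a dictionary on sample JS files, then compress/decompress small files with it
    zstd --train samples/*.js -o js.dict
    zstd -19 -D js.dict some-module/index.js -o index.js.zst
    zstd -d  -D js.dict index.js.zst -o index.js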
Thank you so much for posting this. The original logic was clear and it had me excited! I believe this is useful because compression is very common and although it might not fit perfectly in this scenario, it could very well be a breakthrough in another. If I come across a framework that could also benefit from this compression algorithm, I'll be sure to give you credit.
5% improvement is basically the minimum I usually consider worthwhile to pursue, but it's still small. Once you get to 10% or 20%, things become much more attractive. I can see how people can go either way on a 5% increase if there are any negative consequences (such as increased build time).
I was under the impression that bzip compresses more than gzip, but gzip is much faster, so gzip is better for things that need to be compressed on the fly, and bzip is better for things that get archived. Is this not true? Wouldn't it have been better to use bzip all along for this purpose?
I wonder if you could get better results if you built a dictionary over the entire npm registry. I suspect most common words could easily be reduced to a 16k-word index. It would be much faster, the dictionary would probably fit in cache, and you could even optimize it in memory for cache prefetch.
This seems like a non-starter to me - new packages are added to npm all the time, and will alter the word frequency distribution. If you aren't prepared to re-build constantly and accept that the dictionary isn't optimal, then it's hard to imagine it being significantly better than what you build with a more naive approach. Basically - why try to fine-tune to a moving target?
But is it really moving that fast? I suspect most fundamental terms in programming and their variations do not change often. You will always have keywords, built-ins and the most popular concepts from libs/frameworks.
So it is basically downloading a few hundred kb dictionary every year ?
It reminds me of an effort to improve docker image format and make it move away from being just a tar file. I can't find links anymore, but it was a pretty clever design, which still couldn't beat dumb tar in efficiency.
Transferring around dumb tar is actually smart because the HTTPS connection can negotiate a compressed version of it to transfer - e.g. gzip, brotli, etc. No need to bake an unchangeable compression format into the standard.
> For example, I tried recompressing the latest version of the typescript package. GNU tar was able to completely compress the archive in about 1.2 seconds on my machine. Zopfli, with just 1 iteration, took 2.5 minutes.
My question of course would be, what about LZ4, or Zstd or Brotli? Or is backwards compatibility strictly necessary? I understand that GZIP is still a good compressor, so those others may not produce meaningful gains. But, as the author suggests, even small gains can produce huge results in bandwidth reduction.
Hashes of the tarballs are recorded in the package-lock.json of downstream dependants, so recompressing the files in place will cause the hashes to change and break everyone. It has to be done at upload time.
The hashes of the uncompressed tarballs would be great. Then the HTTP connection can negotiate a compression format for transfer (which can change over time at HTTP itself changes) rather than baking it into the NPM package standard (which is incredibly inflexible.)
My reading of OP is that it’s less about whether zopfli is technically the best way to achieve a 5% reduction in package size, and more about how that relatively simple proposal interacted with the NPM committee. Do you think something like this would fare better or differently for some reason?
- I don't see the problem with adding a C dependency to a node project, native modules are one of the nicest things about node.js
- Longer publishing time for huge packages (eg. typescript) is a bigger problem but I don't think it should impact a default. Just give the option to use gzip for the few outliers.
> But the cons were substantial:
...
> This wouldn’t retroactively apply to existing packages.
Why is this substantial? My understanding is that packages shouldn't be touched once published. It seems likely for any change to not apply retroactively.
That's not actually so straightforward. You pay the 10-100x slowdown once on the compressing side, to save 4-5% on every download - which for a popular package one would expect downloads to be in the millions.
Most packages don't publish every CI build. Some packages, such as TypeScript, publish a nightly build once a day. Even then a longer compression time doesn't seem too bad.
Caching downloads on a CDN helps offload work from the main server, it doesn't meaningfully change the bandwidth picture from the client's perspective.
Assuming download and decompression cost to be proportional to the size of the incoming compressed stream, it would break even at 2000 downloads. A big assumption I know, but 2000 is a really small number.
As the author himself said, just React was downloaded half a billion times; that is a lot of saved bandwidth on both sides, but especially so for the server.
Maybe it would make sense to only apply this improvement to packages that are a) either very big or b) downloaded at least a million times each year or so. That would cover most of the savings while leaving most packages and developers out of it.
If you want an example of where these things go badly:
the standard compression level for rpms on redhat distros is zstd level 19.
This has all the downsides of other algorithms - it's super slow - often 75-100x slower than the default level of zstd[1]. It achieves a few percent more compression for that speed. compared to even level 10, it's like 0-1% higher compression, but 20x slower.
This is bad enough - the kicker is that at this level, it's slower than xz for ~all cases, and xz is 10% smaller.
The only reason to use zstd is because you want fairly good compression, but fast.
So here, they've chosen to use it in a way that compresses really slowly, but gives you none of the benefit of compressing really slowly.
Now, unlike the npm case, there was no good reason to choose level 19 - there were no backwards compatibility constraints driving it, etc. I went through the PR history on this change, it was not particularly illuminating (it does not seem lots of thought was given to the level choice).
I mention all this because it has a real effect on the experience of building rpms - this is why it takes eons to make kernel debuginfo rpms on fedora. Or any large RPM. Almost all time is spent compressing it with zstd at level 19. On my computer this takes many minutes. If you switch it to use even xz, it will do it about 15-20x faster (single threaded. if you thread both of them, xz will win by even more, because of how slow the setting is for zstd. If you use reasonable settings for zstd, obviously, it achieves gigabytes/second in parallel mode)
Using zopfli would be like choosing level 19 zstd for npm.
While backwards compatibility is certainly painful to deal with here, zopfli is not likely better than doing nothing.
You will make certain cases just insanely slow.
You will save someone's bandwidth, but in exchange you will burn insane amounts of developer CPU.
zopfli is worse than level 19, it can often be 100x-200x slower than gzip -9.
Doesn't npm support insanely relaxed/etc scripting hooks anyway?
If so, if backwards compatibility is your main constraint, you would be "better off" double compressing (IE embed xz or whatever + a bootstrap decompressor and using the hooks to decompress it on old versions of npm). Or just shipping .tar.gz's than, when run, fetch the .xz and decompress it on older npm.
Or you know, fish in the right pond - you would almost certainly achieve much higher reductions by enforcing cleaner shipping packages (IE not including random garbage, etc) than by compressing the garbage more.
[1] on my computer, single threaded level 19 does 7meg/second, the default does 500meg/second. Level 10 does about 130meg/second.
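The shape of those numbers is easy to reproduce on any large tarball, e.g.:

    # single-threaded, keep the input; absolute numbers vary by machine, the ratios between levels less so
    time zstd -3  -T1 -k big.tar -o big-3.tar.zst
    time zstd -10 -T1 -k big.tar -o big-10.tar.zst
    time zstd -19 -T1 -k big.tar -o big-19.tar.zst
    ls -l big-*.tar.zst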
> the standard compression level for rpms on redhat distros is zstd level 19
> The only reason to use zstd is because you want fairly good compression, but fast
I would think having fast decompression is desirable, too, especially for rpms on redhat distros, which get decompressed a lot more often than they get compressed, and where the CPUs doing decompression may be a lot slower than the CPUs doing the compression.
Just two tables for comparison: the first shows only decompression times for the Firefox RPM, the second shows compression time, compressed size, and decompression time for a large RPM.
Let me start by reiterating - zstd is a great option. I think zstd level 5-10 would have been an awesome choice. I love zstd - it is a great algorithm that really hits the sweet spot for most users between really fast and good compression, and very very fast decompression. I use it all the time.
In this case, yes, zstd has faster decompression, but xz decompression is quite fast too, even before you start using threads. I have no idea why their data found it so slow.
Even on large compressed archives, xz decompression times do not go into minutes - even on a mid-range 16-year-old Intel CPU like the one used in this test. Assuming the Red Hat data on this single RPM is correct, I would bet it's more related to some weird buffering or chunking issue in the RPM compressor library usage than to actual xz decompression times. But nobody seems to have bothered to look at why the times seemed ridiculous; they just sort of accepted them as is.
They also based this particular change on the idea that they would get a similar compression ratio to xz - they don't, as I showed.
Anyway, my point really wasn't "use xz", but that choosing zstd level 19 is probably the wrong choice no matter what.
Their own table, which gives data on exactly one RPM, shows that zstd level 15 gave them compression comparable to xz (true for that RPM, wrong in general), at a compression speed similar to xz (also wrong in general; it's much slower than that).
It also showed that level 19 was 3x slower than that for no benefit.
Result: Let's use level 19.
Further, the claim that "Users that build their packages will experience slightly longer build times." is total nonsense - their own table shows this. If you had an RPM that was 1.6 GB but took 5 minutes to build (not uncommon, even at that size, since it's usually assets of some sort), you are now taking 30 minutes, and spending 24 of them on compression.
Before it took ... 3 minutes to do compression.
Calling this "slightly longer build times" is hilarious at best.
I'll make it concrete: their claim is based on building Firefox and compressing the result, and amusingly, even there it's still wrong.
Firefox RPM build times on my machine are about 10-15 minutes. Before it took 3 minutes to compress the RPM. Now it takes 24.
This is not "slightly longer build times". Before it took 30% of the build time to compress the RPM.
Now it takes 24 minutes, or 2.5x the entire build time.
That is many things, but it is not a "slightly longer build time".
I'll just twist the knife a little more:
RPM supports using threading for the compressors, which is quite nice. It even supports basing it on the number of cpus you have set to use for builds. They give examples of how to do it, including for level 19:
/usr/lib/rpm/macros:
# "w19T8.zstdio" zstd level 19 using 8 threads
# "w7T0.zstdio" zstd level 7 using %{getncpus} threads
The table with this single rpm even tested it with threads!
Despite this - they did not turn on threads in the result...
First, there is no data or evidence to suggest this is the case, so I'm not sure why you are trying to make up excuses for them.
Second, zstd is fully deterministic in multithreaded cases.
It does not matter what threading you select, it will output byte for byte identical results.
I believe all of their compressors are similarly deterministic regardless of the number of threads, but I admit I have not checked every one of them under all conditions.
If they had questions, they could have, you know, asked, and would have gotten the same answer.
But that just goes back to what I said - it does not appear this change was particularly well thought out.
But this whole thing sounds too much like work. Finding opportunities, convincing entrenched stakeholders, accommodating irrelevant feedback, pitching in meetings — this is the kind of thing that top engineers get paid a lot of money to do.
For me personally open source is the time to be creative and free. So my tolerance for anything more than review is very low. And I would have quit at the first roadblock.
What’s a little sad is that NPM should not be operating like a company with 1000+ employees. The “persuade us users want this” approach is only going to stop volunteers. They should be proactively identifying efforts like this and helping you bring it across the finish line.
I think that the reason NPM responded this way is because it was a premature optimization.
If/when NPM has a problem - storage costs are too high, or transfer costs are too high, or user feedback indicates that users are unhappy with transfer sizes - then they will be ready to listen to this kind of proposal.
I think their response was completely rational, especially given a potentially huge impact on compute costs and/or publication latency.
I disagree with it being a premature optimisation. Treating everything that you haven't already personally identified as a problem as a premature optimisation is cargo culting in its own way. The attitude of not caring is why npm and various tools are so slow.
That said, I think NPM’s response was totally correct - explain the problem and the tradeoffs. And OP decided the tradeoffs weren’t worth it, which is totally fair.
> What’s a little sad is that NPM should not be operating like a company with 1000+ employees. The “persuade us users want this” approach is only going to stop volunteers. They should be proactively identifying efforts like this and helping you bring it across the finish line.
Says who?
Says an engineer? Says a product person?
NPM is a company with 14 employees, with a system wired into countless extremely niche and weird integrations they cannot control. Many of those integrations might make a professional engineer's hair catch fire - "it should never be done this way!" - but in the real world the wrong way is the majority of the time. There's no guarantee that many of the downloads even come from the official client, just as one example.
The last thing they need, or I want, or any of their customers want, or their 14 employees need, is something that might break backwards compatibility in an extremely niche case, anger a major customer, cause countless support tickets, all for a tiny optimization nobody cares about.
This is something I've learned about HN that, for my own mental health, I now dismiss: Engineers are obsessed with 2% optimizations here, 5% optimizations there; unchecked, it will literally turn into an OCD outlet, all for things nobody in the non-tech world even notices, let alone asks about. Just let it go.
NPM is a webservice. They could package the top 10-15 enhancements and call it V2. When 98% of traffic is V2, turn V1 off. Repeat every 10 years or so until they work their way into having a good protocol.
> Engineers are obsessed with 2% optimizations here, 5% optimizations there; unchecked, it will literally turn into an OCD outlet, all for things nobody in the non-tech world even notices, let alone asks about. Just let it go.
I absolutely disagree with you. If the world took more of those 5% optimisations here and there, everything would be faster. I think more people should look at those 5% optimisations. In many cases they unlock knowledge that results in a 20% speed up later down the line. An example from my past: I was tasked with reducing the running time of a one-shot tool we were using at $JOB. It was taking about 15 minutes to run. I shaved off seconds here and there with some fine-grained optimisations, and tens of seconds with some modernisation of some core libraries. Nothing earth shattering, but improvements nonetheless. One day, I noticed a pattern was repeating and I was fixing an issue for the third time in a different place (searching a gigantic array of stuff for a specific entry). I took a step back and realised that if I replaced the mega list with a hash table it might fix every instance of this issue in our app. It was a massive change, touching pretty much every file. And all of a sudden our 15 minute runtime was under 30 seconds.
People used this tool every day, it was developed by a team of engineers wildly smarter than me. But it had grown and nobody really understood the impact of the growth. When it started that array was 30, 50 entries. On our project it was 300,000 and growing every day.
Not paying attention to these things causes decay and rot. Not every change should be taken, but more people should care.
I prevent cross-site scripting, I monitor for DDoS attacks, emergency database rollbacks, and faulty transaction handlings. The Internet heard of it? Transfers half a petabyte of data every minute. Do you have any idea how that happens? All those YouPorn ones and zeroes streaming directly to your shitty, little smart phone day after day? Every dipshit who shits his pants if he can't get the new dubstep Skrillex remix in under 12 seconds? It's not magic, it's talent and sweat. People like me, ensuring your packets get delivered, un-sniffed. So what do I do? I make sure that one bad config on one key component doesn't bankrupt the entire fucking company. That's what the fuck I do.
Open source needs to operate differently than a company because people don’t have time/money/energy to deal with bullshit.
Hell. Even 15 employees larping as a corporation is going to be inefficient.
What you and NPM are telling us is that they are happy to take free labor, but this is not an open source project.
> Engineers are obsessed with 2% optimizations here
Actually in large products these are incredible finds.
But ok. They should have the leadership to know which bandwidth tradeoffs they are committed to and tell him immediately it’s not what they want, rather than sending him to various gatekeepers.
Correct; NPM is not an "open source project" in the sense of a volunteer-first development model. Neither is Linux - over 80% of commits are corporate, and have been for a decade. Neither is Blender anymore - the Blender Development Fund raking in $3M a year calls the shots. Every successful "large" open source project has outgrown the volunteer community.
> Actually in large products these are incredible finds.
In large products these may indeed be incredible finds; but breaking compatibility for just 0.1% of your customers is also an incredible disaster.
But NPM has no proof their dashboard won't light up with corporate customers panicking the moment it goes to production - because, say, their hardcoded integration that has AWS download packages, decompress them with a Lambda, and push them to an S3 bucket can no longer decompress fast enough to finish the other build steps before mandatory timeouts; just one stupid example of something that could go wrong. IT is also demanding that NPM fix it rather than modify the build pipeline, which would take weeks to validate, so corporate's begging NPM to fix it by Tuesday's marketing blitz.
Just because it's safe in a lab provides no guarantee it's safe in production.
That’s an argument against making any change to the packaging system ever. “It might break something somewhere” isn’t an argument, it’s a paralysis against change. Improving the edge locality of npm package delivery could speed up npm installs. But speeding up npm installs might cause a CI system that relies on their timing to hit a race condition. Does that mean npm can’t ever make installs faster either?
This attitude is how in an age with gigabit fiber, 4GB/s hard drive write speed, 8x4 GHz cores with simd instructions it takes 30+ seconds to bundle a handful of files of JavaScript.
While NPM is open source, it's in the awkward spot of also having... probably hundreds of thousands if not millions of professional applications depend on it; it should be run like a business, because millions depend on it.
...which makes it all the weirder that security isn't any better, as in, publishing a package can be done without a review step on the npm side, for example. I find it strange that they haven't doubled down on enterprise offerings, e.g. creating hosted versions (corporate proxies), reviewed / validated / LTS / certified versions of packages, etc.
The problem is that this guy is treating open source as if it was a company where you need to direct a project to completion. Nobody in open source wants to be told what to do. Just release your work, if it is useful, the community will pick it up and everybody will benefit. You cannot force your improvement into the whole group, even if it is beneficial in your opinion.
Imagine being in the middle of nowhere, in winter, on a Saturday night, on some farm, knee deep in cow piss, servicing some 3rd party feed dispenser, only to discover that you have a possible solution but it's in some obscure format instead of .tar.gz. Nearest internet 60 miles away. This is what I always imagine happening when some new obscure format comes into play: imagine the poor fella, alone, cold, screaming. So incredibly close to his goal, but ultimately stopped by some artificial, unnecessary, made-up bullshit.
The final pro/cons list: https://github.com/npm/rfcs/pull/595#issuecomment-1200480148
I don't find the cons all that compelling to be honest, or at least I think they warrant further discussion to see if there are workarounds (e.g. a choice of compression scheme for a library like typescript, if they would prefer faster publishes).
It would have been interesting to see what eventually played out if the author hadn't closed the RFC themselves. It could have been the sort of thing that eventually happens after 2 years, but then quietly makes everybody's lives better.
"I don't find the cons all that compelling to be honest"
This is a solid example of how things change at scale. Concerns I wouldn't even think about for my personal website become things I need to think about for the download site being hit by 50,000 of my customers become big deals when operating at the scale of npm.
You'll find those arguments the pointless nitpicking of entrenched interests who just don't want to make any changes, until you experience your very own "oh man, I really thought this change was perfectly safe and now my entire customer base is trashed" moment, and then suddenly things like "hey, we need to consider how this affects old signatures and the speed of decompression and just generally whether this is worth the non-zero risks for what are in the end not really that substantial benefits".
I do not say this as the wise Zen guru sitting cross-legged and meditating from a position of being above it all; I say it looking at my own battle scars from the Perfectly Safe things I've pushed out to my customer base, only to discover some tiny little nit caused me trouble. Fortunately I haven't caused any true catastrophes, but that's as much luck as skill.
Attaining the proper balance between moving forward even though it incurs risk and just not changing things that are working is the hardest part of being a software maintainer, because both extremes are definitely bad. Everyone tends to start out in the former situation, but then when they are inevitably bitten it is important not to overcorrect into terrified fear of ever changing anything.
> This is a solid example of how things change at scale.
5% is 5% at any scale.
Yes and no. If I'm paying $5 a month for storage, I probably don't care about saving 5% of my storage costs. If I'm paying $50,000/month in storage costs, 5% savings is a lot more worthwhile to pursue
Doesn't npm belong to Microsoft? It must be hosted in Azure which they own so they must be paying a rock bottom rate for storage, bandwidth, everything.
It's probably less about MS and more about the people downloading the packages
For them it is 5% of something tiny.
Maybe, maybe not. If you are on a bandwidth limited connection and you have a bunch of NPM packages to install, 5% of an hour is a few minutes saved. It's likely more than that because long-transfers often need to be restarted.
A properly working cache and download manager that supports resume goes a long way.
I could never get Docker to work on my ADSL when it was 2 Mbps (FTTN got it up to 20) though it was fine in the Montreal office which had gigabit.
The amount of modules my docker hosts download from npm is anything but tiny.
5% off your next lunch and 5% off your next car are very much not the same thing.
Those lunches could add up to something significant over time. If you're paying $10 per lunch for 10 years, that's $36,500 which is pretty comparable to the cost of a car.
Which, then, supports the fact that scale matters, doesn't it?
Here the scale of time is larger and does make the $5 significant, while it isn't significant at the scale of a few days.
So what, instead of 50k for a car you spend 47.5k?
If that moves the needle on your ability to purchase the car, you probably shouldn't be buying it.
5% is 5%.
If it takes 1 hour of effort to save 5%:
- Doing 1 hour of effort to save 5% on your $20 lunch is foolhardy for most people. $1/hr is well below US minimum wage.
- Doing 1 hour of effort to save 5% on your $50k car is wise. $2500/hr is well above what most people are making at work.
It's not about whether the $2500 affects my ability to buy the car. It's about whether the time it takes me to save that 5% ends up being worthwhile to me given the actual amount saved.
The question is really "given the person-hours it takes to apply the savings, and the real value of the savings, is the savings worth the person-hours spent?"
This is something we often do in our house. We talk about things in terms of hours worked rather than price. I think more people should do it.
By that logic I waste time reading books instead of paying someone else to read them for me.
Paying somebody else to read the book means you don't get the benefit of the book.
Also, this is exactly what your company is doing, paying you to "read the book" so they don't have to.
If you can get the exact same result for less cost (time and money), why not? Things like enjoyment don't factor in since they can't be directly converted into money.
Why do so many people take illustrative examples literally?
I'm sure you can use your imagination to substitute "lunch" and "car" with other examples where the absolute change makes a difference despite the percent change being the same.
Even taking it literally... The 5% might not tip the scale of whether or not I can purchase the car, but I'll spend a few hours of my time comparing prices at different dealers to save $2500. Most people would consider it dumb if you didn't shop around when making a large purchase.
On the other hand, I'm not going to spend a few hours of my time at lunch so that I can save an extra $1 on a meal.
I wouldn't pick 5¢ up off the ground but I would certainly pick up $2500.
You'd keep the 5¢. A significant number of people who find sums of around $2,500 give them back unconditionally, with no expectation of reward. Whoever lost $2,500 is having a really bad day.
It's 5% of newly published packages only, with a potentially serious degradation to package publish times for those who have to do that step.
Given his numbers, let's say he saves 100 TB of bandwidth over a year. At AWS egress pricing... that's about $5,000 total saved.
And arguably - NPM is getting at least some of that savings by adding CPU costs to publishers at package time.
Feels like... not enough to warrant a risky ecosystem change to me.
https://www.reddit.com/r/webdev/comments/1ff3ps5/these_5000_...
NPM uses at least 5 petabytes per week. 5% of that is 250 terabytes.
So $15,000 a week, or $780,000 a year in savings could’ve been gained.
In a great example of the Pareto Principle (80/20), or actually something even more extreme, let's only apply this Zopfli optimization if a package's download total is equal to or greater than 1 GiB (from the Weekly Traffic in GiB column of the Top 5000 Weekly by Traffic tab of the Google Sheets file from the reddit post).
For reference, the total bandwidth used by all 5000 packages is 4,752,397 GiB.
Packages >= 1 GiB bandwidth/week: that turns out to be 437 packages (there's a header row, so it's rows 2-438), which use 4,205,510 GiB.
So 88% of the top 5000 bandwidth is consumed by downloading the top 8.7% (437) packages.
5% is about 210 TiB.
Limiting to the top 100 packages by bandwidth results in 3_217_584 GiB, which is 68% of total bandwidth used by 2% of the total packages.
5% is about 161 TiB.
Packages with >= 20GiB bandwidth == 47 packages totaling 2,536,902.81 GiB/week.
Less than 1% of top 5000 packages took 53% of the bandwidth.
5% would be about 127 TiB (rounded up).
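If anyone wants to reproduce this kind of cut-off analysis, here is a rough sketch. It assumes a CSV export of that sheet with the weekly GiB figure in the second column; the filename, column position, and thresholds are made up for illustration, and the CSV parsing is deliberately naive.

```ts
import { readFileSync } from "node:fs";

// Naive CSV parsing: assumes no quoted commas and that column 1 holds weekly GiB.
const weeklyGiB = readFileSync("top5000-weekly.csv", "utf8")
  .trim()
  .split("\n")
  .slice(1) // skip the header row
  .map((line) => Number(line.split(",")[1]));

const total = weeklyGiB.reduce((a, b) => a + b, 0);

function savingsAt(thresholdGiB: number) {
  const heavy = weeklyGiB.filter((g) => g >= thresholdGiB);
  const heavyTotal = heavy.reduce((a, b) => a + b, 0);
  return {
    packages: heavy.length,
    shareOfTotalBandwidth: heavyTotal / total,
    weeklyGiBSavedAtFivePercent: heavyTotal * 0.05,
  };
}

console.log(savingsAt(1));  // packages doing >= 1 GiB/week
console.log(savingsAt(20)); // packages doing >= 20 GiB/week
```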
How often are individuals publishing to NPM? Once a day at most, more typically once a week or month? A few dozen seconds of one person's day every month isn't a terrible trade-off.
Even that's addressable though if there's motivation, since something like transcoding server side during publication just for popular packages would probably get 80% of the benefit with no client-side increase in publication time.
In some scenarios the equation flips, and the enterprise is looking for _more_ scale.
The more bandwidth that Cloudflare needs, the more leverage they have at the peering table. As GitHub's largest repo (the @types / DefinitelyTyped repo owned by Microsoft) gets larger, the more experience the owner of GitHub (also Microsoft) gets in hosting the world's largest git repos.
I would say this qualifies as one of those cases, as npmjs is hosted on Azure. The more resources that NPM needs, the more Microsoft can build towards parity with AWS's footprint.
That's right, and 5% of a very small number is a very small number. 5% of a very big number is a big number.
Do you even know how absolute numbers work vis-à-vis percentages?
I agree with everything you said, but it doesn’t contradict my point
I'm saying you probably don't find them compelling because, from your point of view, the problems don't look important. They don't from my point of view either. But my point of view is the wrong point of view. From their point of view, this would be plenty to make me think twice (and several more times past that) before changing something so deeply fundamental to the system, for a benefit that nobody actually paying the price for package size seems particularly enthusiastic about. If the people paying the bandwidth bill aren't even that excited about a 5% reduction, then the cost/benefit analysis tips over into essentially "zero benefit, non-zero cost", and that's not very compelling.
The problems look important but underexplored
Or you're not understanding how he meant it: there are countless ways to roll out such changes, and a hard cutover is likely a very bad idea, as you've correctly pointed out.
But it is possible to do it more gradually, e.g. by sneaking it in with a new API that's used by newer npm versions, or similar.
But it was his choice to make, and it's fine that he didn't see enough value in pursuing such a tiny file size change.
I feel massively increased publish times are a valid reason not to push this, though, considering how small the gains are and who they apply to.
I agree, going from 1 second to 2.5 minutes is a huge negative change, in my opinion. I know publishing a package isn't something you do 10x a day but it's probably a big enough change that, were I doing it, I'd think the publish process is hanging and keep retrying it.
If you’re working on the build process itself, you’ll notice it a lot!
Since it's backwards compatible, individual maintainers could enable it in their own pipeline if they don't have issues with the slowdown. It sounds like it could be a single flag in the publish command.
Probably not worth the added complexity, but in theory, the package could be published immediately with the existing compression and then in the background, replaced with the Zopfli-compressed version.
> Probably not worth the added complexity, but in theory, the package could be published immediately with the existing compression and then in the background, replaced with the Zopfli-compressed version.
Checksum matters aside, wouldn't that turn the 5% bandwidth savings into an almost double bandwidth increase though? IMHO, considering the complexity to even make it a build time option, the author made the right call.
No, it can't because the checksums won't match.
I don't think that's actually a problem, but it would require continuing to host both versions (at distinct URLs) for any users who may have installed the package before the Zopfli-compressed version completed. Although I think you could also get around this by tracking whether the newly-released package was ever served by the API. If not, which is probably the common case, the old gzip-compressed version could be deleted.
Wouldn't that result in a different checksum for package-lock.json?
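It would. The lockfile's `integrity` value is a digest of the tarball bytes as served (the npm client computes it with the ssri package), so recompressing the same files yields a different string. A minimal sketch of the idea, with made-up filenames:

```ts
import { createHash } from "node:crypto";
import { readFileSync } from "node:fs";

// npm-style integrity string: sha512 over the tarball bytes, base64-encoded.
function integrityOf(tarballPath: string): string {
  const bytes = readFileSync(tarballPath);
  return "sha512-" + createHash("sha512").update(bytes).digest("base64");
}

// Same unpacked files, different compressor, different bytes -> different integrity.
console.log(integrityOf("helmet-7.1.0.tgz"));        // original gzip upload (hypothetical file)
console.log(integrityOf("helmet-7.1.0.zopfli.tgz")); // recompressed variant (hypothetical file)
```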
The pros aren't all that compelling either. The npm repo is the only group that this would really be remotely significant for, and there seemed to be no interest. So it doesn't take much of a con to nix a solution to a non-problem.
Every single download, until the end of time, is affected: it speeds up the servers, speeds up updates, saves disk space on the update servers, and saves on bandwidth costs and usage.
Everyone benefits; the only cost is an ultra-microscopic amount of time on the front end and a tiny cost on the client end, and for a very significant number of users, time and money saved. The examples of compression here...
Plus a few years of a compression expert writing a JS implementation of what was likely some very cursed C. And someone auditing its security. And someone maintaining it.
I felt the same. The proposal wasn't rejected! Also, performance gains go beyond user stories - e.g. they reduce infra costs and environmental impact - so I think the main concerns of the maintainers could have been addressed.
> The proposal wasn't rejected!
They soft-rejected by requiring more validation than was reasonable. I see this all the time. "But did you consider <extremely unlikely issue>? Please go and run more tests."
It's pretty clear that the people making the decision didn't actually care about the bandwidth savings, otherwise they would have put the work in themselves to do this, e.g. by requiring Zopfli for popular packages. I doubt Microsoft cares if it takes an extra 2 minutes to publish Typescript.
Kind of a wild decision considering NPM uses 4.5 PB of traffic per week. 5% of that is 225 TB/week, which according to my brief checks costs around $10k/week!
I guess this is a "not my money" problem fundamentally.
This doesn't seem quite correct to me. They weren't asking for "more validation than was reasonable". They were asking for literally any proof that users would benefit from the proposal. That seems like an entirely reasonable thing to ask before changing the way every single NPM package gets published, ever.
I do agree that 10k/week is non-negligible. Perhaps that means the people responsible for the 10k weren't in the room?
> which according to my brief checks costs around $10k/week
That's the market price though; for Microsoft it's a tiny fraction of that.
Or another way to look at it is it's just (at most!) 5% off an already large bill, and it might cost more than that elsewhere.
And I can buy 225 TB of bandwidth for less than $2k, I assume Microsoft can get better than some HN idiot buying Linode.
> And I can buy 225 TB of bandwidth for less than $2k
Even so, $2k a week is at least one competent FTE.
Massively increase the open source GitHub Actions bill for runners running longer (compute is generally more expensive) to publish, for a small decrease in network traffic (bandwidth is cheap at scale)?
> I don't find the cons all that compelling to be honest
I found it reasonable.
The 5% improvement was balanced against the cons of increased CLI complexity, the lack of a native JS zopfli implementation, and slower compression... and 5% just wasn't worth it at the moment - and I agree.
>or at least I think they warrant further discussion
I think that was the final statement.
Yes, but there’s a difference between “this warrants further discussion” and “this warrants further discussion and I’m closing the RFC”. The latter all but guarantees that no further discussion will take place.
No it doesn't. It only does that if you think discussion around future improvements belongs in RFCs.
Where DOES it belong, if not there?
> I don't find the cons all that compelling to be honest, or at least I think they warrant further discussion
It needs a novel JS port of a C compression library, which will be wired into a heavily-used and public-facing toolchain, and is something that will ruin a significant number of peoples' days if it breaks.
For me, that kind of ask needs a compelling use case from the start.
We wouldn't have to worry about over-the-wire package size if the modern DevOps approach wasn't "nuke everything, download from the Internet" every build.
Back in my Java days, most even small-time dev shops had a local Maven registry that would pass through and cache the big ones. A CI job, even if the "container" was nuked before each build, would create maybe a few kilobytes of Internet traffic, possibly none at all.
Now your average CI job spins up a fresh VM or container, pulls a Docker base image, apt installs a bunch of system dependencies, pip/npm/... installs a bunch of project dependencies, packages things up and pushes the image to the Docker registry. No Docker layer caching because it's fresh VM, no package manager caching because it's a fresh container, no object caching because...you get the idea....
Even if we accept that the benefits of the "clean slate every time" approach outweigh the gross inefficiency, why aren't we at least doing basic HTTP caching? I guess ingress is cheap and the egress on the other side is "someone else's money".
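For what it's worth, the "basic HTTP caching" part doesn't take much. Here's a sketch of a pull-through cache for registry tarballs; it's illustrative only (no eviction, no locking, no HTTPS, no metadata rewriting - real proxies like Verdaccio or Artifactory also rewrite registry metadata so tarball URLs point back at the proxy).

```ts
import http from "node:http";
import https from "node:https";
import { createHash } from "node:crypto";
import { existsSync, createReadStream, createWriteStream } from "node:fs";

const CACHE_DIR = "/var/cache/npm-proxy"; // assumed to exist and be writable

http.createServer((req, res) => {
  // Cache key derived from the request path.
  const key = createHash("sha256").update(req.url ?? "/").digest("hex");
  const file = `${CACHE_DIR}/${key}`;

  if (existsSync(file)) {
    createReadStream(file).pipe(res); // cache hit: serve from disk
    return;
  }

  https.get(`https://registry.npmjs.org${req.url}`, (upstream) => {
    res.writeHead(upstream.statusCode ?? 502, upstream.headers);
    if (upstream.statusCode === 200) {
      upstream.pipe(createWriteStream(file)); // fill the cache
    }
    upstream.pipe(res); // serve the client either way
  });
}).listen(8080);
```

Pointing a build at it is then just `npm config set registry http://localhost:8080/` (again, use a real proxy in practice).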
After reading the article, this comment and the comment thread further down on pnpm[1], it feels to me like the NPM team are doing everyone a disservice by ignoring the inefficiencies in the packaging system. It may not be deliberate or malicious, but they could easily have provided better solutions than the one proposed in the article which, in my opinion, is a band-aid solution at best. The real fix would be to implement what you mention here: a local registry and caching, and/or symlinking a la pnpm.
[1] https://news.ycombinator.com/item?id=42841658
Lots of places use a cache like Artifactory so they don't get slammed with costs, and are resilient to network outages and dependency builds vanishing.
In every org I've worked with, we had a local dependency mirror in the GitOps architecture.
I really don't want to go back to the old world where every part of your build is secretly stateful and fails in mysterious hard to reproduce ways.
You can and should have your own caching proxy for all your builds but local caches are evil.
Yeah, I would also note that in addition to speed/transfer-costs, having an organizational package proxy is useful for reproducibility and security.
Last I checked npm packages were full of garbage including non-source code. There's no reason for node_modules to be as big as it usually is, text compresses extremely well. It's just general sloppiness endemic to the JavaScript ecosystem.
It's not even funny.
(clipboardy ships executables, and none of them can be run on NixOS, btw.) I don't know why, but clipboard libraries tend to be really poorly implemented, especially in scripting languages.
I just checked out clipboardy and all they do is dispatch binaries from the path and hope it's the right one (or if it's even there at all). I think I had a similar experience with Python and Lua scripts. There's an unfunny amount of poorly-written one-off clipboard scripts out there just waiting to be exploited.
I'm only glad that the go-to clipboard library in Rust (arboard) seems solid.
Are they reproducible? Shipping binaries in JS packages is dodgy AF - a Jia Tan attack waiting to happen.
The executables are vendored in the repo [0].
[0] https://github.com/sindresorhus/clipboardy/tree/main/fallbac...
That's on the package publishers, not NPM. They give you an `.npmignore` that's trivially filled out to ensure your package isn't full of garbage, so if someone doesn't bother using that: that's on them, not NPM.
(And it's also a little on the folks who install dependencies: if the cruft in a specific library bothers you, hit up the repo and file an issue (or even MR/PR) to get that .npmignore file filled out. I've helped folks reduce their packages by 50+MB in some cases, it's worth your own time as much as it is theirs)
It's much better to allowlist the files meant to be published using `files` in package.json because you never know what garbage the user has in their folder at the time of publish.
On a typical project with a build step, only a `dist` folder would be published.
Not a fan of that one myself (it's far easier to tell what doesn't belong in a package vs. what does belong in a package) but that option does exist, so as a maintainer you really have no excuse, and as a user you have multiple MR/PRs that you can file to help them fix their cruft.
> On a typical project with a build step, only a `dist` folder would published.
Sort of, but always include your docs (readme, changelog, license, and whatever true docs dir you have, if you have one). No one should need a connection for those.
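For anyone unfamiliar, the allowlist is just a `files` field in package.json, and npm always includes package.json plus the README and LICENSE regardless of it (some versions also auto-include a changelog), so part of the docs concern is covered automatically; anything like a dedicated docs directory has to be listed explicitly. An illustrative fragment:

```json
{
  "name": "some-package",
  "version": "1.0.0",
  "main": "dist/index.js",
  "files": [
    "dist",
    "docs"
  ]
}
```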
You might be interested in e18e if you would like to see that change: https://e18e.dev/
They’ve done a lot of great work already.
Does this replace ljharb stuff?
Yep, I wrote a script that starts at a root `node_modules` folder and iterates through to remove anything not required (dotfiles, Dockerfile, .md files, etc.). In one of our smaller apps this removes about 25 MB of fluff; some packages have up to 60-70 MB of crap removed.
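Roughly what such a script might look like; the pattern list below is illustrative rather than the parent's actual list, and note that `node_modules/.bin` has to be spared or installed executables break. Test before pointing it at anything you care about.

```ts
import { readdirSync, rmSync } from "node:fs";
import { join } from "node:path";

// Illustrative junk patterns: dotfiles, Dockerfiles, markdown docs.
const JUNK = [/^\./, /^Dockerfile$/i, /\.md$/i];

function clean(dir: string): void {
  for (const entry of readdirSync(dir, { withFileTypes: true })) {
    if (entry.name === ".bin") continue; // keep executable shims
    const full = join(dir, entry.name);
    if (JUNK.some((re) => re.test(entry.name))) {
      rmSync(full, { recursive: true, force: true }); // remove matching files/dirs
    } else if (entry.isDirectory()) {
      clean(full); // recurse into nested packages
    }
  }
}

clean("node_modules");
```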
Totally agree with you. I wish npm did a better job of filtering the crap files out of packages.
One of the things I like about node_modules is that it's not purely source code and it's not purely build artifacts.
You can read the code and you can usually read the actual README/docs/tests of the package instead of having to find it online. And you can usually edit library code for debugging purposes.
If node_modules is taking up a lot of space across a bunch of old projects, just write the `find` script that recursively deletes them all; You can always run `npm install` in the future when you need to work on that project again.
At the least, switch to pnpm to minimize the bloat.
As someone who mostly works in Java it continues to floor me that this isn’t the default. Why does every project I work on need an identical copy of possibly hundreds of packages if they’re the same version?
I also like Yarn pnp’s model of leaving node_modules as zip files. CPUs are way faster than storage, they can decompress on the fly. Less disk space at rest, less disk slack, less filesystem bookkeeping.
Every single filesystem is way faster at dealing with one file than with dozens or hundreds. Now multiply that by the hundreds of packages it does this for, and it adds up.
I just installed a project with pnpm: about 120 packages, mostly react/webpack/eslint/redux related.
with prod env: 700MB
without prod env: 900MB
sadly the bloat cannot be avoided that well :/
pnpm stores them in a central place and symlinks them. You’ll see the benefits when you have multiple projects with a lot of the same packages.
You'll also see the benefit when `rm -rf`ing a `node_modules` and re-installing, as pnpm still has a local copy that it can re-link after validating its integrity.
I believe I knocked 10% off of our node_modules directory by filing .npmignore PRs or bug reports to tools we used.
Now if rxjs weren’t a dumpster fire…
Props to anyone who tries to make the world a better place.
It's not always obvious who has the most important use cases. In the case of NPM they are prioritizing the user experience of module authors. I totally see how this change would be great for module consumers, yet create potentially massive inconvenience for module authors.
Interesting write-up
I think "massive" is overstating it. I don't think deploying a new version of a package is something that happens many times a day, so it wouldn't be a constant pain point.
Also, since this is a case of having something compressed once and decompressed potentially thousands of times, it seems like the perfect tool for the job.
Module authors generally have fairly large test suites which are run often, sometimes on each file save. If you have a 1 or 2 second build script, it's not a huge deal. If that script starts taking 30-60 seconds, you have just hosed productivity. Also, you have massively increased the load on your CI server, possibly bumping you out of a free tier.
The fix would then have to be some variation of:
a) Stop testing (so often)
b) Stop bundling before testing
c) Publish to a different package manager
- all of which would affect the overall quality and quantity of npm modules.
In that case, I don't understand why you would bundle the package every time you run tests. What does that do?
You have to test that your bundle _actually works_, especially if you are using non-standard compression.
But yes- you could bundle less, but that would be a disadvantage, particularly if a bundle suddenly fails and you don’t know which change caused it. But maybe that’s not a big deal for your use case.
Every build in a CI system would probably create the package.
This is changing every build in every CI system to make it slower.
Just use it on the release build.
A few people have mentioned the environmental angle, but I'd care more about if/how much this slows down decompression on the client. Compressing React 20x slower once is one thing, but 50 million decompressions being even 1% slower is likely net more energy intensive, even accounting for the saved energy transmitting 4-5% fewer bits on the wire.
It's very likely zero or positive impact on the decompression side of things.
Starting with smaller data means everything ends up smaller. It's the same decompression algorithm in all cases, so it's not some special / unoptimized branch of code. It's yielding the same data in the end, so writes equal out plus or minus disk queue fullness and power cycles. It's _maybe_ better for RAM and CPU because more data fits in cache, so less memory is used and the compute is idle less often.
It's relatively easy to test decompression efficiency if you think CPU time is a good proxy for energy usage: go find something like React and test the decomp time of gzip -9 vs zopfli. Or even better, find something similar but much bigger so you can see the delta and it's not lost in rounding errors.
I can speak to this - there is no meaningful decompression effect across an insane set of tested data at Google and elsewhere. Zopfli was invented prior to brotli.
Zopfli is easiest to think of as something that just tries harder than gzip to find matches and better encodings. Much harder.
Decompression speed is linear either way.
It's easiest to think of decompression as a linear time vm executor[1], where the bytecoded instructions are basically
go back <distance> bytes, output the next <length> bytes you see, then output character <c>
(outputting literal data is the instruction <0,0,{character to output}>)
Assuming you did not output a file larger than the original uncompressed file (why would you bother?), you will, worst case, process N bytes during decompression, where N is the size of the original input file.
The practical decompression speed is driven by cache behavior, but it thrashes the cache no matter what.
In practice, reduction of size vs gzip occurs by either finding larger runs, or encodings that are smaller than the existing ones.
After all, if you want the compressed file to shrink, you need to output fewer instructions somehow, or make more of the instructions identical (so they can be represented in fewer bits by the later Huffman coding).
In practice, this has almost exclusively positive effects on decompression speed - either the vm has fewer things to process (which is faster), or more of the things it does look the same (which has better cache behavior).
[1] this is one way archive formats will sometimes choose to deal with multiple compression method support - encode them all to the same kind of bytecode (usually some form of copy + literal instruction set), and then decoding is the same for all of them. ~all compression algorithms output some bytecode like the above on their own already, so it's not a lot of work. This doesn't help you support other archive formats, but if you want to have a bunch of per-file compression options that you pick from based on what works best, this enables you to still only have to have one decoder.
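To make the "bytecode VM" picture above concrete, here's a toy decoder for instructions of exactly that shape. It's purely illustrative (real DEFLATE streams are Huffman-coded and structured differently), but the linear-time copy loop is the same idea.

```ts
// Instruction: copy <length> bytes starting <distance> back, then emit one literal byte.
type Op = { distance: number; length: number; literal: number };

function decode(ops: Op[]): Uint8Array {
  const out: number[] = [];
  for (const op of ops) {
    const start = out.length - op.distance;
    for (let i = 0; i < op.length; i++) {
      out.push(out[start + i]); // back-reference copy; may overlap its own output
    }
    out.push(op.literal);
  }
  return Uint8Array.from(out);
}

// "abcabcabcd": three literals, then "copy 6 bytes from 3 back" (which overlaps), then 'd'.
const bytes = decode([
  { distance: 0, length: 0, literal: 0x61 }, // 'a'
  { distance: 0, length: 0, literal: 0x62 }, // 'b'
  { distance: 0, length: 0, literal: 0x63 }, // 'c'
  { distance: 3, length: 6, literal: 0x64 }, // copy "abcabc", then 'd'
]);
console.log(new TextDecoder().decode(bytes)); // -> "abcabcabcd"
```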
For formats like deflate, decompression time doesn't generally depend on compressed size. (zstd is similar, though memory use can depend on the compression level used).
This means an optimization like this is virtually guaranteed to be a net positive on the receiving end, since you always save a bit of time/energy when downloading a smaller compressed file.
This seems like a place where the more ambitious version that switches to zstd might have better tradeoffs. You would get similar or better compression, with faster decompression and recompression than the gzip/zopfli approach. It would lose backward compatibility though...
Not necessarily - you could retain backward compat by publishing both gzip and zstd variants and having downloaders with newer npm versions prefer the zstd one. Over time, you could require that packages only upload zstd going forward, and either generate zstd versions of the backlog of unmaintained packages or at least of those that see some amount of traffic over some time period, if you're willing to drop very old packages. The ability to install arbitrary versions of packages probably means you're better off reprocessing the backlog, although that may cost more than doing nothing.
The package lock checksum is probably a more solvable issue with some coordination.
The benefit of doing this though is less immediate - it will take a few years to show payoff and these kinds of payoffs are not typically made by the kind of committee decisions process described (for better or worse).
Brotli and lzo1b have good compression ratios and pretty fast decompression speeds. Compression speed should not matter that much, since you only do it once.
https://quixdb.github.io/squash-benchmark/
There are even more obscure options:
https://www.mattmahoney.net/dc/text.html
That's a much higher hurdle to jump. I don't blame the author for trying this first.
If accepted, it might have been a good stepping stone too. A chance to get to know everyone and their concerns and how they think.
So if you wanted to see how this works (proposal + in prod) and then come back later proposing something bigger by switching off gzip, that would make sense to me as a possible follow-up.
Years back I came to the conclusion that conda using bzip2 for compression was a big mistake.
Back then if you wanted to use a particular neural network it was meant for a certain version of Tensorflow which expected you to have a certain version of the CUDA libs.
If you had to work with multiple models the "normal" way to do things was use the developer unfriendly [1][2] installers from NVIDIA to install a single version of the libs at a time.
Turned out you could have many versions of CUDA installed as long as you kept them in different directories and set the library path accordingly, it made sense to pack them up for conda and install them together with everything else.
But oh boy was it slow to unpack those bzip2 packages! Since conda had good caching, if you build environments often at all you could be paying more in decompress time than you pay in compression time.
If you were building a new system today you'd probably use zstd since it beats gzip on both speed and compression.
[1] click... click... click...
[2] like they're really going to do something useful with my email address
>But oh boy was it slow to unpack those bzip2 packages! Since conda had good caching, if you build environments often at all you could be paying more in decompress time than you pay in compression time.
For Paper, I'm planning to cache both the wheel archives (so that they're available without recompressing on demand) and unpacked versions (installing into new environments will generally use hard links to the unpacked cache, where possible).
> If you were building a new system today you'd probably use zstd since it beats gzip on both speed and compression.
FWIW, in my testing LZMA is a big win (and I'm sure zstd would be as well, but LZMA has standard library support already). But there are serious roadblocks to adopting a change like that in the Python ecosystem. This sort of idea puts them several layers deep in meta-discussion - see for example https://discuss.python.org/t/pep-777-how-to-re-invent-the-wh... . In general, progress on Python packaging gets stuck in a double-bind: try to change too little and you won't get any buy-in that it's worthwhile, but try to change too much and everyone will freak out about backwards compatibility.
I designed a system which was a lot like uv but written in Python and when I looked at the politics I decided not to go forward with it. (My system also had the problem that it had to be isolated from other Pythons so it would not get its environment trashed, with the ability for software developers to trash their environment I wasn't sure it was a problem that could be 100% solved. uv solved it by not being written in Python. Genius!)
Yes, well - if I still had reason to care about the politics I'd be in much the same position, I'm sure. As is, I'm going to just make the thing, write about it, and see who likes it.
One thing that's excellent about zopfli (apart from being gzip compatible) is how easy it is to bootstrap:
It just requires a C compiler and linker. The main downside, though: it's impressively slow.
Comparing to gzip isn't really worth it. Combine pigz (threaded) with zlib-ng (simd) and you get decent performance. pigz is used in `docker push`.
For example, gzipping llvm.tar (624 MB) takes less than a second for me.
At the same time, zopfli compiled with -O3 -march=native takes 35 minutes. No wonder it's not popular. It is almost 2700x slower than the state of the art for just a 6.8% size reduction.
> 2700x slower
That is impressively slow.
In my opinion even the 28x decrease in performance mentioned would be a no-go. Sure the package saves a few bytes but I don't need my entire pc to grind to a halt every time I publish a package.
Besides, storage is cheap but CPU power draw is not. Imagine the additional CO2 that would have to be produced if this RFC was merged.
> 2 gigabytes of bandwidth per year across all installations
This must be a really rough estimate and I am curious how it was calculated. In any case 2 gigabytes over a year is absolutely nothing. Just my home network can produce a terabyte a day.
2 GB for the author's package which is neither extremely common nor large; it would be 2 TB/year just for react core.
I am confused, how is this number calculated?
Because the author's mentioned package, Helmet[1], is 103 KB uncompressed and has had 132 versions in 13 years. Meaning downloading every Helmet version uncompressed would come to 132 * 103 KB ≈ 13.7 MB.
I feel like I must be missing something really obvious.
Edit: Oh it's 2GB/year across all installations.
[1]: https://www.npmjs.com/package/helmet?activeTab=versions
Congrats on a great write-up. Sometimes trying to ship something at that sorta scale turns out to just not really make sense in a way that is hard to see at the beginning.
Another personal win is that you got a very thorough understanding of the people involved and how the outreach parts of the RFC process works. I've also had a few fail, but I've also had a few pass! Always easier to do the next time
Pulling on this thread, there are a few people who have looked at the ways zopfli is inefficient. Including this guy who forked it, and tried to contribute a couple improvements back to master:
https://github.com/fhanau/Efficient-Compression-Tool
These days if you're going to iterate on a solution you'd better make it multithreaded. We have laptops where sequential code uses 8% of the available CPU.
> These days if you’re going to iterate on a solution you’d better make it multithreaded.
Repetition eliminating compression tends to be inherently sequential. You'd probably need to change the file format to support chunks (or multiple streams) to do so.
Because of LZ back references, you can't LZ compress different chunks separately on different cores and have only one compression stream.
Statistics acquisition (histograms) and entropy coding could be parallel I guess.
(Not a compression guru, so take above with a pinch of salt.)
There are gzip variants that break the file into blocks and run in parallel. They lose a couple of % by truncating the available history.
But zopfli appears to do a lot of backtracking to find the best permutations for matching runs that have several different solutions. There’s a couple of ways you could run those in parallel. Some with a lot of coordination overhead, others with a lot of redundant calculation.
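On the first point, the couple of % lost to truncated history is easy to see by compressing a file whole versus in independent blocks. A quick sketch (the block size is arbitrary, and the per-block gzip headers slightly overstate the loss):

```ts
import { gzipSync } from "node:zlib";
import { readFileSync } from "node:fs";

const data = readFileSync(process.argv[2] ?? "some-large-text-file.txt");
const whole = gzipSync(data, { level: 9 }).length;

const BLOCK = 128 * 1024; // arbitrary block size; pigz uses its own scheme
let blocked = 0;
for (let off = 0; off < data.length; off += BLOCK) {
  // Each block is compressed with no access to earlier history.
  blocked += gzipSync(data.subarray(off, off + BLOCK), { level: 9 }).length;
}

console.log({ whole, blocked, overheadPct: ((blocked / whole - 1) * 100).toFixed(2) });
```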
I wonder if it would make more sense to pursue Brotli at this point; Node has had it built in since 10.x, so it should be pretty ubiquitous by now. It would require an update to NPM itself though.
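For reference, that comparison needs nothing beyond the standard library, since Brotli ships in Node's zlib module (since the 10.x line, as noted). A quick sketch:

```ts
import { readFileSync } from "node:fs";
import { gzipSync, brotliCompressSync, constants } from "node:zlib";

const input = readFileSync(process.argv[2] ?? "package.tar"); // any tarball or big file

const gz = gzipSync(input, { level: 9 }).length;
const br = brotliCompressSync(input, {
  params: { [constants.BROTLI_PARAM_QUALITY]: 11 }, // max quality: slow, but a one-time cost at publish
}).length;

console.log({ gzip: gz, brotli: br, saved: ((1 - br / gz) * 100).toFixed(1) + "%" });
```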
+1 to brotli. Newly published packages could use brotli by default, so old ones stay compatible.
Here's the Brotli supporter's blog post about adding Brotli support to NPM packages.
https://jamiemagee.co.uk/blog/honey-i-shrunk-the-npm-package...
and the related HN discussion from that time:
https://news.ycombinator.com/item?id=37754489
Nice write up!
> When it was finally my turn, I stammered.
> Watching it back, I cringe a bit. I was wordy, unclear, and unconvincing.
> You can watch my mumbling in the recording
I watched this, and the author was articulate and presented well. The author is too harsh!
Good job for trying to push the boundaries.
This reminds me of a time I lost an argument with John-David Dalton about cleaning up/minifying lodash as an npm dependency, because when including the readme and license for every sub-library, a lodash import came to ~2.5MB at the time. This also took a lot of seeking time for disks because there were so many individual files.
The conversation started and ended at the word cache.
> This also took a lot of seeking time for disks because there were so many individual files.
The fact NPM keeps things in node_modules unzipped seems wild to me. Filesystems are not great at hundreds of thousands of little files. Some are bad, others are terrible.
Zip files are easier to store, take up less space, and CPUs are so much faster than disks that decompressing in memory is probably faster than reading the unzipped files.
That was one of my favorite features of Yarn when I tried it - PnP mode. But since it's not what NPM does, it requires a shim that doesn't work with all packages. Or at least didn't a few years ago.
I'd love to see an effort like like this succeed in the Python ecosystem. Right now, PyPI is dependent upon Fastly to serve files, on the order of >1 petabyte per day. That's a truly massive in-kind donation, compared to the PSF's operating budget (only a few million dollars per year - far smaller than Linux or Mozilla).
No problem, I'm sure if Fastly stopped doing it JiaTanCo would step up
I don't see why it wouldn't be possible to hide this behind a flag once Node.js supports zopfli natively. In the case of CI/CD, it's totally feasible to just add a --strong-compression flag. In that case, the user expects it to take its time.
TS releases a non-preview version every few months, so using 2.5 minutes for compression would work.
Think of the complexity involved, think of having to fix bugs because something went wrong.
Such effort would be better spent preparing npm/node for a future where packages with a lower-bound npm version constraint can be compressed with zstd or similar.
Every feature you add to a program is complexity, you need to really decide if you want it.
What about a different approach - an optional npm proxy that recompresses popular packages with 7z/etc in the background?
Could verify package integrity by hashing contents rather than archives, plus digital signatures for recompressed versions. Only kicks in for frequently downloaded packages once compression is ready.
Benefits: No npm changes needed, opt-in only, potential for big bandwidth savings on popular packages. Main tradeoff is additional verification steps, but they could be optional given a digital signature approach.
Curious if others see major security holes in this approach?
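One way the "hash contents rather than archives" part could work, assuming the widely used tar-stream package and a stable entry order: digest each entry's path and bytes, so recompressing the same tar with gzip, zstd, or 7z leaves the value unchanged. A sketch:

```ts
import { createReadStream } from "node:fs";
import { createGunzip } from "node:zlib";
import { createHash } from "node:crypto";
import * as tar from "tar-stream";

function contentHash(tgzPath: string): Promise<string> {
  return new Promise((resolve, reject) => {
    const hash = createHash("sha512");
    const extract = tar.extract();

    extract.on("entry", (header, stream, next) => {
      hash.update(header.name); // include the path in the digest
      stream.on("data", (chunk) => hash.update(chunk));
      stream.on("end", next);
    });
    extract.on("finish", () => resolve(hash.digest("base64")));
    extract.on("error", reject);

    createReadStream(tgzPath).pipe(createGunzip()).pipe(extract);
  });
}

contentHash("left-pad-1.3.0.tgz").then(console.log); // hypothetical local tarball
```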
This felt like the obvious way to do things to me: hash a .tar file, not a .tar.gz file. Use Accept-Encoding to negotiate the compression scheme for transfers. CDN can compress on the fly or optionally cache precompressed files. i.e. just use standard off-the-shelf HTTP features. These days I prefer uncompressed .tar files anyway because ZFS has transparent zstd, so decompressed archive files are generally smaller than a .gz.
> hash a .tar file, not a .tar.gz file
For security reasons, it's usually better to hash the compressed file, since it reduces the attack surface: the decompressor is not exposed to unverified data. There have already been vulnerabilities in decompressor implementations which can be exploited through malformed compressed data (and this includes IIRC at least one vulnerability in zlib, which is the standard decompressor for .gz).
This suggests one should just upload a tar rather than a compressed file. Makes sense because one can scan the contents for malicious files without risking a decompressor bug.
BTW, npm decompresses all packages anyhow, because it lets you view the contents on its website these days.
You are correct. They should be uploading and downloading dumb tar files and let the HTTP connection negotiate the compression method. All hashes should be based on the uncompressed raw tar dump. This would be proper separation of concerns.
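A sketch of what that separation of concerns could look like on the serving side: the stored (and hashed) artifact is a plain .tar, and the transfer encoding is negotiated per request. A real deployment would pre-compress and cache the variants rather than compress on the fly, and would sanitize paths; this is only to show the shape of it.

```ts
import http from "node:http";
import { createReadStream } from "node:fs";
import { createGzip, createBrotliCompress } from "node:zlib";

http.createServer((req, res) => {
  const tarPath = "." + req.url; // e.g. ./packages/helmet-7.1.0.tar (no path sanitization: sketch only)
  const accepts = req.headers["accept-encoding"] ?? "";

  res.setHeader("Content-Type", "application/x-tar");
  res.setHeader("Vary", "Accept-Encoding");

  const src = createReadStream(tarPath);
  if (accepts.includes("br")) {
    res.setHeader("Content-Encoding", "br");
    src.pipe(createBrotliCompress()).pipe(res);
  } else if (accepts.includes("gzip")) {
    res.setHeader("Content-Encoding", "gzip");
    src.pipe(createGzip()).pipe(res);
  } else {
    src.pipe(res); // identity fallback for very old clients
  }
}).listen(8080);
```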
Enjoy zipbomb.js
But npm already decompresses every package because it shows the contents on its website. So yeah it can be malicious but it already has dealt with that risk.
I once created a Maven plugin to recompress Java artifacts with zopfli. I rewrote it in Java so it runs entirely in the JVM. This means the speed is worse and it may contain bugs:
https://luccappellaro.github.io/2015/03/01/ZopfliMaven.html
My experiment on how to reduce the javascript size of every web app by 30-50%: https://github.com/avodonosov/pocl
Working approach, but in the end I abandoned the project - I doubt people care about such js size savings.
I got measurable decreases in deployment time by shrinking the node_modules directory in our docker images.
I think people forget that, when you’re copying the same images to dozens and dozens of boxes, any improvement starts to add up to real numbers.
I've not done it, but have you considered using `pnpm` and volume-mounting a shared persistent `pnpm-store` into the containers? It seems like you'd get near-instant npm installs that way.
The only time npm install was on the critical path was hotfixes. It's definitely worth considering. But I was already deep into doing people giant favors that they didn't even notice, so I was juggling many other goals. I think the only thank you I got was from the UI lead, who had a soda-straw internet connection; this and another thing I did saved him a bunch of hard-to-recover timeouts.
In this approach the size of deployment bundles / images is not necessarily reduced.
What is reduced is the size of the javascript loaded into the end user's browser.
Wdym?? 50% is a big deal
Why big deal?
50% is just a big O of the original size :)
If we consider overall Internet traffic, which is dominated by video and images, the size of javascript transmitted is negligible.
Savings of memory and CPU for the end user's browser? Maybe, but our software at all layers is so bloated, without need, just carelessly, that I'm not sure working on such javascript savings is useful - any resources saved will be eaten by something else immediately.
For an application developer or operator, the 50% savings in javascript size are probably not worth dealing with some tool that dynamically prunes the app code, raising questions about privacy and security. I thought through the security and privacy questions, but who would want to even spend attention on these considerations?
As I mention in the README at github, a Microsoft researcher investigated the same approach earlier, but Microsoft haven't taken it anywhere.
There was a commercial company offering this approach to javascript minification as a complete product, along with other optimizations, like image size, etc. Their proxy embedded special "agent" code into the app which inspected the device and reported to the server what image sizes are optimal for the user and what js functions are invoked. And the server prepared optimized versions of the app for various devices. Javascript was "streamed" in batches of only the functions needed by the app.
Now the company is dissolved - that's why the website is unavailable. Wikipedia says they were bought out by Akamai - https://en.wikipedia.org/wiki/Instart. But I don't see any traces of this approach in today's Akamai offerings.
I contacted several companies, in the CDN business and others, trying to interest them in the idea and get very modest funding for a couple more months of my time to work on this (I was working on this at the end of a long break from paid work and was running out of savings). Didn't find anyone ready to take part.
This all may be a sign that such an optimization is not valuable enough for users.
A 50% size saving isn't important to the people who pay for it. They would pay at most pennies even for 100% savings (that is, somehow getting all the functionality in zero bytes - it's not worth anything to those paying the bills).
Size savings translates to latency improvements which directly affects conversion rates. Smaller size isn’t about reducing costs but increased revenue. People care.
Note that this proof-of-concept implementation saves latency on first load, but may add latency at surprising points while using the website. Any user invoking a rarely-used function would see a delay before the javascript executes, without the traditional UI affordances (spinners etc) to indicate that the application was waiting on the network. Further, these secretly-slow paths may change from visit to visit. Many users know how to "wait for the app to be ready," but the traditional expectation is that once it's loaded, the page/app will work, and any further delays will be signposted.
I'm sure it works great when you've got high-speed internet, but might break things unacceptably for users on mobile or satellite connections.
Only the first user who hits a rarely used execution point may experience the noticeable latency, if he also has slow internet, etc.
As soon as the user executes a rarely used function, the information about this fact is sent to the server, which then includes this function in the active set to be sent to future users.
In the video I manually initiate re-generation of the "active" and the "rest" scripts, but even the most primitive MVP was supposed to schedule re-generation of the scripts when receiving info that some previously unseen functions are executed in the browser.
Obviously, if the idea is developed further, the first user's experience may also be improved - a spinner shown. Pre-loading the inactive set of functions in the background may also be considered (pros: it avoids latency for users who invoke rare functionality; cons: we lose the savings in traffic and in browser memory and CPU for compiling the likely unneeded code).
(BTW, further development of the idea includes splitting the code at a finer granularity than just "active" / "inactive". E.g. active in the first 5 seconds after opening the page and loaded immediately, likely to be active soon, active but rarely called - the latter two parts definitely need to be loaded in the background)
> without the traditional UI affordances (spinners etc) to indicate that the application was waiting on the network.
This part is obviously trivially solvable. I think the same basic idea is going to make it at some point, but it'll have to be through explicit annotations first, and then there will be tooling to automatically do this for your code based upon historical visits, where you get to tune the % of visitors that get additional fetches. Also, you could probably fetch the split-off script in the background anyway as a prefetch + download everything rather than just 1 function at a time (or even download related groups of functions together).
The idea has lots of merit and you just have to execute it right.
I agree with most of your comment, but don't get what you mean about explicit annotations.
Note that most of the unused code is located in libraries, not in the app code directly.
split async function handle_button_press() { … }
This would cause the bundler to inject a split point & know how to hide that + know what needs bundling and what doesn't. GWT pioneered this almost 20 years ago, although I'm not a fan of the syntax they invented to keep everything running within stock Java syntax: https://www.gwtproject.org/doc/latest/DevGuideCodeSplitting....
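For illustration, here is roughly what such a hypothetical `split` annotation could desugar to - essentially the dynamic import() split point that bundlers like webpack, Rollup and Vite already understand (showSpinner/hideSpinner and the chunk path are made-up names, not a real API):
    // Hypothetical lowering of `split async function handle_button_press`.
    async function handle_button_press(event) {
      showSpinner(); // the UI affordance while the chunk is fetched
      try {
        // Bundlers emit this as a separate chunk, loaded over the network
        // only the first time the function is actually called.
        const { handleButtonPress } = await import('./chunks/button-press.js');
        return await handleButtonPress(event);
      } finally {
        hideSpinner();
      }
    }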
Why do you want them explicit in the code?
The tooling producing such annotations based on historical visits would constantly change the annotated set of functions, I suspect, as app versions evolve.
Note, the inactive code parts are very often in 3rd party libraries.
Some are also in your own libraries shared between your different applications. So the same shared library would need one set of split-annotated functions for one app, and another set for another app. So these conflicting sets of annotations cannot live in the library source code simultaneously.
For example the automated system could take your non-split codebase & inject splits at any async function & self-optimize as needed. The manual annotation support would be to get the ecosystem going as an initial step because I suspect targeting auto-optimizing right from the get go may be too big a pill to swallow.
Who? Anyone who is on a slow internet connection or who has a slow device can tell you they don't care. Or maybe they do, but features are far more important. I guess if things are slow on a top-of-the-line device with a fast connection they would care.
Agreed - often a CTO of an ecom site is very very focused on site speed and has it as their #1 priority since it directly increases revenue.
How do you evaluate call usage?
By instrumenting the code so that the function records the fact that it is being invoked.
Then the info about the called functions is sent back to the server.
(Only the functions never seen to be called are instrumented, the known active functions are not instrumented).
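A rough sketch of what that instrumentation could look like in the browser (the reporting endpoint and the function ids are invented for illustration):
    // Functions not yet known to be active get wrapped; the first call reports
    // the id back so the server can move the function into the "active" bundle.
    const reported = new Set();

    function instrument(id, fn) {
      return function (...args) {
        if (!reported.has(id)) {
          reported.add(id);
          // sendBeacon doesn't block the call and survives page unload
          navigator.sendBeacon('/usage-report', JSON.stringify({ fn: id }));
        }
        return fn.apply(this, args);
      };
    }

    // e.g. module.exports.parseConfig = instrument('lib/config#parseConfig', parseConfig);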
I think this is called tree shaking and Vite/Rollup do this by default these days. Of course, it's easy when you explicitly say what you're importing.
That's not tree-shaking.
Oh, I guess you would need something like dynamic imports to not include uncommonly used functionality in the main bundle
This strikes me as something that could be done for the highest-traffic packages at the backend, rather than be driven by the client at publish-time.
The article talks about this. There are hashes that are generated for the tarball so the backend can't recompress anything.
A pro could have been an extra narrative about carbon footprint savings.
I'm surprised it hasn't been raised when talking about saving 2Tb/year just for React. It represents costs, which doesn't seem to be an issue, but also computing power and storage. (Even with higher/longer computing time due to slower compression, it's done once per version, which isn't really comparable to the amount of downloads anyway)
Hard to calculate the exact saving, but it would represent a smaller CO2 footprint.
It only doesn't apply to existing versions of existing packages. Newer releases would apply Zopfli, so over time likely the majority of actively used/maintained packages would be recompressed.
These days technology moves so fast it's hard to keep up. The slowest link in the system is the human being.
That's a strong argument that 'if it isn't broke, don't fix it."
Lots of numbers being thrown around; you add up tiny things enough times, you can get a big number. But is npm package download the thing that's tanking the internet? No? Then this is a second- or third-order optimization.
> Integrating Zopfli into the npm CLI would be difficult.
Is it possible to modify "gzip -9" or zlib to invoke zopfli? This way everyone who wants to compress better will get the extra compression automatically, in addition to npm.
There will be an increase in compression time, but since "gzip -9" is not the default, people preferring compression speed might not be affected.
You'd have more problems here, but you could do it - if you let it take ages and ages to percolate through all environments.
It's been almost 30 years since bzip2 was released and even now not everything can handle tar.bz2
probably because bzip2 isn't a very good format
I used zopflipng in the past to optimize PNG images. It made sense since there was no better alternative to store lossless image data than the PNG format at the given time in the given environment. Zopfli is awesome when you are locked in on deflate compression. I feel like if the npm folks wanted to optimize for smaller package size, a better strategy would be switching to some more effective text compression (e.g. bzip2, xz). That would result in a larger file size reduction than 5% for a smaller CPU time increase compared to Zopfli. You would need to come up with some migration strategy though, as this change isn't per se backwards compatible, but that seems manageable if you are in control of the tooling and infrastructure.
I'll give you an even better idea, but it'll need at least 100 volunteers, maybe 1000. Take each package and rewrite it without external dependencies. That will cut tech debt for that package significantly. Just like how we have @types/xyz, where some dude named DefinitelyTyped is busy making typescript packages for everything, let's make a namespace like @efficient/cors @efficient/jsdom @efficient/jest etc and eliminate all external dependencies completely for every library on npm.
If OP wanted to shrink npm packages, then npm could introduce two types of npm package - the build package and the source one. This way a lot of packages would be smaller, because source code could be safely distributed through a separate package and not kept in the build package. There are a lot of npm packages that explicitly include the whole git repo in the package, and they do so because there's only one type of package they can use.
From the RFC on github[1].
> Zopfli is written in C, which presents challenges. Unless it was added to Node core, the CLI would need to (1) rewrite Zopfli in JS, possibly impacting performance (2) rely on a native module, impacting reliability (3) rely on a WebAssembly module. All of these options add complexity.
Wow! Who's going to tell them that V8 is written in C++? :)
[1]: https://github.com/npm/rfcs/pull/595
It's not about C per-se, as much as each native compiled dependency creates additional maintenance concerns. Changes to hardware/OS can require a recompile or even fixes. NPM build system already requires a JavaScript runtime, so is already handled as part of existing maintenance. The point is that Zopfli either needs to be rewritten for a platform-agnostic abstraction they already support, or else Zopfli will be added to a list of native modules to maintain.
> It's not about C per-se, as much as each native compiled dependency creates additional maintenance concerns. Changes to hardware/OS can require a recompile or even fixes.
This is a canard. zopfli is written in portable C and is far more portable than the nodejs runtime. On any hardware/OS combo that one can build the nodejs runtime, they certainly can also build and run zopfli.
Yes, but it was expected. It's like prioritising code readability over performance everywhere but the hot path.
Earlier in my career, I managed to use Zopfli once to compress gigabytes of PNG assets into a fast in-memory database supporting a 50K+ RPS web page. We wanted to keep it simple and avoid the complexity of horizontal scaling, and it was OK to drop some rarely used images. So the more images we could pack into a single server, the more coverage we had. In that sense Zopfli was beneficial.
I wonder what the average tarball size difference would be if you, for example, downloaded everything in one tarball (the full package list) instead of 1-by-1, as the gzip compression would work way better in that case.
Also for bigger companies this is not really a "big" problem as they usually have in-house proxies (as you cannot rely on a 3rd party repository in CI/CD for multiple reasons (security, audit, speed, etc)).
You might save a little bit by putting similar very small files next to each other in the same tarball, but in general I would not expect significant improvements. Gzip can only compress repetitions that are within 32KB of each other.
Switching to a shared cache in the fashion of pnpm would eliminate far more redundant downloads than a compression algorithm needing 20x more CPU.
I’m more concerned about the 40GB of node_modules. Why hasn’t node supported tgz node_modules? That would save 75% of the space or more.
isn't this a file system thing? Why bake it into npm?
efficiency
The OS should do file deduplication, compression and decompression faster than npm. But I guess the issue is that npm cannot give the OS hints about whether to compress a folder or not?
Even if the file system does it that’s still per-file. You still need all the bookkeeping. And you’re not gonna get great compression if a lot of the files are tiny.
The tar contains a whole bunch of related files that probably compress a lot better together, and it’s only one file on disk for the file system to keep track of. It’s gonna be a lot more efficient.
When compiling you’re probably gonna touch all the source files anyway right? So if you load the zip into memory once and decompress it there it’s probably faster than loading each individual file and decompressing each of them, or just loading each individual file raw.
> And you’re not gonna get great compression if a lot of the files are tiny.
The future of compression is likely shared dictionaries, basically recognize the file type and then use a dictionary that is optimized for that file type. E.g. JavaScript/TypeScript or HTML or Rust or C++, etc. That can offset the problems with small files to a large degree.
But again, this is a complex problem and it should be dealt with orthogonally to npm in my opinion.
I also checked and it should be possible for npm to enable file system compression at the OS level on both Windows and MacOS.
Also if it was dealt with orthogonally to npm, then it could be used by pip, and other package systems.
Thank you so much for posting this. The original logic was clear and it had me excited! I believe this is useful because compression is very common and although it might not fit perfectly in this scenario, it could very well be a breakthrough in another. If I come across a framework that could also benefit from this compression algorithm, I'll be sure to give you credit.
5% improvement is basically the minimum I usually consider worthwhile to pursue, but it's still small. Once you get to 10% or 20%, things become much more attractive. I can see how people can go either way on a 5% increase if there are any negative consequences (such as increased build time).
I was under the impression that bzip compresses more than gzip, but gzip is much faster, so gzip is better for things that need to be compressed on the fly, and bzip is better for things that get archived. Is this not true? Wouldn't it have been better to use bzip all along for this purpose?
I wonder if you could get better results if you built a dictionary over the entire npm corpus. I suspect the most common words could easily be reduced to a 16k-word index. It would be much faster, the dictionary would probably fit in cache, and you could even optimize it in memory for cache prefetch.
This seems like a non-starter to me - new packages are added to npm all the time, and will alter the word frequency distribution. If you aren't prepared to re-build constantly and accept that the dictionary isn't optimal, then it's hard to imagine it being significantly better than what you build with a more naive approach. Basically - why try to fine-tune to a moving target?
But is it really moving that fast? I suspect most fundamental terms in programming and their variations do not change often. You will always have keywords, built-ins and the most popular concepts from libs/frameworks.
So it is basically downloading a few hundred KB dictionary every year?
You’d have language keywords and such, yeah.
But past that the most common words in a React project are going to be very different from a Vue project right?
would it really change that quickly? you might get significant savings from just having keywords, common variable names, standard library functions
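As a toy illustration of the preset-dictionary idea discussed above, Node's built-in zlib already accepts a `dictionary` option for raw deflate/inflate; the token list below is invented, and a real dictionary would be trained on a large JS corpus rather than hand-written:
    const zlib = require('node:zlib');

    // Made-up "common tokens" dictionary; back-references into it are cheap.
    const dictionary = Buffer.from(
      'function const return export default import from require module.exports ' +
      'async await Promise Object.defineProperty prototype undefined'
    );

    const src = Buffer.from("export default async function load() { return import('./m.js'); }");

    console.log('plain   :', zlib.deflateSync(src).length);
    console.log('withDict:', zlib.deflateSync(src, { dictionary }).length);

    // The same dictionary has to be supplied on the decompressing end:
    console.log(zlib.inflateSync(zlib.deflateSync(src, { dictionary }), { dictionary }).toString());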
It reminds me of an effort to improve docker image format and make it move away from being just a tar file. I can't find links anymore, but it was a pretty clever design, which still couldn't beat dumb tar in efficiency.
Transferring around dumb tar is actually smart because the HTTPS connection can negotiate a compressed version of it to transfer - e.g. gzip, brotli, etc. No need to bake an unchangeable compression format into the standard.
This presents a great opportunity for alternative npm clients (such as bun, yarn). An almost free advantage over the mainline npm CLI.
I think the main TLDR here [1]:
> For example, I tried recompressing the latest version of the typescript package. GNU tar was able to completely compress the archive in about 1.2 seconds on my machine. Zopfli, with just 1 iteration, took 2.5 minutes.
[1] https://github.com/npm/rfcs/pull/595#issuecomment-1200480148
My question of course would be, what about LZ4, or Zstd or Brotli? Or is backwards compatibility strictly necessary? I understand that GZIP is still a good compressor, so those others may not produce meaningful gains. But, as the author suggests, even small gains can produce huge results in bandwidth reduction.
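For anyone who wants to reproduce that kind of comparison locally, a rough sketch (it assumes the zopfli CLI is installed and on PATH; the tarball name is illustrative):
    const { execFileSync } = require('node:child_process');
    const zlib = require('node:zlib');
    const fs = require('node:fs');

    const tgz = fs.readFileSync('typescript.tgz');     // a tarball as served by the registry
    const tar = zlib.gunzipSync(tgz);                  // the raw tarball inside
    fs.writeFileSync('pkg.tar', tar);

    const regzip = zlib.gzipSync(tar, { level: 9 });   // roughly "gzip -9"
    execFileSync('zopfli', ['--i15', 'pkg.tar']);      // writes pkg.tar.gz, slowly

    console.log('gzip -9:', regzip.length);
    console.log('zopfli :', fs.statSync('pkg.tar.gz').size);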
Ok, but why doesn't the npm registry actually recompress the archives? It could even apply that retroactively, and it wouldn't require zopfli in the npm CLI.
Hashes of the tarballs are recorded in the package-lock.json of downstream dependants, so recompressing the files in place will cause the hashes to change and break everyone. It has to be done at upload time.
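To make that concrete: the `integrity` field in package-lock.json is an SRI hash over the published .tgz bytes, so any server-side recompression produces bytes that no existing lockfile matches. A minimal sketch of how such a value is computed (file name illustrative):
    const crypto = require('node:crypto');
    const fs = require('node:fs');

    const tgz = fs.readFileSync('react-18.2.0.tgz');   // the compressed tarball as published
    const integrity = 'sha512-' + crypto.createHash('sha512').update(tgz).digest('base64');
    console.log(integrity); // same shape as the "integrity" entries in package-lock.json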
The hashes of the uncompressed tarballs would be great. Then the HTTP connection could negotiate a compression format for transfer (which can change over time as HTTP itself changes) rather than baking it into the NPM package standard (which is incredibly inflexible).
But it still can be done on the npm side, right?
Try to use this https://github.com/xthezealot/npmprune
My reading of OP is that it’s less about whether zopfli is technically the best way to achieve a 5% reduction in package size, and more about how that relatively simple proposal interacted with the NPM committee. Do you think something like this would fare better or differently for some reason?
It probably makes more sense to save more bytes and compressor time and just switch to zstd (a bigger scoped effort, sure).
Why would a more complex zip slow down decompression? This comment seems to misunderstand how these formats work. OP is right.
Does npm even default to gzip -9? Wikipedia claims zopfli is 80 times slower under default settings.
Yes: https://github.com/npm/pacote/blob/bf1f60f58bb61f053262f5472...
My experience has been that past -6 or so, gzip files get only a tiny bit smaller in typical cases. (I think I've even seen them get bigger with -9.)
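If you want to check that on your own tarballs, Node's built-in zlib makes the comparison a one-liner per level (file name illustrative):
    const zlib = require('node:zlib');
    const fs = require('node:fs');

    const tar = fs.readFileSync('pkg.tar');
    for (const level of [1, 6, 9]) {
      console.log(`level ${level}:`, zlib.gzipSync(tar, { level }).length);
    }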
Usually people require more than 5% to make a big change
That’s why our code is so slow. Dozens of poor decisions that each account for 2-4% of overall time lost, but 30-60% in aggregate.
This is a nice guy
He should not have been closing this tbh.
5% sounds like a good deal
In summary: It’s a nice feature, which gives nice benefits for often downloaded packages, but nobody at npm cares for the bandwidth?
Nice exercise in bureaucracy.
- I don't see the problem with adding a C dependency to a node project, native modules are one of the nicest things about node.js
- Longer publishing time for huge packages (eg. typescript) is a bigger problem but I don't think it should impact a default. Just give the option to use gzip for the few outliers.
> But the cons were substantial: ...
> This wouldn’t retroactively apply to existing packages.
Why is this substantial? My understanding is that packages shouldn't be touched once published. It seems likely for any change to not apply retroactively.
I wonder if there is a way to install the npm packages without the crap they come included with (like docs, tests, readme etc).
I mean, a 4-5% size reduction for 10-100x the time is not worth it.
That's not actually so straightforward. You pay the 10-100x slowdown once on the compressing side, to save 4-5% on every download - which for a popular package one would expect downloads to be in the millions.
The downloads are cached. The build happens on every publish for every CI build.
Most packages don't publish every CI build. Some packages, such as Typescript, publish a nightly build once a day. Even then, a longer compression time doesn't seem too bad.
Caching downloads on a CDN helps offload work from the main server, it doesn't meaningfully change the bandwidth picture from the client's perspective.
Assuming download and decompression cost to be proportional to the size of the incoming compressed stream, it would break even at 2000 downloads. A big assumption I know, but 2000 is a really small number.
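(To make the arithmetic concrete with purely illustrative numbers: if Zopfli adds, say, 150 seconds of one-time compression and the ~5% saving shaves 75 ms off each download-plus-decompress, break-even is 150 / 0.075 = 2000 downloads.)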
As the author himself said, just React was downloaded half a billion times; that is a lot of saved bandwidth on both sides, but especially so for the server.
Maybe it would make sense to only apply this improvement to packages that are a) either very big or b) downloaded at least a million times each year or so. That would cover most of the savings while leaving most packages and developers out of it.
It absolutely is. Packages are zipped once and downloaded thousands of times.
zopfli is the wrong thing to use here.
If you want an example of where these things go badly:
the standard compression level for rpms on redhat distros is zstd level 19.
This has all the downsides of other algorithms - it's super slow, often 75-100x slower than the default level of zstd[1]. It achieves a few percent more compression for that speed. Compared to even level 10, it's like 0-1% higher compression, but 20x slower.
This is bad enough - the kicker is that at this level, it's slower than xz for ~all cases, and xz is 10% smaller. The only reason to use zstd is because you want fairly good compression, but fast.
So here, they've chosen to use it in a way that compresses really slowly, but gives you none of the benefit of compressing really slowly.
Now, unlike the npm case, there was no good reason to choose level 19 - there were no backwards compatibility constraints driving it, etc. I went through the PR history on this change, it was not particularly illuminating (it does not seem lots of thought was given to the level choice).
I mention all this because it has a real effect on the experience of building rpms - this is why it takes eons to make kernel debuginfo rpms on Fedora, or any large RPM. Almost all the time is spent compressing it with zstd at level 19. On my computer this takes many minutes. If you switch it to even xz, it will do it about 15-20x faster single threaded. (If you thread both of them, xz will win by even more, because of how slow this setting is for zstd. If you use reasonable settings for zstd, obviously, it achieves gigabytes/second in parallel mode.)
Using zopfli would be like choosing level 19 zstd for npm. While backwards compatibility is certainly painful to deal with here, zopfli is not likely better than doing nothing. You will make certain cases just insanely slow. You will save someone's bandwidth, but in exchange you will burn insane amounts of developer CPU.
zopfli is worse than level 19; it can often be 100x-200x slower than gzip -9.
Doesn't npm support insanely relaxed/etc scripting hooks anyway?
If so, if backwards compatibility is your main constraint, you would be "better off" double compressing (i.e. embed xz or whatever plus a bootstrap decompressor, and use the hooks to decompress it on old versions of npm). Or just ship .tar.gz's that, when run, fetch the .xz and decompress it on older npm.
Or, you know, fish in the right pond - you would almost certainly achieve much higher reductions by enforcing cleaner shipped packages (i.e. not including random garbage, etc) than by compressing the garbage more.
[1] on my computer, single threaded level 19 does 7meg/second, the default does 500meg/second. Level 10 does about 130meg/second.
> the standard compression level for rpms on redhat distros is zstd level 19
> The only reason to use zstd is because you want fairly good compression, but fast
I would think having fast decompression is desirable, too, especially for rpms on redhat distros, which get decompressed a lot more often than they get compressed, and where the CPUs doing decompression may be a lot slower than the CPUs doing the compression.
And zstd beats xz in decompression times.
Here's the Fedora page relating to changing RPM from xz level 2 to zstd level 19.
https://fedoraproject.org/wiki/Changes/Switch_RPMs_to_zstd_c...
Just two tables for comparison, first one shows only the decompression for Firefox RPM, second shows compression time, compressed size, and decompression for a large RPM.
You'd think there'd be more data.
Let me start by reiterating - zstd is a great option. I think zstd level 5-10 would have been an awesome choice. I love zstd - it is a great algorithm that really hits the sweet spot for most users between really fast and good compression, and very very fast decompression. I use it all the time.
In this case, yes, zstd has faster decompression, but xz decompression speed is quite fast, even before you start using threads. I have no idea why their data found it so slow.
Here's an example of this: https://web.archive.org/web/20231218003530/https://catchchal...
Even on large compressed archives, xz decompression times do not go into minutes - even on a mid-range 16-year-old Intel CPU like the one used in this test. Assuming the redhat data on this single rpm is correct, I would bet it's more related to some weird buffering or chunking issue in the rpm compressor library usage than to actual xz decompression times. But nobody seems to have bothered to look at why the times seemed ridiculous, at all; they just sort of accepted them as is.
They also based this particular change on the idea that they would get a similar compression ratio to xz - they don't, as I showed.
Anyway, my point really wasn't "use xz", but that choosing zstd level 19 is probably the wrong choice no matter what.
Their own table, which gives you data on one whole rpm, shows that zstd level 15 gave them compression comparable to xz (on that RPM; it's wrong in general), at compression speed similar to xz (also wrong in general; it's much slower than that).
It also showed that level 19 was 3x slower than that for no benefit.
Result: Let's use level 19.
Further, the claim "Users that build their packages will experience slightly longer build times." is total nonsense - their own table shows this. If you had an RPM that was 1.6GB but took 5 minutes to build (not uncommon, even for that size, since it's usually assets of some sort), you are now taking 30 minutes, and spending 24 of them on compression.
Before it took ... 3 minutes to do compression.
Calling this "slightly longer build times" is hilarious at best.
I'll make it concrete: their claim is based on building Firefox and compressing the result, and amusingly, even there it's still wrong. Firefox RPM build times on my machine are about 10-15 minutes. Before it took 3 minutes to compress the RPM. Now it takes 24.
This is not "slightly longer build times". Before it took 30% of the build time to compress the RPM.
Now it takes 24 minutes, or 2.5x the entire build time.
That is many things, but it is not a "slightly longer build time".
I'll just twist the knife a little more:
RPM supports using threading for the compressors, which is quite nice. It even supports basing it on the number of cpus you have set to use for builds. They give examples of how to do it, including for level 19:
The table with this single rpm even tested it with threads! Despite this, they did not turn on threads in the result...
So they are doing all this single threaded for no particular reason - as far as I can tell, this is a bug in this well thought out change. All this to say - I support NPM in being careful about this sort of change, because I've seen what happens when people aren't.
> So they are doing all this single threaded for no particular reason - as far as i can tell, this is a bug in this well thought out change
Could be because they want reproducible builds.
First, there is no data or evidence to suggest this is the case, so I'm not sure why you are trying to make up excuses for them?
Second, zstd is fully deterministic in multithreaded cases. It does not matter what threading you select, it will output byte for byte identical results.
See a direct answer to this question here: https://github.com/facebook/zstd/issues/2079
I believe all of their compressors are similarly deterministic regardless of the number of threads, but I admit I have not checked every one of them under all conditions.
If they had questions, they could have, you know, asked, and would have gotten the same answer.
But that just goes back to what I said - it does not appear this change was particularly well thought out.
The fact that you are pursuing this is admirable.
But this whole thing sounds too much like work. Finding opportunities, convincing entrenched stakeholders, accommodating irrelevant feedback, pitching in meetings — this is the kind of thing that top engineers get paid a lot of money to do.
For me personally open source is the time to be creative and free. So my tolerance for anything more than review is very low. And I would have quit at the first roadblock.
What’s a little sad, is NPM should not be operating like a company with 1000+ employees. The “persuade us users want this” approach is only going to stop volunteers. They should be proactively identifying efforts like this and helping you bring it across the finish line.
I think that the reason NPM responded this way is because it was a premature optimization.
If/when NPM has a problem - storage costs are too high, or transfer costs are too high, or user feedback indicates that users are unhappy with transfer sizes - then they will be ready to listen to this kind of proposal.
I think their response was completely rational, especially given a potentially huge impact on compute costs and/or publication latency.
I disagree with it being a premature optimisation. Treating everything that you haven’t already personally identified as a problem as a premature optimisation is cargo culting in its own way. The attitude of not caring is why npm and various tools are so slow.
That said, I think NPM’s response was totally correct - explain the problem and the tradeoffs. And OP decided the tradeoffs weren’t worth it, which is totally fair.
> What’s a little sad, is NPM should not be operating like a company with 1000+ employees. The “persuade us users want this” approach is only going to stop volunteers. They should be proactively identifying efforts like this and helping you bring it across the finish line.
Says who?
Says an engineer? Says a product person?
NPM is a company with 14 employees, with a system integrated into countless extremely niche and weird setups they cannot control. Many of those integrations might make a professional engineer's hair catch fire - "it should never be done this way!" - but in the real world, the wrong way is what you see the majority of the time. There's no guarantee that many of the downloads come from the official client, just as one example.
The last thing they need, or I want, or any of their customers want, or their 14 employees need, is something that might break backwards compatibility in an extremely niche case, anger a major customer, cause countless support tickets, all for a tiny optimization nobody cares about.
This is something I've learned here about HN that, for my own mental health, I now dismiss: Engineers are obsessed with 2% optimizations here, 5% optimizations there; unchecked, it will literally turn into an OCD outlet, all for things nobody in the non-tech world even notices, let alone asks about. Just let it go.
NPM is a webservice. They could package the top 10-15 enhancements call it V2. When 98% of traffic is V2 turn V1 off. Repeat every 10 years or so until they work their way into having a good protocol.
> Engineers are obsessed with 2% optimizations here, 5% optimizations there; unchecked, it will literally turn into an OCD outlet, all for things nobody in the non-tech world even notices, let alone asks about. Just let it go.
I absolutely disagree with you. If the world took more of those 5% optimisations here and there, everything would be faster. I think more people should look at those 5% optimisations. In many cases they unlock knowledge that results in a 20% speed-up later down the line.
An example from my past - I was tasked with reducing the running time of a one-shot tool we were using at $JOB. It was taking about 15 minutes to run. I shaved off seconds here and there with some fine-grained optimisations, and tens of seconds with some modernisation of some core libraries. Nothing earth-shattering, but improvements nonetheless. One day, I noticed a pattern was repeating and I was fixing an issue for the third time in a different place (searching a gigantic array of stuff for a specific entry). I took a step back and realised that if I replaced the mega list with a hash table it might fix every instance of this issue in our app. It was a massive change, touching pretty much every file. And all of a sudden our 15 minute runtime was under 30 seconds.
People used this tool every day; it was developed by a team of engineers wildly smarter than me. But it had grown, and nobody really understood the impact of the growth. When it started, that array was 30, 50 entries. On our project it was 300,000 and growing every day.
Not paying attention to these things causes decay and rot. Not every change should be taken, but more people should care.
> Says who?
> Says an engineer?
I prevent cross-site scripting, I monitor for DDoS attacks, emergency database rollbacks, and faulty transaction handlings. The Internet heard of it? Transfers half a petabyte of data every minute. Do you have any idea how that happens? All those YouPorn ones and zeroes streaming directly to your shitty, little smart phone day after day? Every dipshit who shits his pants if he can't get the new dubstep Skrillex remix in under 12 seconds? It's not magic, it's talent and sweat. People like me, ensuring your packets get delivered, un-sniffed. So what do I do? I make sure that one bad config on one key component doesn't bankrupt the entire fucking company. That's what the fuck I do.
Open source needs to operate differently than a company because people don’t have time/money/energy to deal with bullshit.
Hell. Even 15 employees larping as a corporation is going to be inefficient.
what you and NPM are telling us, is that they are happy to take free labor, but this is not an open source project.
> Engineers are obsessed with 2% optimizations here
Actually in large products these are incredible finds. But ok. They should have the leadership to know which bandwidth tradeoffs they are committed to and tell him immediately it’s not what they want, rather than sending him to various gatekeepers.
Correct; NPM is not an "open source project" in the sense of a volunteer-first development model. Neither is Linux - over 80% of commits are corporate, and have been for a decade. Neither is Blender anymore - the Blender Development Fund raking in $3M a year calls the shots. Every successful "large" open source project has outgrown the volunteer community.
> Actually in large products these are incredible finds.
In large products, incredible finds may be true; but breaking compatibility with just 0.1% of your customers is also an incredible disaster.
> breaking compatibility with just 0.1%
Yes. But in this story nothing like that happened.
But NPM has no proof their dashboard won't light up full of corporate customers panicking the moment it goes to production; because their hardcoded integration to have AWS download packages and decompress them with a Lambda and send them to an S3 bucket can no longer decompress fast enough while completing other build steps to avoid mandatory timeouts; just as one stupid example of something that could go wrong. IT is also demanding now that NPM fix it rather than modify the build pipeline which would take weeks to validate, so corporate's begging NPM to fix it by Tuesday's marketing blitz.
Just because it's safe in a lab provides no guarantee it's safe in production.
That’s an argument against making any change to the packaging system ever. “It might break something somewhere” isn’t an argument, it’s a paralysis against change. Improving the edge locality of delivery of npm packages could speed up npm installs. But speeding up npm installs might cause the CI system which is reliant on it for concurrency issues to have a race condition. Does that mean that npm can’t ever make it faster either?
It is an argument. An age old argument:
"If it ain't broke, don't fix it."
This attitude is how, in an age of gigabit fiber, 4GB/s drive write speeds, and 8x4 GHz cores with SIMD instructions, it takes 30+ seconds to bundle a handful of JavaScript files.
disable PRs if this is your policy.
Ok, but why is the burden on him to show that? Are they not interested in improving bandwidth and speed for their users?
The conclusion of this line of reasoning is to never make any change.
If contributions are not welcome, don’t pretend they are and waste my time.
> can no longer decompress fast enough
Already discussed this in another thread. It’s not an issue.
While NPM is open source, it's in the awkward spot of also having... probably hundreds of thousands if not millions of professional applications depend on it; it should be run like a business, because millions depend on it.
...which makes it all the weirder that security isn't any better, as in, publishing a package can be done without a review step on the npm side, for example. I find it strange that they haven't doubled down on enterprise offerings, e.g. creating hosted versions (corporate proxies), reviewed / validated / LTS / certified versions of packages, etc.
The problem is that this guy is treating open source as if it was a company where you need to direct a project to completion. Nobody in open source wants to be told what to do. Just release your work, if it is useful, the community will pick it up and everybody will benefit. You cannot force your improvement into the whole group, even if it is beneficial in your opinion.
> where you need to direct a project to completion
Do you want to get a change in, or not?
Is this a project working with the community or not?
> Just release your work
What motivation exists to optimize package formats if nobody uses that package format? There are no benefits unless it’s in mainline.
> Nobody in open source wants to be told what to do
He’s not telling anybody to do work. He is sharing an optimization with clear tradeoffs - not a new architecture.
> You cannot force your improvement into the whole group
Nope, but communication is key. “put up a PR and we will let you know whether it’s something we want to pursue”.
Instead they put him through several levels of gatekeepers where each one gave him incorrect feedback.
“Why do we want to optimize bandwidth” is a question they should have the answer to.
If this PR showed up on my project I would say “I’m worried about X,Y,Z” we will set up a test for X and Y and get back to you. Could you look into Z?
Imagine being in the middle of nowhere, in winter, on a Saturday night, on some farm, knee deep in cow piss, servicing some 3rd party feed dispenser, only to discover that you have a possible solution but it's in some obscure format instead of .tar.gz. Nearest internet 60 miles away. This is what I always imagine happening when some new obscure format comes into play: imagine the poor fella, alone, cold, screaming. So incredibly close to his goal, but ultimately stopped by some artificial, unnecessary, made-up bullshit.
I believe zopfli compression is backwards compatible with DEFLATE, it just uses more CPU during the compression phase.
Correct - it is just looking for matches harder, and encoding harder.