The main goal of this issue is to make sure we don't forget to take this into account when implementing Roc's standard hash maps! It can be closed when we have a hash flooding mitigation strategy in place in the standard library.
Hash flooding mitigation is a complicated issue, but I did a bunch of investigation and thinking about it in 2019, and I want to record some thoughts here to get them out of my head. This can all be revisited when we get to adding an actual hashmap implementation to Roc.
Hash Flooding Attacks
So hash flooding attacks are a thing. The basic idea is to DoS a web server by sending it HTTP payloads which are malicious in a very specific way: the attacker knows the server will parse those payloads into hash maps, and the payloads are designed to produce so many hash collisions that the hash map becomes effectively a linked list.
Lookups into linked lists are super slow compared to hash lookups, and there's apparently been a proof-of-concept of being able to lock up an entire server (Rails or PHP or something - I forget) with a single POST. This suggests an individual could theoretically DoS a vulnerable server without needing a whole botnet to turn DoS into DDoS.
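To make the degradation concrete, here's a small Rust sketch (Rust because the design idea below starts from hashbrown) that simulates the worst case by forcing every key to hash to the same value - essentially what a crafted payload achieves against a predictable hash function. The hasher here is deliberately pathological, purely for illustration:

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

// A worst-case hasher for demonstration: every key hashes to the same value,
// which is effectively what an attacker engineers with crafted payloads.
#[derive(Default)]
struct ConstantHasher;

impl Hasher for ConstantHasher {
    fn finish(&self) -> u64 {
        0 // every key collides
    }
    fn write(&mut self, _bytes: &[u8]) {
        // ignore the input entirely
    }
}

fn main() {
    // With every key landing in the same bucket, each insert and lookup
    // degrades to a linear scan of everything already in the map, so even a
    // few thousand keys make this noticeably slow.
    let mut map: HashMap<String, u32, BuildHasherDefault<ConstantHasher>> =
        HashMap::default();
    for i in 0..10_000 {
        map.insert(format!("header-{i}"), i);
    }
    // This lookup now walks a long collision chain instead of jumping
    // straight to one bucket.
    assert_eq!(map.get("header-9999"), Some(&9_999));
}
```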
Hash flooding attacks have been demonstrated in proof-of-concept laboratory attacks against real-world systems (e.g. Perl, Rails), but I have not been able to find any documented cases of a real production system actually being hit by one. Nevertheless, I'm convinced the concern is worth taking seriously.
Hash Flooding Mitigation: SipHash
In this talk, the people behind some of the proofs-of-concept (along with someone who writes software for DNS servers, which are vulnerable to these attacks by default) discuss how to defend against hash flooding.
The mitigation strategy they recommend is to use the SipHash hashing algorithm, which was published by researchers working on these hash flooding proofs-of-concept. SipHash is a keyed hash function: it gets seeded with randomly generated data at runtime, which makes it infeasible for an attacker to predict which keys will collide.
I really want to avoid having random hashing in Roc. We can get a lot of benefits from everything in the language being a pure function - in terms of caching, reproducibility of test runs, etc. - and using a randomly seeded hasher would violate some of those assumptions. For example, if a different seed were chosen before every test run, then we'd have tests that are both "pure" and also potentially flaky.
Those flakes might reveal improper dependence on the hashing function on the part of the implementation, which might be worth the disruption, but they might also be net harmful if it's only the test that relies on the hash. Either way, it'd be pretty frustrating if clearing your cache caused one or more of your tests to flip from passing to failing (or vice versa) when they're all pure functions.
Rust uses SipHash by default.
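For reference, this is roughly what that randomness looks like with Rust's standard library: `RandomState` (the default `BuildHasher` for `std::collections::HashMap`) seeds SipHash with randomly generated keys, which is the nondeterminism described above. A minimal sketch:

```rust
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hash, Hasher};

fn main() {
    // RandomState seeds SipHash with randomly generated keys, so the same
    // value can hash to a different number from one run of the program to
    // the next - exactly the nondeterminism that would leak into "pure" code.
    let state = RandomState::new();
    let mut hasher = state.build_hasher();
    "some key".hash(&mut hasher);
    // Run this twice as two separate processes and the printed numbers will
    // generally differ.
    println!("{}", hasher.finish());
}
```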
Hash Flooding Mitigation: Tree Conflict Resolution
The arguably more robust defense against hash flooding attacks is to do collision resolution using a tree rather than a linked list. This way, instead of the collisions causing the hash table to degrade into what is effectively a linked list, they cause it to degrade into what is effectively a tree, which takes logarithmic rather than linear time to search.
In the talk, this strategy is acknowledged only briefly and dismissed as something that isn't really used in practice: hash map implementors tend not to bother with tree-based collision resolution, since trees are more complicated to implement than linked lists.

Tree-based resolution sounds worth the effort for Roc. It sacrifices neither referential transparency nor performance, because we can put it behind a conditional based on the number of collisions encountered so far (e.g. don't switch from linear probing to a tree until after, say, 16 collisions, at which point something fishy is probably going on) - a branch that will be correctly predicted essentially every time outside of an actual hash flooding attack.
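To make that concrete, here's a rough Rust sketch of a bucket that falls back to a tree after a threshold of collisions. The names, layout, and the threshold of 16 are illustrative assumptions, not a proposed implementation:

```rust
use std::collections::BTreeMap;

// Assumption for illustration: "something fishy" after 16 collisions.
const TREE_THRESHOLD: usize = 16;

// Collision resolution stays a cheap flat scan in the normal case, and
// degrades to a tree (O(log n) lookups) instead of a linked list (O(n)).
enum Bucket<K: Ord, V> {
    Flat(Vec<(K, V)>),
    Tree(BTreeMap<K, V>),
}

impl<K: Ord, V> Bucket<K, V> {
    fn insert(&mut self, key: K, value: V) {
        if let Bucket::Flat(entries) = self {
            // Update in place if the key is already present.
            if let Some(slot) = entries.iter_mut().find(|(k, _)| *k == key) {
                slot.1 = value;
                return;
            }
            if entries.len() < TREE_THRESHOLD {
                entries.push((key, value));
                return;
            }
            // Suspiciously many collisions in one bucket: switch to a tree.
            // Outside of an attack, this branch is essentially never taken,
            // so it's correctly branch-predicted in normal operation.
            let tree: BTreeMap<K, V> = entries.drain(..).collect();
            *self = Bucket::Tree(tree);
        }
        if let Bucket::Tree(tree) = self {
            tree.insert(key, value);
        }
    }

    fn get(&self, key: &K) -> Option<&V> {
        match self {
            Bucket::Flat(entries) => {
                entries.iter().find(|(k, _)| k == key).map(|(_, v)| v)
            }
            Bucket::Tree(tree) => tree.get(key),
        }
    }
}
```

In the common case only the `Flat` branch ever runs, so the extra check is exactly the predictable branch described above.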
One reason Rust might not do hash flooding defense this way is that in order to have an efficient tree, you need an `Ord` implementation as well as a `Hash` implementation for the keys. Rust might not want to impose that restriction on all `HashMap` users, but in Roc we can automatically derive one with no extra work on the part of end users anyway.

Theoretically we could do that extra `Ord` derivation only in production builds, but in practice, let's be honest: nearly all hash keys are strings or integers anyway, so it's probably not going to save anyone any noticeable amount of code gen.
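Purely as an illustration of the bound difference being discussed (`FloodResistantMap` is a made-up name, not a real type), the key constraints would look roughly like this in Rust terms:

```rust
use std::hash::Hash;

// A stand-in for whatever the real table layout would be.
struct FloodResistantMap<K, V> {
    entries: Vec<(K, V)>,
}

// Rust's std HashMap only requires `K: Hash + Eq` on its keys; the tree
// fallback additionally requires `Ord`, which is the extra restriction
// discussed above.
impl<K: Hash + Eq + Ord, V> FloodResistantMap<K, V> {
    fn new() -> Self {
        FloodResistantMap { entries: Vec::new() }
    }
}
```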
Hash Flooding Mitigation: Re-Hashing
I looked into this a bit at one point, and it seemed that re-hashing could also be a way to mitigate hash flooding attacks. That wouldn't even require an ordering constraint. I haven't investigated it deeply though!
Hash Flooding Mitigation: Outside the Hash Map
There are various server-specific ways to defend against hash flooding. For example, to defend against hash flooding attacks using HTTP headers specifically, you could cap the number and length of HTTP headers, such that after encountering excessively long or excessively many headers, the request gets rejected immediately. This sort of thing should probably be done regardless (along with timeouts), as other DoS attacks rely only on sending lots of data.
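As a hypothetical example of that kind of cap, a request sanity check might look something like this in Rust; the limits and the representation of headers are made up for illustration:

```rust
// Assumed limits, purely for illustration; not tied to any particular server.
const MAX_HEADERS: usize = 100;
const MAX_HEADER_BYTES: usize = 8 * 1024;

fn headers_look_sane(headers: &[(String, String)]) -> bool {
    if headers.len() > MAX_HEADERS {
        return false; // excessively many headers: reject before doing more work
    }
    // Reject any individual header that is excessively long.
    headers
        .iter()
        .all(|(name, value)| name.len() + value.len() <= MAX_HEADER_BYTES)
}
```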
However, assuming tree-based conflict resolution can be done at the insignificant runtime cost of one always-correctly-predicted branch (plus some code size), it would benefit platform and application authors alike: they'd be safe by default even if they overlooked a particular attack vector.
Design Idea
It seems like a reasonable idea to start with `hashbrown` - which is what Rust used to speed up its standard `HashMap`, and which is based on Google's highly optimized SwissTable implementation. From there, we'd add the `Ord` requirement and the extra logic to fall back on a tree after a certain number of collisions.
Alternatively, we could avoid the `Ord` requirement by using rehashing. The basic idea would be that after a suspiciously high number of collisions (perhaps eight, for example), we redo the hash calculation passing a tuple of `(val, salt)` instead of just `val`, in the hope that this will no longer collide. (The salt could be, for example, the total number of collisions we'd encountered so far, or in a linear probing scheme, the most recently tried bucket index.) The idea is that it would be infeasible to find a payload which not only triggered the initial collisions, but which also somehow continued to collide after salted rehashing.