
I'm rather surprised that string hashes aren't randomized in the JVM - or at least that's what the article implies. These days, it's a fairly common mitigation for DDoS attacks based on hash collisions. Does Java not do it to preserve backwards compatibility, because too much existing code relies on a specific hashing algorithm?


The hashing algorithm is specified in the docs:

   s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
Per Hyrum's law, there's no changing it now.

https://docs.oracle.com/javase/8/docs/api/java/lang/String.h...
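
For clarity, that formula is just this loop (a sketch equivalent to what String.hashCode computes; the real implementation also caches the result):

    // Equivalent to the documented formula:
    // s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
    static int stringHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);   // int overflow wraps, as Java specifies
        }
        return h;
    }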


The Java hashCode contract is about fast hash calculation, not resistance to collision searches. Its sole purpose is use in collections. It must not be used in situations where you need cryptographic properties.


So, the problem here is that you're thinking of "cryptographic properties" as meaning only the "cryptographic hashes", such as those built with the Merkle–Damgård construction (SHA-512/256 being the one you might reasonably pick today).

But it's actually desirable to have some cryptographic properties in a faster one-way function used for making hash tables. Read about SipHash to see why.

Because Java didn't (and, as others have discussed, now can't) choose otherwise, the hash table structures it provides must resist sabotage via collisions - which isn't necessary if your attacker can't collide the hash you use.


Java hashCodes are just 4 bytes; there will always be collisions


The point is that since the hash is known and predictable, an attacker who can choose what keys get added to the map can choose them so every key collides, resulting in a DoS.

I assume Java has to add some other layer to avoid this, rather than using a collision-resistant hash scheme.
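
To make that concrete: "Aa" and "BB" have the same Java hash code (2112), and because the hash is the polynomial above, any string built from k of those two-character blocks collides with any other, so an attacker can cheaply generate 2^k colliding keys. A rough sketch:

    import java.util.ArrayList;
    import java.util.List;

    // Builds 2^k distinct strings that all share one String.hashCode value,
    // by concatenating the colliding blocks "Aa" and "BB".
    static List<String> collidingKeys(int k) {
        List<String> keys = new ArrayList<>(List.of(""));
        for (int i = 0; i < k; i++) {
            List<String> next = new ArrayList<>();
            for (String prefix : keys) {
                next.add(prefix + "Aa");
                next.add(prefix + "BB");
            }
            keys = next;
        }
        return keys;   // inserting these into a HashMap piles them all into one bucket
    }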


> added to the map can choose them so every key collides, resulting in a DoS

The application creator needs to anticipate this threat model, and they can prepare for it (for example, by salting the keys).

But imposing on every user a hash that is collision-resistant but costs performance is unjustified, because not every user needs it.


There are specialized collections in external libraries for these kinds of situations to be used when you need them (and are ready to pay for them in terms of performance).


I don't imagine it was ivan_gammel's intention, but his mention of the original CCC/Tomcat issue I think makes the point quite clear: have you ever had a hashmap of HTTP headers to values? I hope it used one of those external libraries, because otherwise that's now a vector for DoS'ing any app that has one.

User-influenced/controlled input happens way more often than we expect. I think the more sensible approach would be for the map to be safe by default, and to reach for a high-performance external map for those times when that particular data structure is the bottleneck.


>I think makes the point quite clear: have you ever had a hashmap of HTTP headers to values?

The number of headers is limited by default to a relatively small value (10k), which even in the worst case of hash collisions still results in relatively fast lookups due to the tree-like implementation of HashMap.


> due to the tree-like implementation of HashMap.

Exactly, this data structure has to provide counter-measures to cope with the fallout from this other poor choice.

The fast data structures for this work don't need a tree here; they're open-addressed instead. But that means they can't cope if your attacker can arbitrarily collide input, so, too bad.


This data structure has a tree fallback for hash collisions in general, which happen regardless of hash algorithm by the very definition of a hash. It is not a coping mechanism for a poor choice; it’s an elegant design allowing better performance for Comparable keys. By coincidence it mitigates the DoS attack too.


All the closed addressing strategies mean more allocator burden. In this case, each of these trees needs to grow separately. But with open addressing we can avoid that.

Suppose we have 10'000 key->value pairs to store. A Swiss Table (a popular open-addressed hash table design) will need space for 16384 contiguous pairs plus metadata†. So it'll do that allocation once, but with closed addressing and trees you're paying to grow trees during insertion, and you can't know which trees will grow and which are unused.

You're correct that the Swiss Table will see collisions, but an attacker can't choose them, so they're rare. Each collision costs us a search step, but because the Java design incurs a pointer chase (to find the tree) for every lookup, that's actually the same price [one memory fetch] as a single collision for the Swiss Table - yet most of the time the Swiss Table sees fewer than 1.0 collisions per lookup.

† It's ensuring no more than 87.5% of the storage is used and then rounding up to a power of two, because that means less work for each look-up step. 11429 would be enough space, but 16384 is the next power of two above that.
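
The sizing arithmetic, as a tiny sketch (assuming a 7/8 maximum load factor and power-of-two table sizes; real implementations differ in details):

    // Slots a Swiss-Table-style map reserves for n entries:
    // enough that load stays <= 87.5%, rounded up to a power of two.
    static int swissCapacity(int n) {
        int minSlots = (int) Math.ceil(n / 0.875);         // 10_000 -> 11_429
        return Integer.highestOneBit(minSlots - 1) << 1;   // -> 16_384
    }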


> But with open addressing we can avoid that.

In risk management there are several different approaches: accept, mitigate, transfer, avoid. What makes you think this design must avoid certain risks? The developers of standard libraries like Java's usually base their decisions on industry feedback; IIRC, the Oracle team had a lot of telemetry data, and they are good engineers. Do you have any insights proving that their approach is wrong?


> IIRC, the Oracle team had a lot of telemetry data, and they are good engineers. Do you have any insights proving that their approach is wrong?

Time's arrow

Java is a Sun (Microsystems) language; Oracle just bought the entire company much later. So one problem here is that you're imagining this as a decision somebody made this century, but it isn't: it's a decision from the early 1990s.

Computer science is a young discipline. In the early 1990s open-addressed hash tables did exist, but it was very usual to write the simple chained hash table with closed addressing, and as I understand it that's exactly what was provided in Java 1.0. This data structure is today known to be a bad idea, and I've criticised the choice in, say, C++ to standardize the exact same data structure (as std::unordered_map) in 2011. But that's twenty years later.

At some point (I believe this century?) Java engineers went back and improved their hash table to use trees rather than plain chains, but they were constrained by the baked-in hash, so they could not have picked radically modern data structures - which require a different hash strategy - without significant disruption.

When I got a CS degree, the Introspective sort was so new that I'd have had to read bleeding-edge research papers to know about it. Today it's very silly if you claim to provide a standard library sort and it's not at least as good as Introsort.

The Swiss Table is from the 2010s; it's new enough that there's a YouTube video of its designers introducing the data structure, not because they did that specifically on purpose, but because they gave a talk and obviously all talks are filmed and uploaded to YouTube these days. Hopscotch hashing - another popular open addressing strategy - is from 2008.

So, the question isn't "Is this the best idea today?" but only "Given what they knew 30+ years ago, was this reasonable back then?" and yeah, of course it was reasonable back then but that doesn't magically mean it's the best choice now.

Edited: Added paragraph about the later choice to use trees.


Standard library design choices are influenced not only by algorithm performance, but also by use cases, existing contracts, etc. You did not answer my question: why do you think choosing an alternative algorithm and breaking the existing hash contract would be better?


> You did not answer my question: why do you think choosing an alternative algorithm and breaking the existing hash contract would be better?

But I don't think Java should go break all this stuff. I'm just explaining why - in fact - we might want some cryptographic properties for hash algorithms. It can be true both that Java made reasonable tradeoffs at the time (in the 1990s) and that you shouldn't copy those choices today.

There's a long list of things Java picked that in hindsight weren't a good idea, and in many cases it would have been really hard to guess that when Java was invented - however, today you should not choose what Java did. Biggest example: the 16-bit char. Java was invented when UCS-2 might happen, and in that light a 16-bit char makes sense. A few years later it was obvious UCS-2 was not going to work out, so you're choosing between UTF-8 and UTF-16, and that's easy: pick UTF-8. People who'd already bought into the 16-bit char, such as Microsoft and Sun's Java team, were stuck with UTF-16 as a booby prize.
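
A small illustration of what the 16-bit char means in practice (length() counts UTF-16 code units, not characters):

    String s = "a\uD83D\uDE00";                            // "a" plus U+1F600, an emoji outside the BMP
    System.out.println(s.length());                        // 3: UTF-16 code units (the emoji is a surrogate pair)
    System.out.println(s.codePointCount(0, s.length()));   // 2: actual Unicode code points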


I think I lost you here, actually. A hash by definition is just a mapping of arbitrary sets to large integers. A cryptographic hash adds a computational-complexity requirement so that it is prohibitively hard to find collisions (i.e. it is slow by design - a bad choice for collections, which need fast hashing). An unstable hash is simply one that is not guaranteed to persist between two different executions of a program (but it obviously has to be the same for equal objects within a single execution). Neither collision resolution technique requires a cryptographic or even an unstable hash. Collections only need the hash to be sufficiently random on typical datasets for optimal performance. I would even question whether an unstable hash can mitigate hash flooding attacks at all, or whether it is possible to determine the hashing parameters of a black box somehow. A limit on the number of parameters, and efficiency at high load factors/high collision counts, are more important for security.

Many programming languages do not use open addressing and prefer chaining (Java, C#, Golang, std::unordered_map in C++) for simplicity and flexibility. There are trade-offs to be considered; it is not a simple choice of the absolute best-known algorithm. I don't think Java made the wrong choice here. Notably, there was always the possibility of introducing a specific collection with some modern algorithm as an alternative to HashMap. That did not happen, not least because it was not considered that important. I'm sure that in cases where open addressing offers substantial benefits, developers make a conscious choice and use the appropriate data structures.

Regarding strings, since Java 9 (released a long time ago) they can be stored in byte arrays if they contain only Latin-1 characters, and the core team is open to further enhancements.

See https://openjdk.org/jeps/254


No, the cryptographic hash functions are not slow by design. This is a nasty misunderstanding which leads to stuff like MD5(password) in PHP [not the PHK MD5 crypt() algorithm used in Unix systems in the 1990s; this was literally MD5(password), and it's a bad idea, do not do this].

Cryptographic hash functions such as the SHA-2 family resist construction of a second pre-image: given N and H(N), there's no way for me to find an M such that H(M) == H(N) but M != N that's faster than brute-force trying all possible values for M. So far everything we have which achieves this is markedly slower than, say, FNV-1, which would once have been the most likely hashing algorithm for a hash table.
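
For comparison, FNV-1 (32-bit) is just a multiply and an XOR per byte, which is why it's so much cheaper than anything in the SHA-2 family - and also why an attacker who knows the algorithm can collide it at will. A rough sketch:

    // FNV-1, 32-bit: multiply then XOR per byte. Very fast, but unkeyed,
    // so anyone who knows the algorithm can construct colliding inputs.
    static int fnv1_32(byte[] data) {
        int hash = 0x811C9DC5;        // FNV offset basis, 2166136261
        for (byte b : data) {
            hash *= 0x01000193;       // FNV prime, 16777619
            hash ^= (b & 0xFF);
        }
        return hash;
    }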

The one-way password hashes, some of which are based on the cryptographic hashes, are slow by design. You should not use passwords, but on HN this seems like a lost cause, so, assuming you insist on using passwords, these algorithms are specifically what you need. Fighting about which one to use ("oh no, PBKDF2 isn't memory-hard, blah blah blah") is also a bottomless hole of HN nonsense, so, fine, pick whichever of them you're certain is the only correct one.

But we do ideally want some cryptographic properties for a hash to be used in a hash table, and your instinct that merely using a random factor and hoping won't work was correct. You need a correctly designed function, which is what SipHash is. What you want is a keyed function with a key K such that if I know N and H(K,N) but not K, I can't find an M such that H(K,M) == H(K,N) but M != N except by brute force.
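
To illustrate the shape of that keyed function (this is not SipHash - HMAC-SHA256 from the JDK stands in here purely to show the interface; it has the required property but is far too slow for a real hash table):

    import java.nio.charset.StandardCharsets;
    import java.security.GeneralSecurityException;
    import java.security.SecureRandom;
    import javax.crypto.Mac;
    import javax.crypto.spec.SecretKeySpec;

    // Keyed hash sketch: without the per-process key K, an attacker cannot
    // construct inputs whose hashes collide. SipHash plays this role cheaply
    // in languages that randomize their hash tables.
    final class KeyedHash {
        private final byte[] key = new byte[16];

        KeyedHash() { new SecureRandom().nextBytes(key); }   // K chosen at startup

        int hash32(String s) {
            try {
                Mac mac = Mac.getInstance("HmacSHA256");
                mac.init(new SecretKeySpec(key, "HmacSHA256"));
                byte[] d = mac.doFinal(s.getBytes(StandardCharsets.UTF_8));
                // fold the first four bytes of the MAC into an int
                return (d[0] & 0xFF) | (d[1] & 0xFF) << 8 | (d[2] & 0xFF) << 16 | (d[3] & 0xFF) << 24;
            } catch (GeneralSecurityException e) {
                throw new IllegalStateException(e);
            }
        }
    }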

> Many programming languages do not use open addressing and prefer chaining (Java, C#, Golang, std::unordered_map in C++) for simplicity

This was true in Golang, but today the map in Go is a modified Swiss Table, which of course delivers improved performance. We've seen why Java is stuck where it is. The C++ std::unordered_map is known to have poor performance, so C++ programmers have many alternatives [including, of course, Swiss Tables], such as a really nice modern offering from Boost.

I'm actually not sure about the guts of the C# hash tables at all; it's a complicated beast full of conditionally defined elements because it's reused by CLR internals as well as provided as a type for us users. It's not a conventional design of any sort; there seems to have been (which is common for Microsoft) some degree of Not Invented Here - just make it up as we go along. It's storing pre-calculated hashes with each key->value pair, it's tracking how many collisions it has... I dunno, I wouldn't do this; don't copy it without seeing hard numbers for why this is good.


There’s no reason to add extra properties and associated complexity to the hash used in collections because of one exotic use case. To be able to execute hash flooding, the server must accumulate arbitrary user inputs in a hash map, which is problematic design anyway. Even if that made sense, what kind of function could work as a 32-bit hash? You mention SipHash as an example; it is a poor example anyway, because this function requires a secret key - meaning Java would have to do the key management behind the scenes (just for strings? for Number subclasses?) or impose key management on API users. What’s the point? The case is so rare that it’s easier for developers of vulnerable applications to use a wrapper with the desired hash properties.


It is not at all an exotic use case, though. It is very common to add external strings as keys to collections. This is precisely why that attack was effective in the first place, and why so many languages these days do hash randomization. E.g. .NET does that.


Unconstrained user input in a hash map big enough to produce an observable effect is not an exotic use case compared to everything else?


The alternative is to never key a hash map from user input. That is a choice one can make, but it seems much harder to achieve (across every program?) than having the implementation be safe in case someone does key off user input.


The original report from CCC demonstrated an attack on HTTP servers using a large number of request parameters with colliding hashes. That vulnerability was fixed by verifying and limiting user inputs (e.g. Tomcat 7.0.23 limited the number of parameters to 10,000 by default, among other improvements). You cannot have an unconstrained collection for user inputs anyway, otherwise a memory-based DoS attack becomes possible. So yes, any program that handles user inputs must take care of such scenarios. The best defense perimeter is the input parsers - not just for HTTP parameters and headers, but also JSON and XML deserializers, which must be hardened against various types of attacks anyway. Many scenarios where a vulnerability may theoretically exist in application code are impractical for an actual attack, e.g. when inputs are processed after user authentication or a rate-limit filter.


Even in collections, an unstable hash is desirable -- to avoid denial of service attacks caused by attacker-controlled hash collisions.

For example, Python back in 2012: https://nvd.nist.gov/vuln/detail/cve-2012-1150.


Or the hash table implementation should be resistant to collisions, falling back to another data structure in that case (as described below, using trees instead of lists in buckets with sufficiently many occupants).


CVE-2012-5373 exists for Java, and it was not fixed because it was not considered a risk worth taking care of.


The thing is, even its use just in collections can lead to DoS, if the attacker can control the string contents and selectively choose strings such that your hash table stops being a hash table and turns into a linked list.


That’s clear, and it is not a reason to have it in general-purpose collections or String::hashCode. If your app is vulnerable to this sort of attack, just use a wrapper for keys and a specialized collection (you may want to limit its maximum size too).
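
A minimal sketch of the size-cap part, using LinkedHashMap's removeEldestEntry hook (the cap limits memory growth; the wrapper-key part would additionally delegate hashCode to a keyed hash like the one sketched further up):

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Bounded map: once the cap is exceeded, the eldest entry is evicted,
    // so attacker-supplied keys cannot grow the structure without limit.
    class BoundedMap<K, V> extends LinkedHashMap<K, V> {
        private static final int MAX_ENTRIES = 10_000;   // illustrative cap

        @Override
        protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            return size() > MAX_ENTRIES;
        }
    }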


This sounds a great deal like all the C++ devs who say "well just don't write code with buffer overflow bugs then".

The reason why this kind of thing should be the default is that it's unreasonable to expect this level of understanding from your average coder, yet most software is written by the latter. That's why PL and framework design has been moving towards safety-by-default for quite some time now - because nothing else works, as experience has proven.


You are misunderstanding and exaggerating this particular risk, and assuming there’s only one way to handle it. First of all, it concerns unchecked user inputs: it is not enough just to fix the hash code, because HashMap is not great for storing such inputs regardless of hash algorithm. The hash may be OK, but HashMap has no size constraints, so there may still be a vulnerability to a DoS attack that exhausts server memory. Developers must always validate inputs as early as possible. Even the dumb ones.

Second, this risk was reliably mitigated in Java as soon as it was discovered. Just because hash collisions may exist doesn’t mean they are exploitable. The CVE for the JDK was not fixed because it was taken care of elsewhere - in Tomcat etc., where meaningful validation could take place.

Context matters.


And yet literally everybody else started doing hash randomization. Why?


The vulnerability was reported and addressed in a reasonable way. Why should other platforms matter here? Do you believe they were smarter?


Java uses a tree instead of a linked list for collided items, so the search performance degrades more gracefully (O(log N) instead of O(N)).


To be more precise, Java initially uses a linked list for nodes within a bin. If the number of items inside the bin crosses TREEIFY_THRESHOLD (which is 8), then that specific bin is converted into a red-black (RB) tree.

This is detailed in the implementation notes comment here: https://github.com/openjdk/jdk/blob/56468c42bef8524e53a929dc...
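
For reference, the related thresholds in the OpenJDK HashMap source (values as of recent JDKs):

    static final int TREEIFY_THRESHOLD = 8;     // bin becomes a tree above this many entries
    static final int UNTREEIFY_THRESHOLD = 6;   // tree shrinks back to a list below this on resize
    static final int MIN_TREEIFY_CAPACITY = 64; // smaller tables resize instead of treeifying a bin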


Depends on which Map / Set implementation you use. There are multiple, each with different and interesting properties.


No, it would break the semantics of equals and hashCode, which dictate that two objects that are equal must also have the same hash code. So hash codes for objects must be deterministic, which in turn is an important property for sets, hash tables, etc. It's unlikely to ever be changed because it would break an enormous amount of stuff.

For things that need to be secure, there are dedicated libraries, standard APIs, etc. that you probably should be using. For everything else, this is pretty much a non-issue that just isn't worth ripping up this contract for. It's not much of an issue in practice and is easily mitigated by picking things that are intended for whatever it is you are trying to do.


Languages that choose to fix this problem at the hash code level have hash codes that are still deterministic within a given program execution. They are just not deterministic between executions; cf.

https://docs.python.org/3/reference/datamodel.html#object.__...


Nothing stops the JVM from caching hashes even if the hashes are unique per process invocation.


They aren’t, and it’s quite unfortunate that the empty string hashes to 0, so it has to be recomputed every time - although presumably the hash of the empty string is quick to compute.
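
For context, the JDK 8-era caching logic is roughly the following; 0 doubles as the "not yet computed" sentinel, and newer JDKs add a separate hashIsZero flag so that strings whose hash is genuinely 0 are also computed only once:

    // Roughly JDK 8's String.hashCode: the result is cached in the `hash` field,
    // with 0 doubling as the "not yet computed" sentinel.
    public int hashCode() {
        int h = hash;
        if (h == 0 && value.length > 0) {   // `value` is the backing char[] in JDK 8
            for (char c : value) {
                h = 31 * h + c;
            }
            hash = h;
        }
        return h;
    }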



