An actual Java bug: Investigating a JVM crash on multi-threaded string access


8 July 2012, by

When your program doesn’t work as you expected, it is often tempting to blame a bug in the framework. Over the years, I have come to regard these claims with extreme suspicion. I have seen apparent “framework” bugs turn out to be programming mistakes hundreds of times. However, when the JVM crashed in some java-only string manipulation code (or gave different answers on different runs with the same input), I realised I had stumbled upon a real Java bug.

Background

One of our teams was getting ready to release their new Java data layer service to update one component of a large high-traffic live website. Everything had been tested multiple times, with unit tests and user testing. They had released the current version to the staging environment (a fairly realistic copy of the live server farm) and asked the customer’s ops team to start load testing.

After the testers had hammered the service with concurrent connections for a while, it gave a couple of very strange error stack traces and then stopped responding altogether to some requests, and giving 404 errors for other requests (on URLs which mapped to service methods which definitely exist, and had been working previously).

Investigation

When I picked up the investigation, there had been some investigation into the 404s and loss of service without much luck. I decided to start by looking closely at the very first error to come up — in complex frameworks like Java Spring once things start to go wrong, the various loosely coupled bits can fall apart and give very strange errors. Often the very first error is the only one worth looking at, and all subsequent errors are just noise.

The first error was:

java.lang.ArrayIndexOutOfBoundsException: -14680060
	at com.customer.data.EncodedStringToIntConverter.decode(EncodedStringToIntConverter.java:49)

For the following method:

public class EncodedStringToIntConverter {
    private static final char[] DIGITS = "poebthdzmun98x472".toCharArray();
    private static final int BASE = DIGITS.length;

    // ...

    public long decode(String encoded) {
	// ...
        char[] input = encoded.toCharArray();

        long output = 0;
        int ordinal = 0;
        int digit = 0;

        for (char inputChar : input) {
            int salt = digit + ordinal;
            digit = ArrayUtils.indexOf(DIGITS, inputChar);
            int cardinal = digit - salt;

            // Undo '% BASE' - possible because we know the
            // cardinal value itself between 0 and BASE
            if (digit < salt) {
                cardinal += BASE;
            }

            output += (cardinal * Math.pow(BASE, ordinal));

            ordinal++;
        }

        return output;
    }
}

This method is used to de-obfuscate IDs used in some URLs. The team wanted to use incrementing ids, but didn’t want the URLs to be obviously incremental integers (as it might encourage people to navigate by editing the URL), so they obfuscated them by converting the IDs to strings in base 26 using a scrambled set of digits. When requesting an entity by obfuscated ID, the system needs to convert the string back into a number before doing a DB lookup. (See “Finding a workaround” below for alternatives to this approach.)

Now the above code was throwing ArrayIndexOutOfBoundsException when under heavy load.

The first thing I did was carefully read through the code and look for anything which might explain this. Everything looks fine to me: there is no mutable state, so this method should give the same answer every time it is called, and it should not be possible to cause an ArrayIndexOutOfBoundsException since it uses a “foreach” loop.

So I opened up the code at the same version that was deployed to the test site and ran the unit tests which covered that class. It appeared to work fine. The evidence seemed to suggest that this code was not threadsafe in some way (even though it looks like it must be thread safe), so I wrote a test which stressed the code as much as I could to see if it would break. My test started 20 separate threads, each of which ran

assertThat(converter.decode("42urp"), equalTo(2L));

in a tight loop 1000 times.

Suddenly, very strange things started happening. The test would sometimes fail by giving the wrong answer (even though the method is completely deterministic and really ought to give the same answer each time), and sometimes the JVM would crash and write out a dump file (with instructions on how to submit a bug report).

The test would reliably run fine with 2 threads, but would either give the wrong answer or completely crash (about 50:50 between the two) when run with 20 threads.

Isolating the cause and reporting the bug

Now I had confirmed that there was a JVM bug, there were two tasks: 1) isolate the cause and prepare a bug report for Oracle/Sun, to see if they would fix it or offer a workaround and 2) figure out some way to work around the bug for now, so we could still release our update to the site.

Isolating the cause was quite straightforward now that I had a unit test which reproduced the issue reliably. I just had to try removing as much as possible of the code in the above method without fixing the problem. After a little while, I found that the bug seemed to need all of the code above to reproduce the issue.

I still needed to anonymise the code a bit to protect the customer’s super-secret ID obfuscation scheme, so I removed the package name, changed the class name and changed the letters in the DIGITS char array.

After checking that the slimmed down version still caused the bug, I filled out the (very comprehensive) bug report form on Oracle’s website, and am now the proud owner of bug number 7159299 in the JVM.

Finding a workaround

That was all very interesting, but the customer still wants their new website, and they don’t want it to crash every time more than two people visit the same page.

There are a few major refactorings which would have avoided this bug. For example, the best way to have unguessable IDs on entities is not to use an ID obfuscation scheme which relies on security-through-obscurity, but to generate random public IDs on each entity and to store them in the database, either as the primary key on the table or as a secondary indexed lookup column in addition to the numeric IDs. However, that was too large a change at this point in the app’s lifecycle.

So I tried a few minor refactorings of the converter class to see if I could keep the same semantics while avoiding this bug. This was quite a strange exercise — it is very unusual to be programming a non-deterministic system without knowing the exact rules you can follow!

After a little while, I found that converting the DIGITS array from static to non-static seemed to fix the issue on my machine. I committed this change along with a comment explaining why this apparently constant field needs to be non-static.

(Later, Harry found that change wasn’t enough to prevent the issue on the linux servers, but that converting the “int” values to use “long” was sufficient to avoid the bug.)

Conclusions

Oracle have taken no action on the bug since I raised it several weeks ago, other than to rate it as “Priority: 4-Low”. In any case, the Java release cycle is quite slow, so a fix wouldn’t be available for the Java on our servers for months. Raising the bug was quite satisfying though.

I was quite impressed with how the JVM handled the crash. It wrote a very detailed crash dump (see the bug) with what looks like plenty of information for the Java devs to be able to reproduce the issue. The bug submission form asked excellent questions of me. Now if only Oracle would assign some dev resource to look at the issue…

It is probably not worth speculating too much about what is wrong in the JVM to cause this bug. Clearly some aspect of iterating over characters in strings is not quite thread safe. Once the bug is triggered, the whole JVM can crash, or it can start giving incorrect results in unrelated code. On that basis, it seems to me that they must have made an incorrect threading assumption in some fairly low-level string handling code which could result in corrupted stack data or JVM state.

This serves as another reminder that concurrent programming is hard, and that unit tests are very helpful :-)

Tags: ,

Categories: Technical

«
»

One Response to “An actual Java bug: Investigating a JVM crash on multi-threaded string access”

  1. Richard Bradley says:

    Since this was written, Oracle have hidden the Java bug, probably for “security reasons”. The bug appears to be fixed in the latest version of Java 6. I think we were using Java 1.6.0_31 at the time, and the bug seems to be fixed some time between that version and Java 1.6.0_45.


Leave a Reply

* Mandatory fields


+ 1 = two

Submit Comment