
A love letter to flame graphs: or why the perfect visualisation makes the problem obvious

We recently took over development of a web application from another supplier. Amongst other things we needed to fix its performance problems and get it live on a tight deadline.

The application in question was a website and Python API, which needed to be capable of handling a high burst load even though normal baseline load would be quite low. The intention was to use CPU-based autoscaling to scale up as necessary.

We started some exploratory load tests early, but the results were not promising: even with a low number of users, requests were timing out on the API side. However, we didn’t see any issues with the downstream requests made by our API, and CPU utilisation remained extremely low. With low utilisation, our CPU-based autoscaling was never going to kick in either. Where to look first? If we couldn’t get to the bottom of this, the client couldn’t take their web application live.

We did solve it: we hit the deadline and the site went live. And in the end, it was (almost) a one-line code change. But let me explain how we got there.

Opening a window into your application  

If requests are timing out but CPU utilisation is still low and there are no problems with downstream requests, then it suggests that requests are not making good progress inside the application. Flame graphs are a great way to visualise what an application is up to and why this might be the case.

For the uninitiated, a flame graph is a way to visualise stack traces of profiled software to identify “the most frequent code-paths”. They are typically used to show you where the CPU is spending all of its time while running your program. In our case, with low CPU utilisation, we also want to show what our threads are doing when they are not running on the CPU.

py-spy is a wonderful little Python profiler: it attaches to a running Python process and then periodically takes a snapshot of the full stack traces of its processes, threads[1], native functions and Python functions, producing a flame graph showing where the CPU is spending its time. There’s more than one way to capture a flame graph in Python, but I’d used py-spy to great effect on a previous project, so I knew it would be a good choice[2].
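
If you haven’t used it before, the invocation is pleasingly small. Here is a rough sketch of the sort of commands involved (the PID and filename are placeholders rather than our real values):

    py-spy record --pid 1234 -o flame.svg                  # sample the running process into a flame graph
    py-spy record --pid 1234 -o flame.svg --idle --native  # also sample idle threads and native frames

The second form, give or take, is what produces the “all threads” view below.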

A flame graph is worth a thousand words

Let’s look at some flame graphs showing running threads and idle threads. Our program is single process but multithreaded, so there is no need to include multiple processes (more on this later). Here are some graphs showing non-idle threads and all threads respectively[3]:

Figure 1: A flame graph showing non-idle threads

Figure 2: A flame graph showing all threads

In these graphs, stack traces grow downwards from the top of the graph, and a wider bar means that more time has been spent in this function call.

So, what’s going on here? We’re spending all of our idle time and a surprising amount of our non-idle time waiting for a threading.RLock in asgiref.local:100, followed by PyThread_acquire_lock_timed[4], and so on into glibc. If we follow these calls back up the stack traces, we can see we’re spending a lot of time waiting to get access to various shared resources protected by locks, such as a pool of database connections shared amongst multiple threads.
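
To make that pattern concrete, here is a deliberately contrived sketch (not the application’s real code) of the shape of problem the graphs are showing: many worker threads serialising on a single lock around a shared resource, so most of their wall-clock time goes on acquiring the lock rather than doing useful work.

    import threading
    import time

    # Stand-in for the lock guarding a shared resource, e.g. a pool of database connections.
    pool_lock = threading.RLock()

    def handle_request():
        # Every worker queues up here: only one thread holds the lock at a time, so the
        # rest show up in a flame graph as time spent inside the lock acquire.
        with pool_lock:
            time.sleep(0.05)  # pretend to borrow a connection and do something with it

    workers = [threading.Thread(target=handle_request) for _ in range(32)]
    for t in workers:
        t.start()
    for t in workers:
        t.join()

Profile something like this with idle threads included and you get the same picture as our graphs: wide bars of lock acquisition sitting underneath whichever function tried to take the lock.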

Gunicorn to the rescue

Tempting as it is to hang around and work out exactly what’s going on here, in practice this is enough to know where to start. The server in this case is Waitress, a production-quality pure-Python WSGI server with “very acceptable performance”[5].

Waitress uses a single main thread and multiple worker threads (you can see this on the flame graphs, actually – the main thread has waitress-serve near the top, and the worker threads have handler_thread). The main thread is responsible for handling the connections, while the actual work is passed over to the worker threads. You can see that a lot of the time on the main thread is spent waiting for network calls, exactly as you’d expect[6]. But in our case the worker threads are spending all of their time fighting over locks, and not a lot of actually useful work is getting done.

We switched Waitress out for Gunicorn, which is a popular Python server[7] with a pre-fork[8] worker model: instead of a main thread, there is a main process with several worker processes. Additionally, if you look at their architecture docs you’ll see you can choose to run regular synchronous code, asyncio, gthread, eventlet, or all sorts of other things inside the worker processes themselves. Explaining the difference between all of these is another article on its own; let’s just say they are different ways to run code concurrently in a Python process. Knowing a bit about the kind of load we were expecting – including that the majority of time would be spent waiting for downstream network requests – we went for eventlet, which is a green-threading implementation[9].
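
This, incidentally, is why the fix could end up as (almost) a one-line change: the worker type is a single setting in Gunicorn’s configuration file, which is itself plain Python. A minimal sketch (the module path and the numbers are illustrative, not our production values):

    # gunicorn.conf.py: a minimal sketch; values are illustrative, not our production settings
    bind = "0.0.0.0:8000"
    worker_class = "eventlet"    # use eventlet green-thread workers (requires the eventlet package)
    workers = 4                  # one OS process, and therefore one GIL, per worker
    worker_connections = 1000    # upper bound on concurrent green threads per worker

    # Started with something like:
    #   gunicorn -c gunicorn.conf.py myapp.wsgi:application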

After we switched, we saw an immediate performance boost and the utilisation jumped right up too. Unfortunately, with green threads you don’t get a very pretty picture because the stack traces change so often as the application context switches, so I can’t show you a beautiful flame graph showing only productive work being done!  You’ll just have to take my word for it[10].

Conclusion: Start with the Golden Signals and be prepared to instrument.

The four so-called golden signals of monitoring are latency, traffic, errors and saturation, made famous by Google in their book on Site Reliability Engineering. For One Normal Web App, these very often give you a good indication of where to start looking. In our case, as traffic increased, timeout errors were increasing from our API without any problems in our downstream APIs, yet the saturation (CPU) was still low: this is a good indication that the problem was inside the API itself. Something was preventing our code from making good progress.

Once you’ve got an idea of where to look, it’s important to be able to get the right instrumentation into your app quickly, or better still to have it already available to switch on when needed. Some cloud providers already support this: I’ve seen GCP’s Dataflow profiler used to great effect on another of our projects at Softwire.


[1] By default it excludes threads which are idle, but we can disable that.

[2] I’m personally excited to try out the new support for the perf tool in Python 3.12, but perhaps that will be a later article.

[3] Where it says “[our code]”, it’s because I’ve rewritten the stack trace manually.

[4] If you’re eagle eyed you might notice that there is in fact more than one implementation of this function behind a compile time flag. The one I’ve linked there is the one we’re using on a modern-day Linux box.

[5] I’m not throwing shade, that’s what it says in their docs.

[6] In Waitress, the main thread is responsible for all incoming network connections, so you’d expect it to spend a lot of time waiting for activity on these connections (select_poll etc) as well as reading and writing to them (sock_recv etc). Interestingly, it’s also spending a lot of time in logging related code, but let’s not get into that for now!

[7] Technically a WSGI or ASGI server, but let’s not get bogged down by confusing acronyms.

[8] It’s pre-fork because the workers are forked in advance when the program starts up, rather than a new process being forked for every incoming request.

[9] Green threads are lightweight threads scheduled and managed by the application/runtime and not the OS. They are typically backed by a much smaller number of OS threads; often just the one. If you’d like to know more about when to use these (or not use these) then the Gunicorn docs are a good place to start.

[10] You might have been expecting me to talk about the Global Interpreter Lock (GIL) in the last section: very few articles about Python and threading performance are complete without a discussion of the GIL! As the name suggests, this is a global lock which means that only one thread may interact with the interpreter state at any time. If a program is heavily CPU-bound, in a way which requires a lot of interactions with the interpreter state, then the GIL can become a real bottleneck. (Ok, this is a bit of a simplification: there can be all sorts of strange interactions with the Linux kernel scheduler too impacting non-CPU bound threads; you can read more at https://wiki.python.org/moin/GlobalInterpreterLock or on a hundred and one other articles on the internet). This was not the case for us, but we did have it in mind when we chose Gunicorn and the eventlet worker type: using multiple processes means you get a GIL per process, and inside each process the green threads will not block on the GIL because only one of them will ever be running at once in its “parent” Python thread.
