DataSet performance in .NET web applications
4 August 2011, by Matthew Richards
A DataSet is (to quote Microsoft) an in-memory cache of data. Essentially it’s a snapshot of part of your database. DataSets can be very useful, but they have some significant annoyances too. This article sets out a few of the key problems, and some possible solutions to attempt, based on extensive experience from building and (more significantly!) maintaining our Reservwire travel reservation product.
DataSet performance is, for us, a perennial bugbear. It’s not really that surprising that a DataSet performs relatively slowly (and here I mean slowly compared to a hand-crafted object containing the same data):
- DataSets can have their schema changed at runtime as well as their data, so inevitably they store their data internally in a slightly harder-to-handle format than a custom object – you’d expect it to take a little longer to retrieve or update the data, although not much longer.
- DataSets provide a deceptively large amount of functionality, all of which incurs cost. For example, they keep track of multiple versions of their data (so you can see what’s changed) – updating a value goes from a simple operation to one that may involve storing a snapshot of a whole DataRow. DataSets support constraints, foreign keys, and other features all of which mean additional processing every time you do some work.
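The version tracking described above is easy to see in action. The following is a minimal sketch (the table and column names are illustrative, not from the article): a single value assignment causes the DataSet to snapshot the original row, which is exactly the hidden cost being described.

```csharp
using System;
using System.Data;

class RowVersionDemo
{
    static void Main()
    {
        var table = new DataTable("Booking");
        table.Columns.Add("Price", typeof(decimal));
        table.Rows.Add(100m);
        table.AcceptChanges();             // baseline: Original == Current

        table.Rows[0]["Price"] = 120m;     // this one assignment also stores
                                           // a snapshot of the original row

        Console.WriteLine(table.Rows[0]["Price", DataRowVersion.Original]); // 100
        Console.WriteLine(table.Rows[0]["Price", DataRowVersion.Current]);  // 120
    }
}
```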
DataSet performance isn’t always an issue. There are many cases where you won’t care too much:
- If your code is not in itself performance critical, even a ten-fold slowdown in data access / update performance may not be an issue. If an operation takes 0.1s instead of 0.01s, nobody will notice.
- If you’re doing very little with your data, you’re unlikely to care too much. If your only interaction with the DataSet is to call a SqlXml query to fill it with data, and then pick out a few values, the time taken is tiny anyway.
However if you have performance critical code, performing large numbers of update operations on your DataSet before you retrieve any values from it, performance is a problem. And it can be a big problem.
DataSets use up a lot of memory. A recent experiment on some Reservwire data revealed that DataSets were taking up fifty times as much memory as a POCO (Plain Old C# Object) equivalent. That is undoubtedly an extreme case, and a function of the particular set of data being stored in that experiment, but it’s indicative of a genuine problem with DataSets. Why are DataSets so large?
- A significant issue is the schema of the DataSet. Because DataSets are a dynamic data structure and can have their structure changed at runtime, there’s a lot of information in there simply to record that structure.
- Because DataSets remember history, you may well end up storing multiple versions of the same data in memory. That’s fair enough if you need the data, but if it’s overkill for you then you’ve at least doubled your memory usage to no benefit.
Using up memory may not be an issue – memory is, after all, quite cheap. But if you’re using up a lot of memory then things start getting more problematic – with an effective per-process memory limit of about 1.2GB (for a 32-bit process), really large DataSets can eat up your allowance and leave you with Out of Memory exceptions.
The problem is amplified if you’re passing DataSets between processes, because in serialising the DataSet to marshal it across the process boundary you end up using another chunk of memory. Which is where the final key problem comes in…
DataSets don’t serialise cheaply. There are two key scenarios in which we’ve had experience of DataSet serialisation:
- Storing DataSets on disk, for caching purposes.
- Passing DataSets between processes (in a multi-tier web application).
The key issues we’ve faced over time have been:
- The size of the serialised DataSets. They take up a lot of memory, but they can take up even more disk space, because (in .NET 1.1, and by default in later versions too) they store their complete schema in XML format as well as all their data. The size of the serialised DataSets can also cause memory fragmentation – this is always liable to occur if you’re allocating large numbers of large blocks of memory, interspersed with smaller allocations. And memory fragmentation leads to OutOfMemoryExceptions, and application instability.
- Restrictions on serialisation. You can’t serialise a DataRow – which makes sense, because it’s tightly coupled to the rest of its DataSet, but means lots of extra code to handle passing the entire DataSet across a process boundary when in truth you could do without much of that data.
- Assembly versioning problems. If you serialise a .NET object, you can use a BinaryFormatter and ask to avoid including assembly version information with the serialised object – hence provided you don’t change your interface, you can deserialise the data even if you release a new version of your code in the meantime. Does this work with DataSets? Of course not – they manage to inject version-specific information into the serialisation stream anyway.
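One way to soften the DataRow restriction above is to copy just the rows you need into a fresh, schema-only DataTable – a DataTable, unlike a DataRow, is serialisable on its own. A sketch (the filter and method names here are illustrative):

```csharp
using System.Data;

static class DataRowExtraction
{
    // Copy matching rows into a new table with the same schema,
    // then serialise/marshal the small table instead of the whole DataSet.
    public static DataTable ExtractRows(DataTable source, string filter)
    {
        DataTable subset = source.Clone();   // schema only, no data
        foreach (DataRow row in source.Select(filter))
            subset.ImportRow(row);           // copies values and row state
        return subset;
    }
}
```

This still pays the serialisation cost for the rows you keep, but avoids shipping the entire DataSet across the boundary.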
For a start, you should think about whether DataSets are the right choice. They may be – DataSets are powerful tools and you can get set up with them quickly. But don’t use them blindly.
Even if you think you’re not going to use DataSets in general, you might do it without thinking. A common ADO.NET database access pattern is to use a DataAdapter to call a stored procedure and write the results into a DataSet. This is very useful if you actually want your data to be in a DataSet, but what about when you don’t? You might well choose to create the DataSet anyway, and then just copy data out of it into whatever structure you really want – but this is not as good an idea as it sounds:
- It’s slow, because you have all the DataSet creation and access overhead
- It uses a lot of memory, because DataSets are memory-hungry
As an alternative, consider using DataReaders whenever possible. These will stream the data back to you on demand, meaning that you can avoid storing the entire result set in memory. Even if you do want all the results in memory, don’t incur the overhead of creating an extra copy of all that data in a DataSet first.
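A minimal sketch of the DataReader approach, streaming rows straight into the structure you actually want (the connection string, SQL, and list of names are illustrative):

```csharp
using System.Collections.Generic;
using System.Data.SqlClient;

static class CustomerQueries
{
    // Read results one row at a time - no intermediate DataSet is created.
    public static List<string> LoadCustomerNames(string connectionString)
    {
        var names = new List<string>();
        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand("SELECT Name FROM Customer", connection))
        {
            connection.Open();
            using (SqlDataReader reader = command.ExecuteReader())
            {
                while (reader.Read())          // one row in memory at a time
                    names.Add(reader.GetString(0));
            }
        }
        return names;
    }
}
```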
Big DataSets cause problems. Some of these problems are obvious, but others less so:
- If you pass large amounts of information round in your DataSets, you’d expect that to be slower and more memory-hungry than if you pass around small amounts. That should be obvious. So good design suggests you should restrict data to the minimum amount you need.
- Think about the size of the schema, not just the data. A DataSet with a big schema may use considerably more space storing the schema than storing the data! So if you have a single DataSet schema for your whole database, and think that by only loading a small amount of data into it at once you’ve avoided problems, think again. Prefer many small DataSets to one large one.
- If you serialise a large DataSet (either large in schema or large in data – or both!), a large chunk of memory will need to be allocated. The larger this chunk of memory the more likely you are to find that it causes you to run into memory fragmentation issues and OutOfMemoryExceptions.
So – a DataSet with a small schema, containing a small amount of data, is your friend.
DataSets perform slowly compared to POCOs because they provide more functionality. But not all that functionality is actually mandatory! You can turn some of it off, perhaps just for a time, and potentially gain some significant speed boosts. Here are some suggestions:
- Turn off constraint checking – set EnforceConstraints = false on your DataSet. This prevents the DataSet from checking for things like non-nullable columns and foreign key violations; if you don’t need that functionality, why slow things down by leaving it turned on?
- Notify the DataSet you’re going to make several changes in a row – call BeginEdit() on your DataRow, then EndEdit() when you’re done editing data. Each time you change a value in a DataRow, the framework does work to store the original version of the row, update its internal data structures etc. Calling BeginEdit lets it do that once, up-front, and so you save considerable processing time when you make the individual field changes. EndEdit does the necessary tidying-up afterwards.
- Tell the DataSet before you bulk load data – call BeginLoadData() on your DataTable, then EndLoadData() when you’re done. This covers all the above and more – it turns off all the DataSet’s internal processing for the duration. You may find it useful to write an extension method on DataSet which does this for all tables in the DataSet, if your bulk load operations affect many tables at once (which they would normally do).
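The switches above can be combined; here is one sketch of the extension-method idea from the final bullet (the method name and delegate shape are my own, not a standard API):

```csharp
using System;
using System.Data;

static class DataSetBulkLoadExtensions
{
    // Wrap a bulk update: suspend constraint checking and internal
    // processing for every table, then restore them afterwards.
    public static void BulkLoad(this DataSet dataSet, Action<DataSet> load)
    {
        bool oldEnforce = dataSet.EnforceConstraints;
        dataSet.EnforceConstraints = false;     // skip constraint checks
        foreach (DataTable table in dataSet.Tables)
            table.BeginLoadData();              // suspend internal processing
        try
        {
            load(dataSet);                      // caller makes the bulk changes
        }
        finally
        {
            foreach (DataTable table in dataSet.Tables)
                table.EndLoadData();            // rebuild indexes, re-validate
            dataSet.EnforceConstraints = oldEnforce;
        }
    }
}
```

Usage would then be `myDataSet.BulkLoad(ds => FillFromFeed(ds));` – the try/finally ensures constraint checking is switched back on even if the load throws.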
Starting in .NET 2.0, you can serialise your DataSet in binary format rather than XML. Do it! Binary-serialised DataSets are much smaller than XML-serialised ones, saving memory and reducing fragmentation. Set RemotingFormat = SerializationFormat.Binary today.
Of course, there are benefits to XML serialisation in some applications – XML facilitates data interchange between systems. But if you’re only using DataSets within your own contained .NET environment, there’s little reason not to switch.
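The switch itself is one line; a minimal serialisation sketch looks like this:

```csharp
using System.Data;
using System.IO;
using System.Runtime.Serialization.Formatters.Binary;

static class DataSetSerialization
{
    public static byte[] SerializeBinary(DataSet dataSet)
    {
        dataSet.RemotingFormat = SerializationFormat.Binary; // the one-line switch
        using (var stream = new MemoryStream())
        {
            new BinaryFormatter().Serialize(stream, dataSet);
            return stream.ToArray();   // typically far smaller than the XML form
        }
    }
}
```

Without setting RemotingFormat, the same BinaryFormatter call would still embed the schema and data as XML inside the binary stream – which is why the default is so large.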
Although this is the last in my list of solutions, it’s one you should consider very early on in your design process because fundamental architectural decisions will influence how much DataSet serialisation goes on in your application. The biggest item to watch out for is passing DataSets between processes – for example in a multi-tier web application, you might return a DataSet from your business logic tier to your application tier.
Cross-process communication may be essential to your architecture, but it is intrinsically slower than working entirely in-process, and DataSets are particularly good at slowing you down – the cost of serialising a DataSet, and then deserialising it again at the other end, is significant. Moreover, as discussed above, they take a lot of memory to serialise and this can exacerbate any memory usage problems in your application. If you’re thinking about passing a DataSet between two processes, think very carefully before committing yourself.
If you already have an application that suffers from cross-process DataSet serialisation problems, ask yourself:
- Can the process boundary be removed? While condensing two tiers of your application into one may ring architectural alarm bells, it’s possible it’ll be worth the trade-off. Check how big the cross-process performance penalty is by adding some trace statements into your serialisers and deserialisers.
- Can the DataSets be dispensed with? You’re not going to win any prizes by converting your DataSet into some other form simply to get across the process boundary – that’s serialisation by another name. But you might be able to replace the DataSet entirely or, perhaps more likely, identify some places where you’re passing too much data to your application tier and could instead do that processing in the business logic tier, passing only the results across.
You don’t have to get rid of all your DataSet serialisation – find the critical paths and the largest DataSets, and concentrate on them first.
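To find those critical paths, the trace-statement suggestion above might be sketched like this (the wrapper method is hypothetical, not part of any framework):

```csharp
using System.Data;
using System.Diagnostics;
using System.IO;
using System.Runtime.Serialization.Formatters.Binary;

static class SerializationTracing
{
    // Time each serialisation and record its size, so the biggest
    // offenders on the critical path can be identified.
    public static byte[] SerializeWithTrace(DataSet dataSet)
    {
        var stopwatch = Stopwatch.StartNew();
        using (var stream = new MemoryStream())
        {
            new BinaryFormatter().Serialize(stream, dataSet);
            stopwatch.Stop();
            Trace.WriteLine(string.Format("Serialised {0}: {1} bytes in {2} ms",
                dataSet.DataSetName, stream.Length, stopwatch.ElapsedMilliseconds));
            return stream.ToArray();
        }
    }
}
```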
This is not intended to be a complete list of every problem DataSets can cause, or an exhaustive set of solutions. But it’s a list of some pointers which, while you may find them more or less obvious, have been key factors in making DataSets painful for us and then in (to an extent) taming them. The key lessons to take away are:
- DataSets are slow because they are powerful. You can mitigate this performance problem, but a DataSet will always be slower than hand-crafted code.
- DataSets are big because they’re flexible. Bear this in mind when designing your application, because this can be a hard problem to work around once you’ve committed yourself. Keep them small – but they’ll still always be bigger than the equivalent hand-crafted code.