Fixing aggregation over relationships with Sequelize


17 July 2015, by

I’ve been setting up Sequelize recently for a new Softwire Node.js project that’s starting up. As part of the initial work we wanted to investigate Sequelize (the go-to Node SQL ORM) in a little depth, to make sure it could neatly handle some on the trickier operations we wanted to perform, without us having to fall back to raw SQL all the time.

Most of these came out very easily in the wash, but one was trickier and needed investigation and upstream patches, and I’d like to take a closer look at that in this post. The challenging case: aggregating values across a relationship (i.e. SUM over a column from a JOIN).

Some Background

The project is sadly confidential, but the core operation has an easy equivalent in the classic Blog model of Posts and Comments. We have lots of Posts on our blog, and each Post has 0 or more Comments. For the purposes of this example, Comments can have some number of likes. Defining this model in Sequelize looks something like:

var Sequelize = require('sequelize');
var db = require('./db-connection');

var Post = db.define('Post', {
    publishDate: { type: Sequelize.DATEONLY }
});
var Comment = db.define('Comment', {
    likes: { type: Sequelize.INTEGER }
});

Post.hasMany(Comment);

With this model in place, Sequelize makes it easy for us to do some basic querying operations, using its Promise-based API:

db.sync().then(function () {
    return Post.findAll({
        // 'Include' joins the Post table with the Comments table, to load both together
        include: [ { model: Comment } ]
    });
}).then(function (results) {
    // Logs all Posts, each with a 'Comments' field containing a nested array of their related Comments
    console.log(JSON.stringify(results));
}).catch(console.error);

The Problem

Given this Post/Comment model, we want to get the total number of likes across all comments for matched set of articles (‘how many likes did we get in total for this month’s articles?’). A great result would be a SQL query like:

SELECT SUM(comment.likes) AS totalLikes
FROM dbo.Posts AS post
LEFT OUTER JOIN dbo.Comments AS comment ON post.id = comment.postId
WHERE post.publishDate >= '2015-05-01'

Sequelize in principle supports queries like this. It allows an ‘include’ option (in the example above), a ‘where’ option (for filtering) and an ‘attributes’ option (specifying the fields to return). There is also Sequelize.fn, to call SQL functions as part of expressions (such as the attributes we want returned). Combining all of these together suggests we can build the above with something like:

db.sync().then(function () {
    return Post.findAll({
        // Join with 'Comment', but don't actually return the comments
        include: [ { model: Comment, attributes: [] } ],
        // Return SUM(Comment.likes) as the only attribute
        attributes: [[db.fn('SUM', db.col('Comments.likes')), 'totalLikes']],
        // Filter on publishDate
        where: { publishDate: { gte: new Date('2015-05-01') } }
    });
}).then(function (result) {
    console.log(JSON.stringify(result));
}).catch(console.error);

Sadly this doesn’t work. It instead prints “Column ‘Posts.id’ is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause”, because the generated SQL looks like:

SELECT Post.id, SUM(Comments.likes) AS totalLikes
FROM Posts AS Post
LEFT OUTER JOIN Comments AS Comments ON Post.id = Comments.PostId
WHERE Post.publishDate >= '2015-05-01 00:00:00.000 +00:00';

This is wrong! This attempts to load the total aggregate result, but also to load it along with all the Post ids involved, which isn’t really meaningful in SQL, unfortunately. If we group by Post.id this will work (and that is possible in Sequelize), but in reality there are a large number of Posts here, and we’d just like a single total, not a total per-Post that we have to load and add up later ourselves.

Making this Work

Unfortunately it turns out that there is no easy way to do this in the Sequelize, without getting involved in the internals. Fortunately it’s open-source, so we can do exactly that.

The real problem here is that the ‘attributes’ array we provide isn’t being honoured, and Posts.id is being added to it. After a quick bit of analysis tracing back where ‘attributes’ get changed, it turns out the cause of this is inside findAll, in Sequelize’s model.js. Take a look at the specific code in lib/model.js lines 1176-1187. This code ensures that if you ever use an ‘include’ (JOIN), you must always return the primary key in your results, even if you explicitly set ‘attributes’ to not do that. Not helpful.

The reason this exists is to ensure that Sequelize can internally interpret these results when building models from them, to reliably deduplicate when the same post comes back twice with two different joined comments, for example. That’s not something we need here though, as we’re just trying to load an aggregate and we don’t want populated ‘Post’ models back from this, and it causes a fairly annoying problem (for us and various others). There is a ‘raw’ option that disables building a model from these results, but that sadly doesn’t make any differences to the behaviour here.

In the short-term, Sequelize has ‘hooks’ functionality that lets you tie your own code into its query generation. Using that, we can put together a very simple workaround by changing our connection setup code to look something like the below (and this is what we’ve done, for the very short-term).

function resetAttributes(options) {
    if (options.originalAttributes !== undefined) {
        options.attributes = options.originalAttributes;
        if (options.include) {
            options.include.forEach(resetAttributes);
        }
    }
}

var db = new Sequelize(db, username, password, {
    "hooks": {
        "beforeFindAfterOptions": function (options) {
            if (options.raw) resetAttributes(options);
        }
    }
}

If you’re in this situation right now, the above will fix it. It changes query generation to drop all ‘attributes’ overrides if ‘raw’ is set on the query, solving this issue, so that running the aggregation query above with ‘raw: true’ then works. Magic.

Solving this Permanently

That’s all very well for now, but it feels like a bit of a hack, and this behaviour seems like something that’s not desirable for Sequelize generally anyway.

Fortunately, we’ve now fixed it for you, in a pull request up at https://github.com/sequelize/sequelize/pull/4029.

This PR solves this issue properly, updating the internals of model.js to not change the specified attributes if it’s not necessary (if ‘raw’ is set to true) both for this case (attributes on the query root), and the include case (the attributes of your JOINed models). That PR’s recently been merged, solving this issue long-term in a cleaner way, and should be available from the next Sequelize release.

Once that’s in place, this blog post becomes much shorter: if you want to aggregate over an include in Sequelize, add ‘raw: true’ to your query. Phew!

Tags: , , , , ,

Categories: Technical

«
»

Leave a Reply

* Mandatory fields


six + = 15

Submit Comment