Memory Leak

cat scott > /dev/net
UPDATE  22-Jan-2013:
As Marcelo Cantos notes over on www.html5rocks.com, WebSQL performance in this experiment can be significantly improved through the use of an additional index (on the Episode table), and a slight modification to the SQL query (replacing the HAVING clause with a nested subquery).
I would strongly recommend reading Marcelo’s comments and bearing this in mind when weighing up the relative performance; as it changes the findings discussed below.

TL;DR
At first glance, the IndexedDB API appears to require a lot of extra code to accomplish tasks that could be done in a single query with WebSQL; as familiar concepts such as joins, aggregates etc., that we take for granted in the relational world, need to be handled manually. Such multiple steps also suggest that IndexedDB must perform worse than the equivalent WebSQL query. Surprisingly, IndexedDB performance stacks up well against WebSQL and is, in certain circumstances, actually faster.

Background

Since November 2010, when the W3C officially deprecated WebSQL, my intention has been to try out it’s successor, IndexedDB. In particular, I wanted to see how IndexedDB performs in comparison to a real-world WebSQL-based application.

Data Model
The data model for this exercise is a simple, three-tier, parent-child hierarchy based on television programs.
image 
A Program consists of zero, one or more Series, which in turn consist of zero, one or more Episodes. In relational terms, these are both one-to-many relationships from parent to child.

Sample Code

The source code for this exercise can be found on GitHub.

Note:

IndexedDB (at least in Chrome’s implementation) does not work when the page is accessed using the file:// protocol. To run this sample application, you will need to access it though a HTTP server.

A simple way to do this, if you have python installed, is to run

python -m SimpleHTTPServer 8000

from the source directory, then browse to http://localhost:8000/indexedDbTest.html

 Let’s quickly go over the basic code structure, before delving into the database stuff…

The main page is indexedDBTest.html, which references indexedDBTest.css for styling the results, and indexedDBTest.js file for running the sample app.

indexedDBTest.js consists simply of a jQuery ready function that instantiates the two data stores (more on these shortly), and listens for the “Go!” button to be clicked to run the tests.

A pair of booleans at the top of the script (initWebSqlDb & initIndexedDb) control the initialisation of the two data stores. For the initial setup, you will need to set these both to true, to create & populate the WebSQL database and IndexedDB object stores with sample data when the page loads. After this is done once, these can be switched back to false so that you’re not rebuilding the world on each page refresh.

Each data store is populated with an identical data set: 100 x programs, each containing 6 x series, each containing 11 x episodes; giving a total of 600 x series records, and 6600 x episode records. Not a huge amount of data by normal DB standards; but enough to compare the relative performance of each store. The findings of this exercise may not hold true for a larger data set, so bear that in mind. I’ve kept it small on the basis that most apps that store data locally in the client typically don’t store an enormous amount.

Data stores
The two remaining files are indexedDb.js and webSQL.js, which are wrappers for their respective data stores, and both have the following interface:
  • init() - initialises the data store (if required) and populates it with the test data
  • start() - used in conjunction with stop(), records the start time of an operation
  • stop() - calculates the duration since start() was called; and renders the operation results/duration.
  • countPrograms() - counts the number of Programs in the database (ie. 100)
  • countSeries() - counts the number of Series in the database (ie. 600)
  • countEpisodes() - counts the number of Episodes in the database (ie. 6600)
  • schedule() - a ‘real world’ query requiring joins, aggregates etc.  (more on this later)
Counting objects
One of my initial concerns with IndexedDB was how it would handle aggregates (counts, sums, averages etc.). Most of the IndexedDB tutorials/examples I found online seemed very simplistic, showing basic tasks such as putting an object into a store, getting an object out of a store, and using cursors to iterate over a set of objects; but there were no good examples of aggregating objects. Also, at that time, Chrome 17 had not yet implemented the IDBObjectStore.count() and IDBIndex.count() methods as per the W3C spec, so the only way to get a count of objects from an object store or index was through a cursor iteration.

To my ‘set-based’ SQL mind, the initial reaction to this was: computing aggregate values by iterating over a set of records/rows/objects is surely going to be slow.  Sure enough, counting the 6600 episodes (which took < 10ms in WebSQL with a simple SELECT COUNT(*) FROM Episodes) took > 900ms with an IndexedDB cursor.  Nearly 100 times slower.  WebSQL: 1, IndexedDB: 0.

Thankfully, I didn’t have to wait long for Chrome 18 to be released through the Dev channel, which has native count() implementations. After replacing the cursors with these, the time dropped from > 900ms to ~150ms. WebSQL’s ~8ms time still easily beats it, but clearly the native count() methods are a massive improvement over cursor iteration.

From a coding perspective, let’s compare the countPrograms() functions for both WebSQL and IndexedDB:
As you can see, there’s not much difference. I’d even argue that the IndexedDB version looks more succinct. So far, even with IndexedDB’s slightly worse count performance, I’m willing to call it a tie.

Joins
My other concern with IndexedDB was regarding relationships/joins. In the relational-world, when in comes to modelling database entities the rules are usually pretty clear cut (thanks to E. F. Codd’s theory of normalisation). Conversely, the schema-less nature of NoSQL databases means that there is often more than one way to model your data. It’s up to you, as a developer, to determine which model best suits your application.

For this exercise, there are three different approaches to modelling the data:
  1. Separate object stores for each of the three entities (corresponding to the three tables in the WebSQL database), i.e. a “Programs” object store, a “Series” object store, and an “Episodes” object store. This is a many simple objects design.
  2. A single object store for storing everything; and a “type” attribute on each object to distinguish between them. Programs, series & episodes would still be stored as separate objects, but all in the same object store.
  3. A single object store that stores only Program objects.  Within each program object would be an array holding it’s set Series objects, and within those nested Series objects would be an array containing it’s Episode objects. This nested object structure takes full advantage of IndexedDB’s ability to store complex objects; something that a relational database such as WebSQL cannot do. This is a few complex objects design, which works well for one-to-many relationships (but no so well for many-to-many relationships).
Each has it’s pros and cons.  The first approach is probably most familiar to relational database designers, and has the benefit of providing very granular access to each entity’s records, as well as the ability to use those native count() methods I mentioned earlier. The downside is that retrieving objects from multiple object stores (ie. the equivalent of SQL’s “JOIN” keyword) has to be done by manually merging the results of multiple requests.

The latter approach sacrifices granular access to child objects in exchange for simplified joins (i.e. get a program object, and you automatically get all of it’s child series and episode objects along with it).

For simplicity, I chose the first approach (separate object stores).

I wanted to compare the performance of a reasonably complex SQL query consisting of multiple joins, aggregates with a group by/having clause, an order by clause, and an in-line case expression.  An example of this from my tvmanager app is the initial view for the app called the “Schedule” view. The Schedule view lists any series that you either have, or expect to have, recorded to your DVR. The criteria for inclusion in the Schedule view is:
  • any series that is marked as nowShowing OR
  • any series that has at least one episode with a status of Recorded OR
  • any series that has at least one episode with a status of Expected
For each series in the Schedule view, we also include additional information from the parent program (ie. it’s name), as well as some summary aggregate data from it’s child episodes (total # watched, # recorded, #expected etc.), and finally we order the results by the day of week that they are shown, then by program name.

In WebSQL, we achieve all of this with a single query:

Note: there are no indexes of any of the tables in the WebSQL database. I did run some benchmarks after adding indexes on Episodes(EpisodeID, Status), Series(SeriesID) and Series(NowShowing); but there was little improvement, as the query essentially has to do a full table scan of the Series table.  But it’s worth noting anyway.
The code for the schedule() function for WebSQL looks like this:

An equivalent operation in IndexedDB is much more complex, requiring 3 + 3*N separate database requests (where N is the number of series that meet our above criteria) to retrieve all of the information.   Let’s look at each of these in more detail…

The first task is to identify the series objects that meet our criteria, which you’ll recall has three elements.
Firstly, we need to get the list of series objects that are regularly airing (i.e. have a non-null nowShowing property), which is pretty easy as we have an index for that (“series_nowShowing”). So we open a cursor using a key range with a lower bound of 0, iterate over the cursor and populate an associative array.  We use an associative array here rather than a normal array, because it’s likely that there will be duplicates when we get to our next steps, and we want to filter those out.

Secondly, we want the list of series objects that have at least one recorded episode. Again, we can do this using an index on the episodes store (“episodes_status”), and specifying a single key value of “Recorded”.

Finally, we want the list of series objects that have at least one expected episode.  This is identical to the above, just with a different key value.

For performance reasons (and, let’s be honest, because no browsers yet support IndexedDB’s synchronous API), we can fire off each of these three requests asynchronously, and wait for a signal from each to indicate that it has completed (this is the processCandidates.call(this) at the end of each request).

Now that we have our candidate series, we need to loop through these and collect the remaining information required for display (e.g. program name, episode counts etc.). We start by getting the series object itself.
Note: You might notice that I’m still using openCursor to get a single series object. I’m not sure if it’s a bug with Chrome, or I was doing something wrong, but I simply couldn’t get get() to work (I kept getting a DataError exception, which generally means that the key you’re attempting to get doesn’t exist, even though I knew the key existed). Using a cursor seemed to solve the problem, but probably incurs some extra overhead beyond a simple get.

Once we’ve got the required values from the series object (including the programId it belongs to), we get the related program object (again, using openCursor() because of the issue with get())
While we’re firing off a request to get the program, we can simultaneously do the request for the related episodes. We use a cursor to iterate over all of the episode objects for the series, and manually calculating the total/watched/recorded/expected counts.
As per earlier, at the end of each of our parallel requests we set a flag to indicate the request is complete, and call updatePopulatedCount(). Only when both requests for a given series are done will we update our progress counter.
Finally, once we have all the data we need, we sort the results in the same order as we did for the WebSQL version; before rendering the HTML table in the same way as we did for WebSQL.

The Results
As demonstrated above, the code for IndexedDB can be a lot more complicated, as we’re having to do a lot of the work ourselves in JavaScript that is otherwise provided natively by SQL.

That said though, WebSQL takes on average between ~750-850ms to complete the query and render the results; and IndexedDB takes on average ~300-350ms to render the exact same results. Thats over twice as fast.

At this point I should add the disclaimer that
  1. My timing method (diff between two JavaScript Dates) is not particularly scientific, and is at the mercy of the browser’s main event loop. Given that we’re measuring down to the millisecond, it’s probably not 100% accurate.
  2. The code could possibly be optimised more (either on the WebSQL or IndexedDB sides). I didn’t spend a great deal of time squeezing every last ounce of performance out of it.
If you spot something in the code that is causing the results to be skewed (one way or the other), please let me know.
  1. oharagroup posted this