Categories
computer games conferences databases GDC 2008 massively multiplayer Uncategorized

GDC08: SQL Considered Harmful

Summary

Speaker: Shannon Posniewski, Cryptic

I was expecting something shockingly naive and/or stupid from the title of the session. The first thing the speaker said was that the title was completely wrong, so I ran with that. With that out of the way, the talk was fine, although small things kept coming out during the talk that were hard to believe or worrying claims.

So it was going OK, until … right at the end, just before the Q&A, and partly during the Q&A, the speaker dropped some serious shockers:

– they’ve never actually used this custom DB: everything that the speaker says RDBMS’s can’t do he doesn’t yet know for sure that his proprietary DB server can do either
– they never bothered to try alternative SQL servers, they just used MS SQL Server
– they’d never heard of Versant until after they’d developed this proprietary DB – apparently they never did any evaulating of other possible DB engines
– they’ve created their own proprietary query language based on C and C syntax, and don’t find any value in using SQL

Disclaimer: I recently started to work for the company that published CoH/CoV, and apparently forced the speaker into using SQL and an RDBMS in the first place. I didn’t join the company until a long time after this had all happened, and I work in a different division in a different continent, so everything in this talk was new to me (although I admit I’d be fascinated to hear what NCsoft’s thoughts were on these development directions)

My own commentary in [ square brackets ], any mistakes/misunderstandings my own fault :).

Title

The title is deliberately wrong and provocative – it’s really that current vendor solutions are inappropriate for the needs of MMOs

Databases 101

[Adam: spent 15 mins explaining basics of modern database usage:

ACID explanation

Relational explanation

SQL explanation

Keys explanation]

SQL DBMSs are safe, tested, used by practically everyone, lots of experience of using them exists.

Why not use it [SQL + RDBMS]?

SQL is optimized for the general case, and is generally “OK” – it will never be terrible. It will always work..

Sometimes queries take a long time to setup – creating an index automatically, etc.

It self-optimizes on the fly with caching etc, but that introduces stalls at unpredictable times. Unpredictable latency caused by this is very bad because sometimes you e.g. try to pick up something in the game world from within the game client, and nothing happens, so user is confused – because of that latency.

[Adam: if this strikes you as very strange behaviour for an RDBMS, you are not alone. It might help to know that the end of the talk, the speaker revealed that they’d only ever used Microsoft SQL Server and hadn’t bothered trying out any alternative RDBMS’s, and didn’t really know much about the others]

Inserts are extremely slow – just for each single row that you insert. Because of all the overhead of checking for consistency etc, inserts are slow.

Once you’ve got data into the database, the per-row overhead to update anything in the row is also fairly expensive (although you can do several changes per row at once for almost no extra penalty).

Object-Relational Impedance Mismatch

Trying to map OO systems into Relational databases is painful and shoehorning it comes with lots of performance loss.

Every class I’ve ever seen in database architecture talks about normalizing all your data. Doing it that way – the “right way” – can really backfire on you.

[Adam: … apart from all the other classes in DBA, the ones that also talk about DE-normalizing data. I went to different database classes than the speaker, apparently :) ]

On COH we had this idea that we’d store statistics on each individual player – how many times did they pick stuff up, how many times did they kill people, how long they spent online, the number of times they trash-talked in chat, etc.

[Adam: i.e. gameplay metrics, a la the work done by Bungie with Halo heatmaps, or by various 3rd party companies at the moment in the games industry. Cool stuff]

We made a table where there was one row-insert per player PER piece of data that had changed. Some of those stats change all the time, like “how long have they been online”.

So the traffic to the database was extremely high, because each thing had to be updated constantly.

[Adam: Well … IME this isn’t the *normal* way of dealing with that use-case, so I’m not surprised it bombed. If they’d approached it in a more standard way, using simple multi-server database organization, I suspect it might have worked out fine from the start]

So … we threw out all the tranditional DB concepts, and made one column per stat, with no actual names for columns, instead just have “Stat 1 value”, “stat 2 value”, etc.

[Adam: I really don’t get how this ever even seemed like a sensible idea, but there was a long queue of people asking questions at the end so I didn’t manage to ask before I had to go to next session]

In the interests of performance, we were twisting our data, either in code, or in the DB, or in both places. This became a huge impedance to our development process.

Solution: Caching

Problems:
– unexpected unpredictable 1-second and multi-second latencies

[Adam: I’m really suspcipious about this problme – where is it coming from?]

Solution:
– created a server process that all DB calls from other server processes were indirected through
– it had a second thread that was busy writing stuff out to the DB
– appeared that all writes were instantaneous, but they were being buffered.
– i.e. write-through cache
– all changes you made in game-server memory would show instantly as having been committed

Problems2:
– throughput was still far too slow, although at least now there was no latency

Solution:
– instead of updating every stat every time it changes, we did “snapshots” and only occasionally save them
– …but means that there was no ACID properties being used any more except for Durability

Since that was shot anyway, we went whole-hog, basically. What we decided was we’d keep all the data for the players on a given map on that game server. The database was being used just as a storage engine instead of as a transaction engine. The business logic is now all inside the game-server, and we just periodically flush it all out to the DB.

Problems3:
– lost data
– server-server comms

Solution:
– flush data faster

Conclusion: we optimized the servers, the database engine, to the point that we lost all the benefits of RDBMSs. All the ACID stuff etc had just become a hindrance to us.

Final solution: Cryptic DB

Requirements
– regain ACID benefits
– low predictable latency
– high read/write throughput
– minimize code-db impedance (the schemas changed very often, and that was very painful)
– backup and restore easily

[Adam: suggests that hiring a professional DBA, and using a commercial OODB, would be the best option to start with]

Relaxations
– offline character access is unimportant
– domain integrity became the responsibility of the programmer
– general ad-hoc queries can be slow (we don’t use them)
– SQL not needed

Background – some stuff they’d already done

In CoX we took all static data (mission data, character class template etc) and didn’t put it into a DB. We stored it on disk and loaded it into memory manually.

To help us do this, we created a “StructParser” [just a serialization tool]. This was what we used to start building Cryptic DB. We wrote a pre-parser that looked at our C structs in the source code and wrote out parserss/serializers for it.

Using this for live data as well as static gave us benefits of not having two different storage systems in parallel.

[Adam: in general, that’s a pretty good thing to achieve]

Allowed us to create generic mutators, because we know the datatypes and memory layout. So, now we have a generic mutator database. Only simple operations at this point, like mathematical addition of a single counter.

Complicated things like checking whether to allow a trade (lots of rules and things to check) need business logic. We decided to sepaaret the data and logic away from the database, and put it into our game server.

A particular game server owns the data for an entity. E.g. a team server would know about teams, and anything that has to do with teams will be held on that server. It knows how to modify those entities.

[architecture diagram]

Most transactional actions are now being done locally only, so execute extremely fast.

To make it easier to develop, we created auto-transactions. These instrument and change the code so that transactions are written in C. We did field-level locking on the items we were modifying.

[C code with a bunch of special macros implementing a basic transaction, but mostly written in C]

[Adam: here he gave a simple example of distributed transaction processing (although he didn’t call it that), enough to convey the general idea, although he ommitted the details that make DTP hard / interesting technically]

Cryptic DB uses basic journal-based techniques for data consistency longterm.

Conclusion

We’ve tried stored procedures etc, doesn’t work. Other MMO devs have had to resort to using solid-state disks etc to make it work fast enough. So, don’t just believe what they say in database classes.

[Adam: the mention of SSD is disingenuous: only one MMO company has decided to try that, and their from Iceland so they’re a bit odd anyway (/me ducks and runs for cover) – but seriously for them it’s part of an overall game-design and/or technology strategy – they aimed for and achieved gameplay that Cryptic hasn’t (single-server world)]

Q: Did you look at federated data as a solution?

We looked a little, didn’t try to implement it, because we’d worked out a proprietary solution already that we reckoned would work.

Latency would still be a problem anyway with it.

Q: How do you handle corruption of data in your DTP?

There’s certain things we relax – we don’t worry about cosmic rays flipping bits randomly.

Within the database, everything is checksummed.

Q: what’s the throughput of your cross-server transactions? How many of your servers are local versus distributed?

I would say at a guess we have a very small number of distributed TP’s. We have a zoned games, so almost everything is local, probably less than 5%.

I have no idea how much slower cross-server transactions are. I don’t think they’re particuarly slower, the biggest problem is just the overhead of having to wait for the locking of data. Probably a couple of times slower, but not an order of magnitude.

Q: you’re saying that with your proprietary “Cryptic DB” you keep all data in memory. What do you do when you hit the limit of RAM in a machine?

It’s a pretty straightforward task to virtualize your memory. If you’re WoW you’ve got no chance of fitting it all into memory, so you only keep in memory the working set of characters that are online right now.

If you still run out, then we make a second shard. We’re talking tens and hundreds of thousands of simultaneous players before we get to that problem.

Q: Do you cluster machines to make the DB?

It’s one giant machine with a lot of memory in it

[Adam: which would suggest you could benefit from SSD, maybe… :P]

Q: Have you considered co-locating the data cache with the servers themselves?

No. We haven’t needed to consider it yet.

You can compress the data in memory and other tricks, so it’s not a problem yet.

Q: how do you do backups?

A set of three servers work together to make sure they can handle the load and so that they can ensure they are being back up correctly.

These are effectively zipped flat files, for simplicity. There’s one server frontend to rest of games, responding to read requests and making in-memory changes for transactions that come through. The second server is slaved to it, and every once in a while writes out an image of it’s DB in a safe way. The last server picks up the things that have been changed and merges them into the master DB. Master DB is then copied back to first and second servers.

At any one spot in the change the data is consistent within that one machine. There may be a huge journal at some point, but it will be consistent.

It turns out that MMO players care more about not being able to play than about losing ten minutes of game data. By about an order of magnitude.

When someone claims they lost a Sword of Death, don’t worry too much, just give them a new sword, because you save a lot of pain and money in customer support costs.

Q: Sounds like you’ve made an OODB, how does that compare with Objectivity or Versant? And the disadvantage of an OODB seems to be the loss of querying ability, so you can’t research behaviour or do metrics to examine your gameplay etc – how do you sovle that?

When we started working on this we didn’t know that Versant existed. We only found out when we first spoke about it publically.

Other ones we knew about but we already had what we wanted from a previous game that was fully tested by us and used in produciton. We were going to have to worry about learning the vendor specifics, and converting our data and pipeline stuff.

I’m also not convinced that an OODB would have worked for the stuff we’re trying to do. We’re writing a lot of code, and the fact you can write AUTO_TRANSACT at the top of a function, and knowing it will Just Work is worht it’s weight in gold.

Re: adhoc queries, we have created a proprietary query language that looks a lot more like C than it does like SQL.

If for some reason we lack a feature we need in our DB, it’s so easy for the DB to interact with C code that you could just write the damn thing in C and have the DB execute that.

Q: Is this in production?

No

Q: How do you test it?

We eat our own dogfood. The true test will be the next time we have a game that goes live.

We have a zillion stress tests, and run that stuff routinely, of course. Right now we just have test-harnesses that fake it.

Q: Could you have made Cryptic DB from the start, or did you only learn it by using SQL from the start? Did you get anyt benefit from SQL?

We got nothing from SQL, it was forced upon us by our publisher who forced us to use SQL. NCsoft were doing network operations and backups etc, so they said look we already know how to use it, and go ahead with it.

I have no idea if we’d have used it given free chance.

We learnt from it that variant latency is a killer, and learnt where the major bottlenecks were in MMO data writing, so we did learn a lot by working with SQL that was transferable knowledge.

Q: How do you handle verisoning on the data, like schema changes?

The actual DB if it sees a field that it doesn’t know, it just ignores it silently. If we add a field that isn’t there, it silently creates it.

We set things to zero or null to create them.

Q: Do you have a central contorl of the DB design, the schema?

Every programmer can add whatever fields they want. We haven’t run into any problems with it yet.

There’s a certain data structure which is the character definition. At the start of the project someone sat down and designed that, and that doesn’t change often. What does change is the occasional adding of a new data field every now and then [i.e. no additional tables being added]

Q: How much did it cost in money and hours to build?

Whole system from start to finish was two people for maybe a year.

Q: Did you try multiple implementations of your SQL server? Different vendors?

No, we did not. There was no indication to us why different ones would ever be faster, so it didn’t seem worth investigating it. I think it is problems endemic to all databases in general.

unless, maybe, you just spent huge amounts of time and effort on database administration. We don’t know how many players we will have, so we can’t actually define optimal schemas and things in advance.

One of our main goals was to spend more time on game coding and less on database coding.

Q: did you spend a lot of time profiling your SQL server?

I think we tried profiling everything, automatically created indexes, stored procedures, etc. We got maybe 15%, 20% improvement, and DBA’s think that’s great, but we were looking for multiple hundreds of percent improvement in performance.

Q: Does your new DB store quests?

It stores state for how you’re progressing through, but not yet the actual quest template data.

Q: Have you looked at any vertical file DB systems?

No.

It’s entirely possible that it’s much faster than using Relational. We were already used to using SQL so we kept in that direction.

Q: will you be releasing this as a third-party DB?

Maybe – but not at least until we’ve put it into production for ourselves!

16 replies on “GDC08: SQL Considered Harmful”

“Whole system from start to finish was two people for maybe a year.”

My god… they spent 2 man years developing a database system that they think will out-perform systems that have spent thousands of man years in development? I have no doubt this will crash and burn.

It is a bit surprising if there was indeed no effort to look at existing products to handle their use case (which is not specific to games).

High volume of inserts is indeed a performance hot spot, and most generic relational database systems don’t deal with this all that well (relatively speaking). But then the nature of the data (for metrics, logs, and audits) is such that backing it with full relational model is not ideal anyway (most of it is either aggregate or sequential data which lends itself better to a “flat” record model).

There are significant industries that need to deal with this same issue (high level of concurrent inserts for logs and audits, with low and/or predictable latency) and major database vendors do offer products that are specifically tailored for this use case (MS SQL not being one of them).

Spending few weeks/months for research might have saved a year’s worth of implementation effort.

Well, first off, thank you for coming to my talk!

Every MMORPG developer I’ve spoken to has had major performance problems with their database. They have tried to solve this problem in several ways. I didn’t make a big enough point of this during the talk, I guess, but I was presenting what *Cryptic’s* experience was with a relational DB and what we decided to do to solve those problems. I wasn’t trying to sell anything or saying that “this is *the* solution.” It’s merely what we decided to do.

Another studio may have different needs (and competencies) from ours, and so their decision might be different. We did the best thing for us.

I could hardly miss a talk with such a provocative title :).

It’s really interesting to know what you tried, and I like that aspect, but a lot of people in the audience are going to take what you say as true, and a lot of what you did say flies in the face of standard practices that are easily researchable on the web these days.

I think if you focussed more on the “what we tried in order to solve the problems” BUT also went and researched other solutions to those kinds of problems, and presented the two sets of approaches side-by-side, and perhaps looked at why it was that you had so much trouble finding out that stuff in the first place, then most or all of my criticisms would disappear.

Count me as an MMORPG developer who hasn’t seen major performance problems with their database. On the other hand, we started off with a write-through cache layer between the game server and the database (that also stored snapshots every ~15 minutes rather than constant updates). Since then we’ve added a last-ditch save effort that writes data out to disk in the event of a crash, so that when the server comes back up, zero data is lost (naturally, total system failure isn’t accounted for, but that’s what the constant snapshots are for).

On the other hand, because we’re already using SQL Server as a storage engine, it’s fairly easy to mix in some relational data (in our case, the goal is to post-process saved data and make a second, non-authoritative copy in the database that is amenable to data mining) as we go along. The data mining can be done by pretty much anyone at the company with a little SQL experience, and because we get transactional replication for cheap (that is, it’s already coded), we can make offline copies of that data as well.

[Adam: I’m really suspcipious about this problme – where is it coming from?]

We’ve run into the same thing on our SQL Server installations. The SQL Server collects statistics and uses them to change the optimization plan used for queries. On a busy database this can introduce a delay of up to 30 seconds at the moment the plan changes.

SQL Server 2005 introduced a new option to do this statistic updating in the background so queries don’t get queued up behind it. CoH came out in 2004, though, so was a little late to do them any good.

It’s certainly possible (thought I personally don’t think it’s probable) that some other DB out there would have worked perfectly. I regret that I wasn’t clear enough about that in the talk.

I realize now that the talk should have emphasized the static data storage system we had more. It was the existence of this system which precipitated our approach to Cryptic DB. Since we already had a method of defining data structures with metadata and storing and retrieving them generically, we already had a big jump on the system. The DB impedance is practically zero, which is a huge boon.

Matthew: The write-through cache method operates well for City of Heroes/Villains and is still used today. Debugging it wasn’t fun, and we lost real ACID game transactions in the process. In our new engine, we need a lot more performance than that could provide (15x is the goal, IIRC) and we want game transactions (e.g. both sides of a trade) to be ACID to the DB. A write-through cache does not typically provide this.

I appreciate all your comments. If I ever give the talk again, I shall be certain to declare the shortcomings of the talk itself at the outset. :-)

Matthew: Just read about your approach on your site, which is mainly to use Blobs for storing data. We had a mandate to expose all the internal fields by the publisher so they could build tools, which they never did :-), so Blobs weren’t allowed. The Cryptic DB approach is to speak “Blob” natively (more or less).

It’s not just this talk that argumes generic RDBMSs like MSSQL can do everything badly and more specialised engines are needed for different functions.

This paper is an interesting take on how a specific OLTP engine could be designed:
http://www.vldb.org/conf/2007/papers/industrial/p1150-stonebraker.pdf

Summary: Current RDBMSs are provably useless for everything but OLTP and actually they’re rubbish at that too. Suggested replacement (called H-Store) OLTP design using distributed nodes with all data in main memory. No redo log (failed machines rebuild from good ones), undo log in main memory and not persisted beyond completion of transaction. Clustering gives availability and scalability advantages. Since OLTP transactions are light-weight and fast, make it single threaded for each node and lose multi-threading management overhead. Assumptions made about the types of transactions to be run (no ad-hoc queries, no analysis queries etc.) and query planning and execution simplified based on that. They’ve built a prototype that wins at TPC-C by a factor of 82.

It’s not actually that far from Crypic DB, if I understand it correctly.

Personally, having historical game metrics in SQL accessable relational databases seems like the most sensible approach. Whether the write-heavy ‘live’ game database should also be in a conventional disk based RDBMS is worth discussion I think.

Heyo. Ben Zeigler from Cryptic Studios. I was the other guy answering questions after the talk (Shannon’s a way better speaker than me, trust me :) ). Anyway, I just put up some responses to some of the general comments people have at my blog.

Another note is that I probably overstated it when I said “Whole System in 2 programmer years”. What I meant was the Transaction and DB part of the system. It was built on top of a data serialization engine developed and in production in CoH. Basically, it was 2 programmer years to transition from a robust data parser to a high-performance DB and transaction engine.

Okay, I just got around to reading this post for the first time today. Let me just say that the way COH initially did their stat collection was bad, but the switch to generic stat fields was WAY WORSE. Holy crap, who thought of that? Oh wait, I know: it was a programmer who thought of that. Not a data analyst. Anyone who actually wants to write queries to report off a database would have kicked and screamed until they did something more sensible.

What’s with the obsession with having one unified table for all your stats? That’s insane. It’s a fucking RELATIONAL database.

I can only hope that the talks I have given on this have steered developers away from making those mistakes in the future.

..”one column per stat, with no actual names for columns, instead just have “Stat 1 value”, “stat 2 value””….that sounds very like traditional OLAP design to me, which is all about structuring data ready for analysis. The “how to get proper stats info” is an entirely different problem to the “how to highly performant db read-write”.

Comments are closed.