Hi, I'm Adam. Email me about: CTOs, Mobile/iOS, Project Management, and Development

GDC08: SQL Considered Harmful

Summary

Speaker: Shannon Posniewski, Cryptic

I was expecting something shockingly naive and/or stupid from the title of the session. The first thing the speaker said was that the title was completely wrong, so I ran with that. With that out of the way, the talk was fine, although small things kept coming out during the talk that were hard to believe or worrying claims.

So it was going OK, until … right at the end, just before the Q&A, and partly during the Q&A, the speaker dropped some serious shockers:

– they’ve never actually used this custom DB: everything that the speaker says RDBMS’s can’t do he doesn’t yet know for sure that his proprietary DB server can do either
– they never bothered to try alternative SQL servers, they just used MS SQL Server
– they’d never heard of Versant until after they’d developed this proprietary DB – apparently they never did any evaulating of other possible DB engines
– they’ve created their own proprietary query language based on C and C syntax, and don’t find any value in using SQL

Disclaimer: I recently started to work for the company that published CoH/CoV, and apparently forced the speaker into using SQL and an RDBMS in the first place. I didn’t join the company until a long time after this had all happened, and I work in a different division in a different continent, so everything in this talk was new to me (although I admit I’d be fascinated to hear what NCsoft’s thoughts were on these development directions)

My own commentary in [ square brackets ], any mistakes/misunderstandings my own fault :).

Title

The title is deliberately wrong and provocative – it’s really that current vendor solutions are inappropriate for the needs of MMOs

Databases 101

[Adam: spent 15 mins explaining basics of modern database usage:

ACID explanation

Relational explanation

SQL explanation

Keys explanation]

SQL DBMSs are safe, tested, used by practically everyone, lots of experience of using them exists.

Why not use it [SQL + RDBMS]?

SQL is optimized for the general case, and is generally “OK” – it will never be terrible. It will always work..

Sometimes queries take a long time to setup – creating an index automatically, etc.

It self-optimizes on the fly with caching etc, but that introduces stalls at unpredictable times. Unpredictable latency caused by this is very bad because sometimes you e.g. try to pick up something in the game world from within the game client, and nothing happens, so user is confused – because of that latency.

[Adam: if this strikes you as very strange behaviour for an RDBMS, you are not alone. It might help to know that the end of the talk, the speaker revealed that they’d only ever used Microsoft SQL Server and hadn’t bothered trying out any alternative RDBMS’s, and didn’t really know much about the others]

Inserts are extremely slow – just for each single row that you insert. Because of all the overhead of checking for consistency etc, inserts are slow.

Once you’ve got data into the database, the per-row overhead to update anything in the row is also fairly expensive (although you can do several changes per row at once for almost no extra penalty).

Object-Relational Impedance Mismatch

Trying to map OO systems into Relational databases is painful and shoehorning it comes with lots of performance loss.

Every class I’ve ever seen in database architecture talks about normalizing all your data. Doing it that way – the “right way” – can really backfire on you.

[Adam: … apart from all the other classes in DBA, the ones that also talk about DE-normalizing data. I went to different database classes than the speaker, apparently :) ]

On COH we had this idea that we’d store statistics on each individual player – how many times did they pick stuff up, how many times did they kill people, how long they spent online, the number of times they trash-talked in chat, etc.

[Adam: i.e. gameplay metrics, a la the work done by Bungie with Halo heatmaps, or by various 3rd party companies at the moment in the games industry. Cool stuff]

We made a table where there was one row-insert per player PER piece of data that had changed. Some of those stats change all the time, like “how long have they been online”.

So the traffic to the database was extremely high, because each thing had to be updated constantly.

[Adam: Well … IME this isn’t the *normal* way of dealing with that use-case, so I’m not surprised it bombed. If they’d approached it in a more standard way, using simple multi-server database organization, I suspect it might have worked out fine from the start]

So … we threw out all the tranditional DB concepts, and made one column per stat, with no actual names for columns, instead just have “Stat 1 value”, “stat 2 value”, etc.

[Adam: I really don’t get how this ever even seemed like a sensible idea, but there was a long queue of people asking questions at the end so I didn’t manage to ask before I had to go to next session]

In the interests of performance, we were twisting our data, either in code, or in the DB, or in both places. This became a huge impedance to our development process.

Solution: Caching

Problems:
– unexpected unpredictable 1-second and multi-second latencies

[Adam: I’m really suspcipious about this problme – where is it coming from?]

Solution:
– created a server process that all DB calls from other server processes were indirected through
– it had a second thread that was busy writing stuff out to the DB
– appeared that all writes were instantaneous, but they were being buffered.
– i.e. write-through cache
– all changes you made in game-server memory would show instantly as having been committed

Problems2:
– throughput was still far too slow, although at least now there was no latency

Solution:
– instead of updating every stat every time it changes, we did “snapshots” and only occasionally save them
– …but means that there was no ACID properties being used any more except for Durability

Since that was shot anyway, we went whole-hog, basically. What we decided was we’d keep all the data for the players on a given map on that game server. The database was being used just as a storage engine instead of as a transaction engine. The business logic is now all inside the game-server, and we just periodically flush it all out to the DB.

Problems3:
– lost data
– server-server comms

Solution:
– flush data faster

Conclusion: we optimized the servers, the database engine, to the point that we lost all the benefits of RDBMSs. All the ACID stuff etc had just become a hindrance to us.

Final solution: Cryptic DB

Requirements
– regain ACID benefits
– low predictable latency
– high read/write throughput
– minimize code-db impedance (the schemas changed very often, and that was very painful)
– backup and restore easily

[Adam: suggests that hiring a professional DBA, and using a commercial OODB, would be the best option to start with]

Relaxations
– offline character access is unimportant
– domain integrity became the responsibility of the programmer
– general ad-hoc queries can be slow (we don’t use them)
– SQL not needed

Background – some stuff they’d already done

In CoX we took all static data (mission data, character class template etc) and didn’t put it into a DB. We stored it on disk and loaded it into memory manually.

To help us do this, we created a “StructParser” [just a serialization tool]. This was what we used to start building Cryptic DB. We wrote a pre-parser that looked at our C structs in the source code and wrote out parserss/serializers for it.

Using this for live data as well as static gave us benefits of not having two different storage systems in parallel.

[Adam: in general, that’s a pretty good thing to achieve]

Allowed us to create generic mutators, because we know the datatypes and memory layout. So, now we have a generic mutator database. Only simple operations at this point, like mathematical addition of a single counter.

Complicated things like checking whether to allow a trade (lots of rules and things to check) need business logic. We decided to sepaaret the data and logic away from the database, and put it into our game server.

A particular game server owns the data for an entity. E.g. a team server would know about teams, and anything that has to do with teams will be held on that server. It knows how to modify those entities.

[architecture diagram]

Most transactional actions are now being done locally only, so execute extremely fast.

To make it easier to develop, we created auto-transactions. These instrument and change the code so that transactions are written in C. We did field-level locking on the items we were modifying.

[C code with a bunch of special macros implementing a basic transaction, but mostly written in C]

[Adam: here he gave a simple example of distributed transaction processing (although he didn’t call it that), enough to convey the general idea, although he ommitted the details that make DTP hard / interesting technically]

Cryptic DB uses basic journal-based techniques for data consistency longterm.

Conclusion

We’ve tried stored procedures etc, doesn’t work. Other MMO devs have had to resort to using solid-state disks etc to make it work fast enough. So, don’t just believe what they say in database classes.

[Adam: the mention of SSD is disingenuous: only one MMO company has decided to try that, and their from Iceland so they’re a bit odd anyway (/me ducks and runs for cover) – but seriously for them it’s part of an overall game-design and/or technology strategy – they aimed for and achieved gameplay that Cryptic hasn’t (single-server world)]

Q: Did you look at federated data as a solution?

We looked a little, didn’t try to implement it, because we’d worked out a proprietary solution already that we reckoned would work.

Latency would still be a problem anyway with it.

Q: How do you handle corruption of data in your DTP?

There’s certain things we relax – we don’t worry about cosmic rays flipping bits randomly.

Within the database, everything is checksummed.

Q: what’s the throughput of your cross-server transactions? How many of your servers are local versus distributed?

I would say at a guess we have a very small number of distributed TP’s. We have a zoned games, so almost everything is local, probably less than 5%.

I have no idea how much slower cross-server transactions are. I don’t think they’re particuarly slower, the biggest problem is just the overhead of having to wait for the locking of data. Probably a couple of times slower, but not an order of magnitude.

Q: you’re saying that with your proprietary “Cryptic DB” you keep all data in memory. What do you do when you hit the limit of RAM in a machine?

It’s a pretty straightforward task to virtualize your memory. If you’re WoW you’ve got no chance of fitting it all into memory, so you only keep in memory the working set of characters that are online right now.

If you still run out, then we make a second shard. We’re talking tens and hundreds of thousands of simultaneous players before we get to that problem.

Q: Do you cluster machines to make the DB?

It’s one giant machine with a lot of memory in it

[Adam: which would suggest you could benefit from SSD, maybe… :P]

Q: Have you considered co-locating the data cache with the servers themselves?

No. We haven’t needed to consider it yet.

You can compress the data in memory and other tricks, so it’s not a problem yet.

Q: how do you do backups?

A set of three servers work together to make sure they can handle the load and so that they can ensure they are being back up correctly.

These are effectively zipped flat files, for simplicity. There’s one server frontend to rest of games, responding to read requests and making in-memory changes for transactions that come through. The second server is slaved to it, and every once in a while writes out an image of it’s DB in a safe way. The last server picks up the things that have been changed and merges them into the master DB. Master DB is then copied back to first and second servers.

At any one spot in the change the data is consistent within that one machine. There may be a huge journal at some point, but it will be consistent.

It turns out that MMO players care more about not being able to play than about losing ten minutes of game data. By about an order of magnitude.

When someone claims they lost a Sword of Death, don’t worry too much, just give them a new sword, because you save a lot of pain and money in customer support costs.

Q: Sounds like you’ve made an OODB, how does that compare with Objectivity or Versant? And the disadvantage of an OODB seems to be the loss of querying ability, so you can’t research behaviour or do metrics to examine your gameplay etc – how do you sovle that?

When we started working on this we didn’t know that Versant existed. We only found out when we first spoke about it publically.

Other ones we knew about but we already had what we wanted from a previous game that was fully tested by us and used in produciton. We were going to have to worry about learning the vendor specifics, and converting our data and pipeline stuff.

I’m also not convinced that an OODB would have worked for the stuff we’re trying to do. We’re writing a lot of code, and the fact you can write AUTO_TRANSACT at the top of a function, and knowing it will Just Work is worht it’s weight in gold.

Re: adhoc queries, we have created a proprietary query language that looks a lot more like C than it does like SQL.

If for some reason we lack a feature we need in our DB, it’s so easy for the DB to interact with C code that you could just write the damn thing in C and have the DB execute that.

Q: Is this in production?

No

Q: How do you test it?

We eat our own dogfood. The true test will be the next time we have a game that goes live.

We have a zillion stress tests, and run that stuff routinely, of course. Right now we just have test-harnesses that fake it.

Q: Could you have made Cryptic DB from the start, or did you only learn it by using SQL from the start? Did you get anyt benefit from SQL?

We got nothing from SQL, it was forced upon us by our publisher who forced us to use SQL. NCsoft were doing network operations and backups etc, so they said look we already know how to use it, and go ahead with it.

I have no idea if we’d have used it given free chance.

We learnt from it that variant latency is a killer, and learnt where the major bottlenecks were in MMO data writing, so we did learn a lot by working with SQL that was transferable knowledge.

Q: How do you handle verisoning on the data, like schema changes?

The actual DB if it sees a field that it doesn’t know, it just ignores it silently. If we add a field that isn’t there, it silently creates it.

We set things to zero or null to create them.

Q: Do you have a central contorl of the DB design, the schema?

Every programmer can add whatever fields they want. We haven’t run into any problems with it yet.

There’s a certain data structure which is the character definition. At the start of the project someone sat down and designed that, and that doesn’t change often. What does change is the occasional adding of a new data field every now and then [i.e. no additional tables being added]

Q: How much did it cost in money and hours to build?

Whole system from start to finish was two people for maybe a year.

Q: Did you try multiple implementations of your SQL server? Different vendors?

No, we did not. There was no indication to us why different ones would ever be faster, so it didn’t seem worth investigating it. I think it is problems endemic to all databases in general.

unless, maybe, you just spent huge amounts of time and effort on database administration. We don’t know how many players we will have, so we can’t actually define optimal schemas and things in advance.

One of our main goals was to spend more time on game coding and less on database coding.

Q: did you spend a lot of time profiling your SQL server?

I think we tried profiling everything, automatically created indexes, stored procedures, etc. We got maybe 15%, 20% improvement, and DBA’s think that’s great, but we were looking for multiple hundreds of percent improvement in performance.

Q: Does your new DB store quests?

It stores state for how you’re progressing through, but not yet the actual quest template data.

Q: Have you looked at any vertical file DB systems?

No.

It’s entirely possible that it’s much faster than using Relational. We were already used to using SQL so we kept in that direction.

Q: will you be releasing this as a third-party DB?

Maybe – but not at least until we’ve put it into production for ourselves!

16 thoughts on “GDC08: SQL Considered Harmful”

Comments are closed.