This year I’m working on two different projects that need an Entity System (ES). One of them is a non-game app written natively on iOS + Android. The other is an FPS in Unity3D.
There are good, basic Open-Source ES’s out there today (and cf. the sidebar there). I tried porting a few, but none of them were optimized for performance, and most of them were too tightly coupled to a single programming language or platform. I’ve started a new ES of my own – Aliqua.org – to fix these problems, and I’m already using it in an app that’s in alpha-testing.
I’ll be blogging experiences, challenges, and ideas as I go.
2016 Update: Support me on Patreon, writing about Entity Systems and sharing tech demos and code examples
Background: focus on ES theory, or ES practice?
If you’re new to ES’s, you should read my old blog posts (2007 onwards), or some of the source code + articles from the ES wiki.
My posts focussed on theory: I wanted to inspire developers, and get people using an ES effectively. I was fighting institutionalised mistakes – e.g. the over-use of OOP in ES development – and I wrote provocatively to try and shock people out of their habits.
But I avoided telling people “how” to implement their ES. At the extreme, I feared it would end up specifying a complete Game Engine:
Give a man a game engine and he delivers a game. Teach a man to make a game engine and he never delivers anything.
— Juuso ○° (@sandbaydev) November 20, 2013
…OK. Fine. 7 years later, ES’s are widely understood and used well. It’s time to look at the practice: how do you make a “good” ES?
NB: I’ve always assumed that well-resourced teams – e.g. AAA studios – need no help writing a good ES. That’s why I focussed on theory: once you grok it, implementation concerns are no different from writing any game-engine code. These posts are aimed at non-AAA teams: those who lack the money (or expertise) to make an optimized ES first time around.
For my new ES library, I’m starting with the basics: Data Structures, and how you store your ES data in memory.
Where you see something that can be done better – please comment!
Aside on Terminology: “Processors, née Systems”
ES “Systems” should be batch-processing algorithms: you give them an array/stream of homogeneous data, and they repeat one algorithm on each row/item. Calling them “Processors” instead of “Systems” reduces confusion.
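For concreteness, here’s a minimal sketch in C of what that shape looks like – the names are mine, purely illustrative:

```c
#include <stddef.h>

typedef struct { float x, y; } Position;

/* A Processor: one algorithm, repeated over a homogeneous array of rows.
   No per-object virtual calls, no heterogeneous scene-graph walking. */
void example_processor(Position *rows, size_t count, float dt)
{
    for (size_t i = 0; i < count; i++) {
        rows[i].x += 1.0f * dt;  /* trivial per-row work, purely illustrative */
    }
}
```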
Why care about Data Structures?
There is a tension at the heart of Entity Systems:
- In an ES game, we design our code to be Functional: independent, data-oriented, highly efficient for streaming, batching, and multi-threaded execution. Individual Processors should be largely independent, and easy to split out onto different CPU cores.
- With the “Entity” (ID) itself, we tie those Functional chunks together into big, messy, inter-dependent, cross-functional … well, pretty much: BLOBs. And we expect everything to Just Work.
If our code/data were purely independent, we’d have many options for writing high-performance code in easy ways.
If our data were purely chunked, fixed at compile-time, we’d have tools that could auto-generate great code.
But combining the two, and muddling it around at runtime, poses tricky problems. For instance:
- Debugging: we’ve gone from clean, separable code you can hold in your head … to amorphous chunks that keep swelling and contracting from frame-to-frame. Ugh.
- Performance: we pretend that ES’s are fast, cache-efficient, streamable … but at runtime they’re the opposite: re-assembled every frame from their constituent parts, scattered all over memory
- Determinism: BLOBs are infamously difficult to reason about. How big? What’s in them? What’s the access cost? … we probably don’t know.
With a little care, ES’s handle these challenges well. Today I’m focussing on performance. Let’s look at the core need here:
- Each frame, we must:
- Iterate over all the Processors
- For each Processor:
- Establish what subset of Entity/Component blobs it needs (e.g. “everything that has both a Position and a Velocity”)
- Select that from the global Entity/Component pool
- Send the data to the CPU, along with the code for the Processor itself
The easiest way to implement selection is to use Maps (aka Associative Arrays, aka Dictionaries). Each Processor asks for “all Components that meet [some criteria]”, and you jump around in memory, looking them up and putting them into a List, which you hand to the Processor.
But Maps scatter their data randomly across RAM, by design. And the words “jump around in memory” should have every game-developer whimpering: performance will be bad, very bad.
NB: my original ES articles not only use Maps, but give complete source implementations using them. To recap: even in 2011, Android phones could run realtime 30 FPS games using this. It’s slow – but fast enough for simple games
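For illustration, Map-based selection looks roughly like the sketch below. Everything here – HashMap, List, and their functions – is an assumed, hypothetical API, not a real library:

```c
#include <stddef.h>

typedef int EntityID;
typedef struct { float x, y; }   Position;
typedef struct { float dx, dy; } Velocity;

typedef struct HashMap HashMap;  /* hypothetical map type         */
typedef struct List    List;     /* hypothetical result-list type */
void *hashmap_get(HashMap *map, EntityID id);           /* assumed API */
void  list_append(List *out, Position *p, Velocity *v); /* assumed API */

/* For each Entity, look up each Component it needs. Every hashmap_get
   is a jump to an effectively random RAM address. */
void select_for_movement(HashMap *positions, HashMap *velocities,
                         EntityID *entities, size_t count, List *out)
{
    for (size_t i = 0; i < count; i++) {
        Position *p = hashmap_get(positions,  entities[i]);
        Velocity *v = hashmap_get(velocities, entities[i]);
        if (p && v)                 /* “has both Position and Velocity” */
            list_append(out, p, v); /* the List handed to the Processor */
    }
}
```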
Volume of data in an ES game
We need some figures as a reference point. There’s not enough detailed analysis of ES’s in particular, so a while back I wrote an analysis of Components needed to make a Bomberman clone.
…that’s effectively a high-end mobile game / mid-tier desktop game.
Reaching back to 2003, we also have the slides from Scott Bilas’s GDC talk on Dungeon Siege.
…that’s effectively a (slightly old) AAA desktop game.
From that, we can predict:
- Number of Component-types: 50 for AA, 150 for AAA
- Number of unique assemblages (sets of Component-types on an Entity): 1k for AA, 10k for AAA
- Number of Entities at runtime: 20k for AA, 100k for AAA
- Size of each Component in bytes: 10-50 primitives * 64 bits each ≈ 80-400 bytes (call it ~100-500 with overhead)
How do OS’s process data, fast?
In a modern game, the sheer volume of data slows the computer to a crawl – unless you co-operate with the OS and the hardware. This is true of all games. CPU and RAM both run at a multiple of the bus-speed – reads and writes are massively slow compared to the CPU’s execution speed.
OS’s reduce this problem by pre-emptively reading chunks of memory and caching them on-board the CPU (or near enough). If the CPU is processing M1, it probably wants M2 next. You transfer M2 … Mn in parallel, and if the CPU asks for them next, it doesn’t have to wait.
Similarly, RAM hardware reads whole rows of data at once, and can transfer them faster than if you asked for each byte individually.
Net effect: Contiguous memory is King
If you store your data contiguously in RAM, it’ll be fast onto the Bus, the CPU will pre-fetch it, and it’ll remain in cache long enough for the CPU(s) to use it with no extra delays.
NB: this is independent of the programming-language you’re using. In C/C++ you can directly control the data flow, and manually optimize CPU-caching – but whatever language you use, it’ll be compiled down to something similar. Careful selection and use of data-structures will improve CPU/cache performance in almost all languages
But this requires that your CPU reads and writes that data in increasing order: M1, M2, M3, …, M(n).
With Data Structures, we’ll prioritize meeting these targets:
- All data will be as contiguous in RAM as possible; it might not be tightly-packed, but it will always be “in order”
- All EntitySystem Processors will process their data – every frame (tick) – in the order it sits in RAM
- NOTE: a huge advantage of ES’s (when used correctly) is that they don’t care what order you process your gameobjects. This simplifies our performance problems
- Keep the structures simple and easy to use/debug
- Type-safety, compile-time checks, and auto-complete FTW.
The problem in detail: What goes wrong?
When talking about ES’s we often say that they allow or support contiguous data-access. What’s the problem? Isn’t that what we want?
NB: I’ll focus on C as the reference language because it’s the closest to raw hardware. This makes it easier to describe what’s happening, and to understand the nuances. However, these techniques should also be possible directly in your language of choice. e.g. Java’s ByteBuffer, Objective-C’s built-in C, etc.
Usually you see examples like a simple “Renderer” Processor:
- Reads all Position components
- (Position: { float x, float y })
- Each tick, draws a 10px x 10px black square at the Position of each Component
We can store all Position components in a tightly-packed Array:
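(A sketch of that layout in C, using the 20k-entity figure from above; draw_black_square stands in for your real rendering call:)

```c
#include <stddef.h>

#define MAX_POSITIONS 20000

typedef struct { float x, y; } Position;  /* 8 bytes, no padding */

void draw_black_square(float x, float y, int w, int h);  /* stand-in */

Position positions[MAX_POSITIONS];  /* one contiguous block of RAM */
size_t   position_count;

void renderer_tick(void)
{
    /* Strictly in-order reads – M1, M2, M3, … – ideal for prefetching */
    for (size_t i = 0; i < position_count; i++)
        draw_black_square(positions[i].x, positions[i].y, 10, 10);
}
```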
This is the most efficient way a computer can store / process them – everything contiguous, no wasted space. It also gives us the smallest possible memory footprint, and lets the RAM + Bus + CPU perform at top speed. It probably runs as fast or faster than any other engine architecture.
But … in reality, that simple case is rare.
The hard case: One Processor reads/writes multiple Component-types
To see why, think about how we’d update the Positions. Perhaps a simple “Movement” Processor:
- Reads all Position components and all Velocity components
- (Position: { float x, float y })
- (Velocity: { float dx, float dy })
- Each tick, scales Velocity.dx by frame-time, and adds it to Position.x (and repeats for .dy / .y)
- Writes the results directly to the Position components
“Houston, we have a problem”
This is no longer possible with a single, purely homogeneous array. There are many ways we can go from here, but none of them are as trivial or efficient as the tight-packed array we had before.
Depending on our Data Structure, we may be able to make a semi-homogeneous array: one that alternates “Position, Velocity, Position, Velocity, …” – or even an array-of-structs, with a struct that wraps: “{ Position, Velocity }”.
…or maybe not. This is where most of our effort will go.
The third scenario: Cross-referencing
There’s one more case we need to consider. Some games (for instance) let you pick up items and store them in an inventory. ARGH!
…this gives us an association not between Components (which we could handle by putting them on the same Entity), but between Entities.
To act on this, one of our Processors will be iterating across contiguous memory and will suddenly (unpredictably) need to read/write the data for a different Entity (and probably a different ComponentType) elsewhere.
This is slow and problematic, but it only happens thousands of times per second … while the other cases happen millions of times (they have to read EVERYTHING, multiple times – once per Processor). We’ll optimize the main cases first, and I’ll leave this one for a later post.
Iterating towards a solution…
So … our common-but-difficult case is: Processors reading multiple Components in parallel. We need a good DS to handle this.
Iteration 1: a BigArray per ComponentType
The most obvious way forwards is to store the EntityID of each row into our Arrays, so that you can match rows from different Arrays.
If we have a lot of spare memory, instead of “tightly-packing” our data into Arrays, we can use the array-index itself as the EntityID. This works because our EntityID’s are defined as integers – the same as an array-index.
Usage algorithm:
- For iterating, we send the whole Array at once
- When a Processor needs to access N Components, we send it N * big-arrays
- For random access, we can directly jump to the memory location
- The Memory location is: (base address of Array) + (Component-size * EntityID)
- The base-address can easily be kept/cached with the CPU while iterating
- Bonus: Random access isn’t especially random; with some work, we could optimize it further
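A sketch of Iteration 1 in C, using the AA-game figures from earlier (names illustrative):

```c
#define MAX_ENTITIES 20000  /* the array-index IS the EntityID */

typedef struct { float x, y; }   Position;
typedef struct { float dx, dy; } Velocity;

/* One BigArray per ComponentType. Slots for Entities that lack a
   Component are simply wasted space – the memory trade-off. */
Position positions[MAX_ENTITIES];
Velocity velocities[MAX_ENTITIES];

void movement_processor(float dt)
{
    for (int e = 0; e < MAX_ENTITIES; e++) {
        /* Random access = (base address of Array) + (Component-size * EntityID),
           done implicitly here by the array indexing: */
        positions[e].x += velocities[e].dx * dt;
        positions[e].y += velocities[e].dy * dt;
    }
}
```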
Problem 1: Blows the cache
This approach works for our “simple” scenario (1 Component / Processor). It seems to work for our “complex” case (multiple Components / Processor) – but in practice it fails.
We iterate through the Position array, and at each line we now have enough info to fetch the related row from the Velocity array. If both arrays are small enough to fit inside the CPU’s L1 cache (or at least the L2), then we’ll be OK.
Each instance is 500 bytes
Each BigArray has 20k entries
Total: 10 MegaBytes per BigArray
This quickly overwhelms the caches (even an L3 Cache would struggle to hold a single BigArray, let alone multiple). The net effect depends a lot on both the algorithm (does it read both arrays on every row? every 10th row?), and the platform (how does the OS handle RAM reads when the CPU cache is overloaded?).
We can optimize this per-platform, but I’d prefer to avoid the situation.
Problem 2: memory usage
Our BigArrays will need to be approximately 10 megabytes each:
For 1 Component type: 20,000 Entities * 50 variables * 8 bytes each = 8 MB
…and that’s not so bad. Smaller components will give smaller BigArrays, helping a bit. And with a maximum of 50 unique ComponentTypes, we’ve got an upper bound of 500 MB for our entire ES. On a modern desktop, that’s bearable.
But if we’re doing mobile (Apple devices in 2014 still ship with 512 MB RAM), we’re way too big. Or if we’re doing dynamic textures and/or geometry, we’ll lose a lot of RAM to them, and be in trouble even on desktop.
Problem 3: streaming cost
This is tied to RAM usage, but sometimes it presents a bottleneck before you run out of memory.
The data has to be streamed from RAM to the CPU. If the data is purely contiguous (for each component-type, it is!), this will be “fast”, but … 500 MB data / frame? DDR3 peaks around 10 Gigabytes / second, i.e.:
Peak frame rate: 20 FPS … divided by the number of Processors
1 FPS sound good? No? Oh.
Summary: works for small games
If you can reduce your entity count by a factor of 10 (or even better: 100), this approach works fine.
- Memory usage was only slightly too big; a factor of 10 reduction and we’re fine
- CPU caching algorithms are often “good enough” to handle this for small datasets
The current build of Aliqua is using this approach. Not because I like it, but because it’s extremely quick and easy to implement. You can get surprisingly far with this approach – MyEarth runs at 60 FPS on an iPad, with almost no detectable overhead from the ES.
Iteration 2: the Mega-Array of Doom
Even on a small game, we often want to burst up to 100,000+ Entities. There are many things we could do to reduce RAM usage, but our biggest problem is the non-contiguous data (multiple independent Arrays). We shot ourselves in the foot. If we can fix that, our code will scale better.
In an ideal world, the CPU wants us to interleave the components for each Entity. i.e. all the Components for a single Entity are adjacent in memory. When a Processor needs to “read from the Velocity and write to the Position”, it has both of them immediately to hand.
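As a sketch: for the Position + Velocity pair, an interleaved (array-of-structs) layout might look like this – with the caveat, explained below, that it only helps this one pairing (sizes illustrative):

```c
typedef struct { float x, y; }   Position;
typedef struct { float dx, dy; } Velocity;

/* Interleaved: each Entity’s Components sit adjacent in memory */
typedef struct {
    Position pos;
    Velocity vel;
} MovementRow;

MovementRow mega[20000];

void movement_processor(float dt)
{
    for (int e = 0; e < 20000; e++) {
        /* pos and vel arrive together on the same cache line(s) */
        mega[e].pos.x += mega[e].vel.dx * dt;
        mega[e].pos.y += mega[e].vel.dy * dt;
    }
}
```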
Problem 1: Interleaving only works for one set at a time
If we interleave “all Position’s with all Velocity’s”, we can’t interleave either of them with anything else. The Velocity’s are probably being generated by a different Processor – e.g. a Physics Engine – from yet another ComponentType.
So, ultimately, the mega-array only lets us optimize one Processor – all the rest will find their data scattered semi-randomly across the mega-array.
NB: this may be acceptable for your game; I’ve seen cases where one or two Processors accounted for most of the CPU time. The authors optimized the DS for one Processor (and/or had a duplicate copy for the other Processor), and got enough speed boost not to worry about the rest
Summary: didn’t really help?
The MegaArray is too big, and it’s too interconnected. In a lot of ways, our “lots of smaller arrays – one per ComponentType” was a closer fit. Our Processors are mostly independent of one another, so our ideal Data Structure will probably consist of multiple independent structures.
Perhaps there’s a halfway-house?
Iteration 3: Add internal structure to our MegaArray
When you use an Entity System in a real game, and start debugging, you notice something interesting. Most people start with an EntityID counter that increases by 1 each time a new Entity is created. A side-effect is that the layout of components on entities becomes a “map” of your source code, showing how it executed, and in what order.
e.g. With the Iteration-1 BigArrays, my Position’s array might look like this:
- First entity was an on-screen “loading” message, that needed a position
- BLANK (next entity holds info to say if loading is finished yet, which never renders, so has no position)
- BLANK (next entity is the metadata for the texture I’m loading in background; again: no position)
- Fourth entity is a 3d object which I’ll apply the texture to. I create this once the texture has finished loading, so that I can remove the “loading” message and display the object instead
- …etc
If the EntityID’s were generated randomly, I couldn’t say which Component was which simply by looking at the Array like this. Most ES’s generate ID’s sequentially because it’s fast, it’s easy to debug (and because “lastID++;” is quick to type ;)). But do they need to? Nope.
If we generate ID’s intelligently, we could impose some structure on our MegaArray, and simplify the problems…
- Whenever a new Entity is created, the caller gives a “hint” of the Component Types that entity is likely to acquire at some time during this run of the app
- Each time a new unique hint is presented, the EntitySystem pre-reserves a block of EntityID’s for “this and all future entities using the same hint”
- If a range runs out, no problem: we add a new range to the end of the MegaArray, with the same spec, and record it in our mini-table of reserved ranges
- Per frame, per Processor: we send the set of ranges within the MegaArray that are needed. The gaps will slow down the RAM-to-CPU transfer a little – but not much
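A sketch of that scheme in C – the hint encoding, table sizes, and the reserve_new_range helper are all my assumptions:

```c
#include <stdint.h>

/* Hint: bitmask of the ComponentTypes this Entity expects to use,
   e.g. bit 0 = Position, bit 1 = Velocity, … (assumed encoding) */
typedef uint32_t Hint;

typedef struct {
    Hint hint;       /* which “template” this range holds       */
    int  first_id;   /* first EntityID reserved for this range  */
    int  next_free;  /* next unassigned EntityID in the range   */
    int  end_id;     /* one past the last EntityID in the range */
} IDRange;

#define MAX_RANGES 1024
IDRange mini_table[MAX_RANGES];
int     range_count;

int reserve_new_range(Hint hint);  /* assumed helper: appends a block at
                                      the end of the MegaArray */

int create_entity(Hint hint)
{
    /* Reuse an existing range for this hint, if it has free IDs left */
    for (int r = 0; r < range_count; r++) {
        IDRange *rng = &mini_table[r];
        if (rng->hint == hint && rng->next_free < rng->end_id)
            return rng->next_free++;
    }
    /* New hint, or all its ranges are full: reserve a fresh range */
    return reserve_new_range(hint);
}
```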
Problem 1: Heterogeneity
Problem 1 from the MegaArray approach has been improved, but not completely solved.
When a new Entity is created that intends to have Position, Velocity, and Physics … do we include it as “Pos, Vel”, “Pos, Phys” … or create a new template, and append it at end of our MegaArray?
If we include it as a new template, and insist that templates are authoritative (i.e. the range for “Pos, Vel” templates only includes Entities with those Components, and no others) … we’ll rapidly fragment our mini-table. Every time an Entity gains or loses a Component, it will cause a split in the mini-table range.
Alternatively, if we define templates as indicative (i.e. the range for “Pos, Vel” contains things that are usually, but not always Pos + Vel combos), we’ll need some additional info to remember precisely which entities in that range really do have Pos + Vel.
Problem 2: Heterogeneity and Fragmentation from gaining/losing Components
When an Entity loses a Component, or gains one, it will mess up our mini-table of ranges. The approach suggested above will work, but the mini-table will tend to get more and more fragmented over time. Eventually every range is only one item long. At that point, we’ll be wasting a lot of bus-time and CPU-cache simply tracking which Entity is where.
NB: As far as I remember, it’s impossible to escape Fragmentation when working with dynamic data-structures – it’s a fundamental side-effect of mutable data. So long as our fragmentation problems are “small”, I’ll be happy.
Problem 3: Heterogeneity and Finding the Components within the Array
If we know that “Entity 4” starts at byte-offset “2048”, and might have a Position and Velocity, that’s great.
But where do we find the Position? And the Velocity?
They’re at “some” offset from 2048 … but unless we know all the Components stored for Entity 4 … and what order they were appended / replaced … we have no idea which. Raw array-data is typeless by nature…
Iteration 4: More explicit structure; more indexing tables
We add a table holding “where does each Entity start”, and tables for each Component stating “offset for that Component within each Entity”. Conveniently, this also gives us a small, efficient index of “which Entities have Component (whatever)”:
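(Something like this sketch in C – the sizes and the sentinel value are my assumptions:)

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_ENTITIES     128000
#define MEGA_ARRAY_BYTES (64 * 1024 * 1024)  /* illustrative */
#define NO_COMPONENT     0xFFFF  /* sentinel: Entity lacks this Component */

typedef struct { float x, y; } Position;

uint8_t  mega[MEGA_ARRAY_BYTES];

/* “Where does each Entity start?” – byte offset into the MegaArray */
uint32_t entity_start[MAX_ENTITIES];

/* Per ComponentType: byte offset of that Component *within* its Entity.
   Scanning for entries != NO_COMPONENT doubles as the index of
   “which Entities have a Position”. */
uint16_t position_offset[MAX_ENTITIES];

static inline Position *get_position(int entity)
{
    if (position_offset[entity] == NO_COMPONENT)
        return NULL;
    return (Position *)(mega + entity_start[entity]
                             + position_offset[entity]);
}
```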
Problem 1: non-contiguous data!
To iterate over our gameobjects, we now need:
- One big mega-array (contiguous)
- N x mini arrays (probably scattered around memory)
Back to square one? Not quite – the mini-arrays are tiny. If we assume a limit of 128,000 entities, and at most 8kb of data for all Components on an Entity, our tables will be:
[ID: 17bits][Offset: 13 bits] = 30 bits per Component
…so that each mini-array is 1-40 kB in size. That’s small enough that several could fit in the cache at once.
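Packing those entries is simple bit-twiddling; a sketch (rounding the 30 bits up to a 32-bit word, which is what you’d likely do in practice):

```c
#include <stdint.h>

/* [ID: 17 bits][Offset: 13 bits] in one 32-bit word.
   17 bits covers 128k entities; 13 bits covers 8 kB of offsets. */
static inline uint32_t pack_entry(uint32_t entity_id, uint32_t byte_offset)
{
    return (entity_id << 13) | (byte_offset & 0x1FFF);
}

static inline uint32_t entry_id(uint32_t entry)     { return entry >> 13; }
static inline uint32_t entry_offset(uint32_t entry) { return entry & 0x1FFF; }
```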
Good enough? Maybe…
At this point, our iterations are quite good, but we’re seeing some recurring problems:
- Re-allocation of arrays when Components are added/removed (I’ve not covered this above – if you’re not familiar with the problem, google “C dynamic array”)
- Fragmentation (affects every iteration after Iteration 1, which doesn’t get any worse simply because it’s already as bad as it could be)
- Cross-referencing (which I skipped)
I’ve also omitted history-tracking – none of the DS’s above facilitate snapshots or deltas of game state. This doesn’t matter for e.g. rendering – but for e.g. network code it becomes extremely important.
There’s also an elephant in the room: multi-threaded access to the ES. Some ES’s, and ES-related engines (*cough*Unity*cough*), simply give up on MT. But the basis of an ES – independent, stateless, batch-oriented programming – is perfect for multi-threading. So there must be a good way of getting there…
…which gives me a whole bunch of things to look at in future posts :).
PS … thanks to:
Writing these things takes ages. So much to say, so hard to keep it concise. I inflicted early drafts of this on a lot of people, and I wanted to say “thanks, guys” :). In no particular order (and sorry in advance if final version cut bits you thought should be in there, or vice versa): TCE’ers (especially Dog, Simon Cooke, doihaveto, archangelmorph, Hypercube, et al), ADB’ers (Amir Ebrahimi, Yggy King, Joseph Simons, Alex Darby, Thomas Young, etc). Final edit – and any stupid mistakes – are mine, but those people helped a lot with improving, simplifying, and explaining what I was trying to say.
29 replies on “Data Structures for Entity Systems: Contiguous memory”
It is hard to believe there is any solution other than giving up and using objects. Not as performant as primitive arrays, but it gives you fewer headaches.
What happens when, for example, you want to have Behaviour Trees for the artificial intelligence? Tons of objects linked to each other, and every entity has one! I hope you talk about that too in your next post. Keep up the good work!
To be clear: this is aimed at super-fast ES’s. Something potentially usable in a AAA game (although AAA teams will be doing smarter / more complex DS’s than this, it’s at least in the right ballpark). In that case, you shouldn’t be storing fat data in Components.
You *can* do it – it’s fine (and hey – I do it quite often ;)) – but when you do, you should be aware it destroys any chance you had at top-notch performance. In many cases, you might not care (e.g. when prototyping, I will often wrap fat Classes from 3rd party libraries, knowing that I can replace them later with a more thoughtful, higher-performance representation *if I need to*).
An easy thing you can do to mitigate this, and get best of both worlds, is to have two parallel ES’s in memory – one with fat/low performance data, one with high-performance data. You lose some performance in having to maintain references across two spaces – i.e. “pairs” of entities where you would have had “only one entity” … but if it’s a case of giving your render thread a speed boost, and shunting the AI logic into the background, that might work for you.
Great article Adam – and thanks for the name drop :) Wish I had the time to try it all out…
Loved this article! My ES performance requirements aren’t yet at this level, but it’s still thought-provoking and informative to read about cache considerations. More like this, please!
Interesting article. Reflects precisely where traditional OO architectures conflict with embedded software design. I would suggest taking a look at database techniques to improve join performance.
We frequently used hot/cold data partitions in our AAA RPG game engine, and hierarchical structures on top of that to clip entity groups as needed. Implementing hybrid approaches also works: loading sets of entities in a contiguous block to speed up processing, and offloading them when they are less frequently used. Another trick that can sometimes be used is to store action ID’s and use delta encoding, instead of storing actual values. Maybe that height information does not need to be stored at all; maybe velocity can be mapped onto a fixed curve or plane and be indexed. The divide between memory and processor bandwidth has only increased, and will continue to increase as processors and threads double in speed and number, so it makes sense to “store less and do more”.

Almost always, ES architectures start with ‘good software design principles’, which is valuable. Usually, though, entities end up as purely logically connected data, with the only thing binding them together being their handle or id. The vector-processing nature of embedded devices and the performance advantage of contiguous memory both drive this shift.
@Starshine thanks, lots of interesting things you mention there. Did you guys blog any of it, or give conference talks?
Lots of interesting stuff in this article and it’s nice to see some discussion on ES implementation details.
I’m starting to think the biggest problem with Entity Systems is managing data fragmentation.
Take sparse arrays, for example: it’s the fragmentation within the array that will cause performance problems, not so much having multiple arrays. Accessing more than one array isn’t a problem for the cache, as long as the data is accessed in a predictable and linear fashion – or at least close to it – which should be the case for a batch-processing system. Well, at least in theory :)
In practice sparse arrays and similar data structures seem to be good enough, so far.
@DocLazy – well, fragmentation is one of those “great, unsolved problems of practical Computer Science”, isn’t it? :). Whatever you do, sooner or later, a long-running system that uses real hardware hits fragmentation problems – or else pays a cost in constantly shunting bits to different physical locations.
(AFAIAA – maybe there’s been great leaps and bounds in this area, and someone has a commodity-hardware solution by now?)
But there’s also a whole lot to say about read/write conflicts and threading. Batch systems should be able to run 100% independently – and if they can, lots of easy optimizations exist. But as presented: they won’t. This is what I’d like to write about next (when I get some time!)
Although fragmentation is probably not the right word for what I was thinking – fragmentation is a fundamental problem, and I’m probably overthinking things a little here. But more likely it’s easier to see and understand what/how the data is being processed with a data-driven entity system. In fact, it turns out the classical definition of OO – all data and code in the same package – is almost the worst possible organisation as far as the cache is concerned. Sony did a paper on this: http://research.scee.net/files/presentations/gcapaustralia09/Pitfalls_of_Object_Oriented_Programming_GCAP_09.pdf

And it can get even worse with multi-threading. For example: two different threads writing to two different variables, but those variables happen to belong to the same object and therefore sit on the same line of cache – uh oh! Actually, something similar could happen when interleaving components. Which I’m guessing is what your next article will be about.
I think spatial locality is the correct term I was looking for. Components often have big jumps/gaps between entity ids, which I don’t think you can really fix with better data structures. I think the easier approach may be a smarter entity manager – something simple, like similar entities being allocated within a range of ids. For example: static entities getting 1/3 of the ids, dynamic entities 1/3, and the other third for random junk entities. One bonus of doing this is that you can take a guess at how big a component array will need to be by which group it belongs to, instead of having to allocate enough for the entire entity range.
This all sounds like we need to implement our own relational database in code, with all these high-performance features which already exist inside an RDBMS (MSSQL, MySQL, etc.).
“We iterate through the Position array, and at each line we now have enough info to fetch the related row from the Velocity array. If both arrays are small enough to fit inside the CPU’s L1 cache (or at least the L2), then we’ll be OK.”
At the risk of making an idiot of myself, I don’t think the above is right. I’m pretty sure the prefetcher has no idea how big an array is, it just fetches cache lines that it thinks you’ll need – the documentation for the PREFETCH instruction is an indicator of this. I wrote a test specifically for this a while ago when I first started learning about cache coherence. I iterated over and processed data in 2 arrays which were each as big as my L1 cache. I then did the same test but on 1 long array, and didn’t see a noticeable difference in performance.
I was referring to the theoretical “win” situation … and pointing out we’re not going to achieve it :).
Fetching is highly CPU/bus/arch/compiler dependent, so I expect different outcomes from tests on different hardware here. In general, though, you ought to be able to get both arrays to fit into one cache, and pull data from both at once. The main point is that when you pull line from array ONE, you need no additional info to pull the same line from array TWO (i.e. you need no other data structures in order to decide which index to use from array TWO).
…with other layouts, you need extra info, and then your chances of fitting all that into the cache too … approach zero.
(at least, I think that’s what I was saying)
“In general, though, you ought to be able to get both arrays to fit into one cache, and pull data from both at once.”
I might be missing what you’re saying here, but you don’t need to get both arrays in to the cache at once. If you’re accessing a[i] and b[i] in a sequential fashion, the prefetcher will notice what you’re doing, and fetch lines from both a and b as you go. Even if they *did* both fit in the cache at once, they won’t both be in the cache in their entirety until you’re nearing the end of the array. Unless of course, I’ve got it completely arse-backwards…
For context, this was in reference to having a big array for each component type, entity ID is the index in the array, and entities which don’t have component x attached to them, will just be a waste of memory in the array. You said a problem with this would be that it blows the cache if your system operates on component A and B, and both of the component arrays can’t fit in the cache at the same time.
[…] How to optimize an ECS for the CPU cache […]
Nice article – I’ve come up with a solution very similar to yours, with several minor differences.
The approach I’m using:
1. All components in separated array, no interleaved data.
2. There is an ‘entity id’ -> ‘component index’ mapping for each component array. (just like yours)
For some of your problems I have solutions – not fancy, but they worked for me:
* Re-allocation of arrays when Components are added/removed
I always over-allocate to minimize re-allocation; most of the time it’s just ‘length += 1’ for adding an item. For removing, I’m using the old ‘swap’ trick: as long as I have an ‘entity id’ -> ‘component offset’ mapping already, I can just swap the item with the last one in the array, ‘length -= 1’, and update the mapping to point to the new index/offset.
Of course this does not work if the components need to be ordered, but it covers most use cases.
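In code, the swap trick described above looks roughly like this (a sketch; the array names and the two mappings are illustrative):

```c
typedef struct { float x, y; } Position;  /* example component type */

/* Dense component array + entity<->index mappings. Removal is O(1):
   overwrite slot i with the last item, shrink, and fix the mapping. */
void remove_component(Position *comps, int *length,
                      int *entity_to_index, int *index_to_entity, int i)
{
    int last = --(*length);
    comps[i] = comps[last];             /* the ‘swap’ trick          */
    int moved = index_to_entity[last];  /* which Entity just moved?  */
    index_to_entity[i] = moved;
    entity_to_index[moved] = i;         /* mapping -> new index      */
}
```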
* Fragmentation
I assume you mean the ‘holes’ between component data. For un-ordered data, the swap trick keeps the array ‘compact’ at all times.
For ordered data, I’m using a bitmap (bit-array) to flag the ‘holes’, and skip them when looping. When the ‘holes’ reach a certain percentage, I’ll run an insertion-sort on the array to ‘compact’ it again; since the array is already almost-sorted, this is very fast.
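(And a sketch of the bitmap-of-holes idea for ordered data; names are illustrative:)

```c
#include <stdint.h>

#define MAX_ITEMS 20000
typedef struct { float x, y; } Position;

uint64_t hole_bits[(MAX_ITEMS + 63) / 64];  /* 1 bit per slot: 1 = hole */

static inline int is_hole(int i)
{
    return (int)((hole_bits[i >> 6] >> (i & 63)) & 1u);
}

void process_ordered(Position *comps, int length)
{
    for (int i = 0; i < length; i++) {
        if (is_hole(i))
            continue;        /* flagged hole: skip it, stay in order */
        comps[i].x += 1.0f;  /* stand-in for the real per-item work  */
    }
}
```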
* Multi-threading
Yes, data-parallelism applies here perfectly.
I have a job system (a thread pool and a dispatcher). As long as the component data has no internal dependency, I just split the data array evenly and create a job for each range, then throw the jobs to the dispatcher to utilize all CPU cores; a barrier is then used (or not, depending on the context) to sync the jobs when all of them are done. This is very similar to OpenMP.
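(A sketch of that even split; submit_job / wait_for_all_jobs stand in for the real job-system API, which is assumed here:)

```c
typedef void (*RangeJob)(int begin, int end, float dt);

void submit_job(RangeJob job, int begin, int end, float dt); /* assumed */
void wait_for_all_jobs(void);                                /* assumed */
void movement_range_job(int begin, int end, float dt);  /* per-range work */

void dispatch_movement(int count, int num_workers, float dt)
{
    int chunk = (count + num_workers - 1) / num_workers;  /* ceil divide */
    for (int w = 0; w < num_workers; w++) {
        int begin = w * chunk;
        int end   = begin + chunk < count ? begin + chunk : count;
        if (begin >= end) break;
        submit_job(movement_range_job, begin, end, dt);
    }
    wait_for_all_jobs();  /* the barrier (or not, depending on context) */
}
```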
@ccll
“length += 1” – so you do a full array-reallocation-with-copy (with all the overhead that implies) each time you add a component anywhere?
@adam
Sorry for not being clear enough: by ‘over-allocate’ I mean ‘pre-allocate’ a larger capacity than you currently need.
E.g. I have 1000 component items for now (length == 1000), but my array has a capacity of 1300 (capacity == 1300), so when adding a component item, I copy the value to comp[length], then ‘length += 1’ – just like what ‘std::vector’ does in C++.
Actually I’m using another approach in the real world: just set an upper limit (N) of entities in the system, and pre-allocate a buffer big enough to hold N items when loading the level. The N can be computed at runtime, so that you can make proper use of the available resources. This way I only allocate once, and don’t expand/shrink the data buffer in-game.
[…] a year and a half since I wrote about choosing Data Structures for high-performance Entity Systems in A-AAA games. I always had a followup post in mind, but never wrote it down (thanks largely to the demise of […]
How would Iteration 4 be implemented in C++ without using pointers? It seems to me that in that language you can only use Iteration 1, or an array of structs holding all of a single entity’s components.
Would it be fine if I did Iteration 1 but tightly packed, and then stored a vector of component indices on each entity?
I don’t really understand your question.
Obviously, you don’t need pointers.
Obviously, you would probably use pointers anyway.
So…?
What I was trying to ask is, how do I store components of different types on the same array/vector without using pointers, in C++?
That’s more of a question for StackOverflow. You have a bunch of options.
How did you get that 1-40 kB size for the mini-arrays? Wouldn’t it be 0-480kB per mini-array?
[…] type of info construction explained in blog : t-machine.org @ iteration […]
I just wanted to revisit our discussion on cache coherency from over a year ago now that I understand the subject a little better than I did before!
So with iteration 1 (i.e. array of A components, array of B components, entity ID as index), as far as I can see this won’t blow the cache as long as the arrays of A and B are accessed sequentially, and given the entity ID is the index I don’t see why we wouldn’t. We may waste some memory in our cache lines, as we may load neighbouring components in the same cache line which we don’t actually use (e.g. if that entity only has A or B but not both). Have I understood your point correctly?
Obviously the memory usage thing is pretty bad, so one might opt for dense arrays of A and B, and a third array which ties indices in A/B to a single entity. This approach, though, could indeed lead to cache misses.
Guess you don’t read these any more :P
More an issue of limited time, and the context-switch cost of trying to bring all the context of this post from a long time ago back into my head, sorry.
I’m working on some more detailed posts on the topic, and more likely to revisit this after some are published.
Briefly: “waste some memory in cache” is usually enough, with non-trivial games, to effectively blow the cache. When you have hundreds of components … arrays in different parts of memory …
No worries, you have many posts, and more comments, not sure why I got so impatient, I blame the caffeine :P
Fair enough – I guess one man’s cache-inefficiency is another man’s blowing-the-cache.
Anywho, I think the statement “If both arrays are small enough to fit inside the CPU’s L1 cache (or at least the L2), then we’ll be OK” sounds like a misunderstanding of how caches work. If you’re accessing the arrays in a linear and predictable fashion, there is no performance benefit to fitting them in the cache in their entirety. It’s only if you’re planning to jump around randomly that it would matter.
Yep, not sure what I was saying there … maybe: it was only a starter article, so I was trying to plough through a lot of ideas quickly, and convey the kind of thinking processes devs need to go through with these questions, and let them work the rest out themselves…