How 3D Game Rendering Works, A Deeper Dive: Rasterization and Ray Tracing
In this second part of our deeper look at 3D game rendering, we’ll be focusing on what happens to the 3D world after all of the vertex processing has finished. We’ll need to dust off our math textbooks again, grapple with the geometry of frustums, and ponder the puzzle of perspectives. We’ll also take a quick dive into the physics of ray tracing, lighting and materials. Excellent!
The main topic of this article is an important stage in rendering, where a three dimensional world of points, lines, and triangles becomes a two dimensional grid of colored blocks. This is very much something that just ‘happens’, as the processes involved in the 3D-to-2D change occur unseen, unlike with our previous article where we could immediately see the effects of vertex shaders and tessellation. If you’re not ready for all of this, don’t worry: you can get started with our 3D Game Rendering 101. But once you’re set, read on for our next look at the world of 3D graphics.
Getting ready for 2 dimensions
The vast majority of you will be looking at this website on a totally flat monitor or smartphone screen; even if you’re cool and down with the kids, and have a fancy curved monitor, the images it’s displaying consist of a flat grid of colored pixels. And yet, when you’re playing the latest Call of Mario: Deathduty Battleyard, the images appear to be 3 dimensional. Objects move in and out of the environment, becoming larger or smaller, as they move to and from the camera.
Using Bethesda’s Fallout 4 from 2015 as an example, we can easily see how the vertices have been processed to create the sense of depth and distance, especially if we run it in wireframe mode (above).
If you pick any 3D game of today, or the past 2 decades, almost every single one of them will perform the same sequence of events to convert the 3D world of the vertices into the 2D array of pixels. The process that makes this change is often called rasterization, but that’s just one of the many steps in the whole shebang.
We’ll need to break down some of the various stages and examine the techniques and math employed, and for reference, we’ll use the sequence as used by Direct3D to investigate what’s going on. The image below sets out what gets done to each vertex in the world:
We saw what was done in the world space stage in our Part 1 article: here the vertices are transformed and colored in, using numerous matrix calculations. We’ll skip over the next section because all that happens for camera space is that the transformed vertices are adjusted after they’ve been moved, to make the camera the reference point.
The next steps are too important to skip, though, because they are absolutely critical to making the change from 3D to 2D — done right, and our brains will look at a flat screen but ‘see’ a scene that has depth and scale — done wrong, and things will look very odd!
It’s all a matter of perspective
The first step in this sequence involves defining the field of view, as seen by the camera. This is done by first setting the angles for the horizontal and vertical fields of view. The first one can often be changed in games, as humans have better side-to-side peripheral vision compared to up-and-down.
We can get a sense of this from this image that shows the field of human vision:
The two field of view angles (fov, for short) define the shape of a frustum – a 3D square-based pyramid, that emanates from the camera. The first angle is for the vertical fov, the second being the horizontal one; we’ll use the symbols α and β to denote them. Now we don’t quite see the world in this way, but it’s computationally much easier to work out a frustum, rather than trying to generate a realistic view volume.
Two other settings need to be defined as well — the position of the near (or front) and far (back) clipping planes. The former slices off the top of the pyramid but essentially determines how close to the position of the camera that anything gets drawn; the latter does the same but defines how far away from the camera that any primitives are going to be rendered.
The size and position of the near clipping plane is important, as this becomes what is called the viewport. This is essentially what you see on the monitor, i.e. the rendered frame, and in most graphics APIs, the viewport is ‘drawn’ from its top left-hand corner. In the image below, the point (a1, b2) would be the origin of the plane, and the width and the height of the plane are measured from here.
The aspect ratio of the viewport is not only crucial to how the rendered world will appear, it also has to match the aspect ratio of the monitor. For many years, this was always 4:3 (or 1.3333… as a decimal value). Today though, many of us game with ratios such as 16:9 or 21:9, aka widescreen and ultra widescreen.
The coordinates of each vertex in the camera space need to be transformed so that they all fit onto the near clipping plane, as shown below:
The transformation is done by use of another matrix — this particular one is called the perspective projection matrix. In our example below, we’re using the field of view angles and the positions of the clipping planes to do the transformation; we could use the dimensions of the viewport instead though.
The vertex position vector is multiplied by this matrix, giving a new set of transformed coordinates.
Et voila! Now we have all our vertices written in such a way that the original world now appears as a forced 3D perspective, so primitives near to the front clipping plane appear bigger than those nearer the far plane.
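As a rough illustration of the multiplication just described, here is a minimal Python sketch. It assumes the left-handed, row-vector layout commonly used with Direct3D, where depth ends up in the range 0 to 1; the article’s own matrix is not reproduced here, so the exact layout, the function names, and the example angles and plane distances are all assumptions made for the sake of the example.

```python
import math

def perspective_matrix(fov_v_deg, fov_h_deg, z_near, z_far):
    """Build a 4x4 perspective matrix for row vectors (left-handed, depth in [0, 1])."""
    y_scale = 1.0 / math.tan(math.radians(fov_v_deg) / 2.0)  # from the vertical fov (alpha)
    x_scale = 1.0 / math.tan(math.radians(fov_h_deg) / 2.0)  # from the horizontal fov (beta)
    q = z_far / (z_far - z_near)
    return [
        [x_scale, 0.0, 0.0, 0.0],
        [0.0, y_scale, 0.0, 0.0],
        [0.0, 0.0, q, 1.0],
        [0.0, 0.0, -z_near * q, 0.0],
    ]

def transform(vertex, matrix):
    """Multiply a 4-component row vector by a 4x4 matrix."""
    return [sum(vertex[k] * matrix[k][col] for k in range(4)) for col in range(4)]

# Hypothetical camera settings, chosen only for illustration.
projection = perspective_matrix(fov_v_deg=60.0, fov_h_deg=90.0, z_near=1.0, z_far=100.0)
on_near_plane = transform([0.0, 0.0, 1.0, 1.0], projection)
on_far_plane = transform([0.0, 0.0, 100.0, 1.0], projection)
```

Note that the fourth component of each result holds the original camera-space depth. Dividing by it later is what makes distant primitives shrink.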
Although the size of the viewport and the field of view angles are linked, they can be processed separately — in other words, you could have the frustum set to give you a near clipping plane that’s different in size and aspect ratio to the viewport. For this to happen, an additional step is required in the chain, where the vertices in the near clipping plane need to be transformed again, to account for the difference.
However, this can lead to distortion in the viewed perspective. Using Bethesda’s 2011 game Skyrim, we can see how adjusting the horizontal field of view angle β, while retaining the same viewport aspect ratio, has a significant effect on the scene:
In this first image, we’ve set β = 75° and the scene appears perfectly normal. Now let’s try it with β = 120°:
Two differences are immediately obvious — first of all, we can now see much more to the sides of our ‘vision’ and secondly, objects now seem much further away (the trees especially). However, the visual effect of the water surface doesn’t look right now, and this is because the process wasn’t designed for this field of view.
Now let’s assume our character has eyes like an alien and set β = 180°!
This field of view does give us an almost panoramic scene, but at the cost of a serious amount of distortion to the objects rendered at the edges of the view. Again, this is because the game designers didn’t plan and create the game’s assets and visual effects for this view angle (the default value is around 70°).
It might look as if the camera has moved in the above images, but it hasn’t — all that has happened is that the shape of the frustum was altered, which in turn reshaped the dimensions of the near clipping plane. In each image, the viewport aspect ratio has remained the same, so a scaling matrix was applied to the vertices to make everything fit again.
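To put some numbers on how the frustum changes shape, the sketch below works out the half-width of the near clipping plane for a given horizontal angle, plus the vertical angle implied by a given aspect ratio. It relies on the standard relation tan(β/2) = aspect × tan(α/2) for a symmetric frustum; the function names and the near plane distance of 1 are assumptions for illustration only.

```python
import math

def near_plane_half_width(fov_h_deg, z_near):
    """Half the width of the near clipping plane for a horizontal fov (beta)."""
    return z_near * math.tan(math.radians(fov_h_deg) / 2.0)

def vertical_fov_deg(fov_h_deg, aspect_ratio):
    """Vertical fov (alpha) implied by a horizontal fov and a width/height ratio."""
    half_h = math.radians(fov_h_deg) / 2.0
    return math.degrees(2.0 * math.atan(math.tan(half_h) / aspect_ratio))

# The near plane widens slowly at first, then explodes as beta nears 180 degrees.
for beta in (75.0, 120.0, 179.0):
    print(beta, round(near_plane_half_width(beta, z_near=1.0), 3))
```

The half-width grows with the tangent of the half angle, so the last few degrees before 180° stretch the plane enormously. That is consistent with the heavy distortion seen at the edges of the widest view.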
So, are you in or out?
Once everything has been correctly transformed in the projection stage, we then move on to what is called clip space. Although this is done after projection, it’s easier to visualize what’s going on if we do it before:
In our above diagram, we can see that the rubber ducky, one of the bats, and some of the trees will have triangles inside the frustum; however, the other bat, the furthest tree, and the panda are all outside the frustum. Although the vertices that make up these objects have already been processed, they’re not going to be seen in the viewport. That means they get clipped.
In frustum clipping, any primitives outside the frustum are removed entirely and those that lie on any of the boundaries are reshaped into new primitives. Clipping isn’t really much of a performance boost, as all the non-visible vertices have been run through vertex shaders, etc. up to this point. The clipping stage itself can also be skipped, if required, but this isn’t supported by all APIs (for example, standard OpenGL won’t let you skip it, although an API extension makes it possible).
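The decision of whether a triangle is kept, dropped, or needs reshaping can be sketched with per-vertex ‘outcodes’. The example assumes Direct3D-style clip space, where a vertex is inside when -w ≤ x ≤ w, -w ≤ y ≤ w and 0 ≤ z ≤ w. The function names and bit layout are made up for this illustration, and the actual reshaping of straddling triangles is left out.

```python
def outcode(vertex):
    """Return a bit mask with one bit set for each clip plane the vertex is outside of."""
    x, y, z, w = vertex
    code = 0
    if x < -w: code |= 1    # left
    if x > w:  code |= 2    # right
    if y < -w: code |= 4    # bottom
    if y > w:  code |= 8    # top
    if z < 0:  code |= 16   # near
    if z > w:  code |= 32   # far
    return code

def classify_triangle(triangle):
    """Classify a clip-space triangle as 'inside', 'outside' or 'clip'."""
    a, b, c = (outcode(v) for v in triangle)
    if (a | b | c) == 0:
        return "inside"     # every vertex is within the frustum
    if (a & b & c) != 0:
        return "outside"    # all vertices are beyond the same plane
    return "clip"           # may cross a boundary, so it needs reshaping

inside_tri = [(0.0, 0.0, 0.5, 1.0), (0.5, 0.0, 0.5, 1.0), (0.0, 0.5, 0.5, 1.0)]
outside_tri = [(2.0, 0.0, 0.5, 1.0), (3.0, 0.0, 0.5, 1.0), (2.0, 1.0, 0.5, 1.0)]
straddling_tri = [(0.0, 0.0, 0.5, 1.0), (2.0, 0.0, 0.5, 1.0), (0.0, 0.5, 0.5, 1.0)]
```

Only triangles whose vertices all lie beyond the same plane are rejected outright. A large triangle can have every corner outside the frustum and still cross it, which is why the third case exists.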
It’s worth noting that the position of the far clipping plane isn’t necessarily the same as draw distance in games, as the latter is controlled by the game engine itself. Something else that the engine will do is frustum culling. This is where code is run to determine if an object is going to be within the frustum and/or affect anything that is going to be visible; if the answer is no, then that object isn’t sent for rendering. This isn’t the same as frustum clipping: although clipping drops the primitives outside the frustum, they’ve still been run through the vertex processing stage. With culling, they’re not processed at all, saving quite a lot of performance.
Now that we’ve done all our transformation and clipping, it would seem that the vertices are finally ready for the next stage in the whole rendering sequence. Except, they’re not. This is because all of the math that’s carried out in the vertex processing and world-to-clip space operations has to be done with a homogeneous coordinate system (i.e. each vertex has 4 components, rather than 3). However, the viewport is entirely 2D, and so the API expects the vertex information to just have values for x, y (the depth value z is retained though).
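A common way for an engine to do this kind of culling is to wrap each object in a bounding sphere and test it against the six planes of the frustum. The sketch below assumes a camera-space frustum with 90° horizontal and vertical fields of view, which keeps the plane equations simple. The plane representation and function names are assumptions, not something any particular engine is known to use.

```python
import math

def make_frustum_planes(z_near, z_far):
    """Planes as (inward unit normal, offset) for a 90-degree symmetric frustum."""
    s = 1.0 / math.sqrt(2.0)
    return [
        ((0.0, 0.0, 1.0), -z_near),  # near
        ((0.0, 0.0, -1.0), z_far),   # far
        ((s, 0.0, s), 0.0),          # left
        ((-s, 0.0, s), 0.0),         # right
        ((0.0, s, s), 0.0),          # bottom
        ((0.0, -s, s), 0.0),         # top
    ]

def sphere_visible(center, radius, planes):
    """Return False only when the sphere lies completely behind one of the planes."""
    cx, cy, cz = center
    for (nx, ny, nz), offset in planes:
        signed_distance = nx * cx + ny * cy + nz * cz + offset
        if signed_distance < -radius:
            return False
    return True

PLANES = make_frustum_planes(z_near=1.0, z_far=100.0)
```

An object that fails this test never reaches the vertex shaders at all, which is where the saving comes from.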
To get rid of the 4th component, a perspective division is done where each component is divided by the w value. This adjustment locks the range of values x and y can take to [-1,1] and z to the range of [0,1] — these are called normalized device coordinates (NDCs for short).
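The division, and the later mapping from NDCs to pixel positions, can be sketched in a few lines. The viewport mapping assumes an origin at the top left-hand corner with y increasing downwards, as described earlier; the function names and the 1920 x 1080 viewport are just examples.

```python
def perspective_divide(clip_vertex):
    """Turn a 4-component clip-space vertex into normalized device coordinates."""
    x, y, z, w = clip_vertex
    return (x / w, y / w, z / w)

def ndc_to_pixel(ndc, width, height):
    """Map NDC x, y in [-1, 1] onto a viewport whose origin is the top-left corner."""
    x, y, z = ndc
    pixel_x = (x + 1.0) * 0.5 * width
    pixel_y = (1.0 - y) * 0.5 * height  # flipped, because screen y runs downwards
    return (pixel_x, pixel_y, z)

center = ndc_to_pixel(perspective_divide((0.0, 0.0, 5.0, 10.0)), 1920, 1080)
top_left = ndc_to_pixel(perspective_divide((-10.0, 10.0, 0.0, 10.0)), 1920, 1080)
```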
If you want more information about what we’ve just covered, and you’re happy to dive into a lot more math, then have a read of Song Ho Ahn’s excellent tutorial on the subject. Now let’s turn those vertices into pixels!
Master that raster
As with the transformations, we’ll stick to looking at how Direct3D sets the rules and processes for making the viewport into a grid of pixels. This grid is like a spreadsheet, with rows and columns, where each cell contains multiple data values (such as color, depth values, texture coordinates, etc). Typically, this grid is called a raster and the process of generating it is known as rasterization. In our 3D rendering 101 article, we took a very simplified view of the procedure:
The above image gives the impression that the primitives are just chopped up into small blocks, but there’s far more to it than that. The very first step is to figure out whether or not a primitive actually faces the camera. In an image earlier in this article, the one showing the frustum, the primitives making up the back of the grey rabbit, for example, wouldn’t be visible. So although they would be present in the viewport, there’s no need to render them.
We can get a rough sense of what this looks like with the following diagram. The cube has gone through the various transforms to put the 3D model into 2D screen space and from the camera’s view, several of the cube’s faces aren’t visible. If we assume that none of the surfaces are transparent, then several of these primitives can be ignored.
In Direct3D, this can be achieved by telling the system what the render state is going to be, and this instruction will tell it to remove (aka cull) front facing or back facing sides for each primitive (or to not cull at all, for example in wireframe mode). But how does it know what is front or back facing? When we looked at the math in vertex processing, we saw that triangles (or rather, their vertices) have normal vectors which tell the system which way they’re facing. With that information, a simple check can be done, and if the primitive fails the check, then it’s dropped from the rendering chain.
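One simple form of this check works on the winding order of the vertices in screen space rather than on an explicit normal vector, so treat the sketch below as a swapped-in equivalent and not the exact test described above. It assumes screen coordinates with y pointing down and clockwise triangles counting as front facing; the function names are hypothetical.

```python
def signed_area_2x(p0, p1, p2):
    """Twice the signed area of a 2D triangle; positive means clockwise when y points down."""
    return (p1[0] - p0[0]) * (p2[1] - p0[1]) - (p2[0] - p0[0]) * (p1[1] - p0[1])

def is_front_facing(p0, p1, p2, front_is_clockwise=True):
    """Decide facing from winding order; zero-area triangles are never front facing."""
    area = signed_area_2x(p0, p1, p2)
    return area > 0 if front_is_clockwise else area < 0

def cull_back_faces(triangles):
    """Keep only the triangles whose front side faces the camera."""
    return [tri for tri in triangles if is_front_facing(*tri)]

front = ((0, 0), (10, 0), (0, 10))  # clockwise on screen
back = ((0, 0), (0, 10), (10, 0))   # same triangle, opposite winding
```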
Next, it’s time to start applying the pixel grid. Again, this is surprisingly complex, because the system has to work out if a pixel fits inside a primitive — either completely, partially, or not at all. To do this, a process called coverage testing is done. The image below shows how triangles are rasterized in Direct3D 11:
The rule is quite simple: a pixel is deemed to be inside a triangle if its center lies inside the triangle, and centers that land exactly on an edge are settled by what Microsoft calls the ‘top-left’ rule. A ‘top’ edge is one that is exactly horizontal and sits above the other edges; a ‘left’ edge is one that isn’t horizontal and lies on the left side of the triangle. A pixel center sitting on a top or left edge counts as inside, so two triangles that share an edge never both claim the same pixel. There are additional rules for non-triangle primitives, i.e. simple lines and points, and the rules gain extra conditions if multisampling is employed.
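A coverage test along these lines is often written with ‘edge functions’, one per side of the triangle. The sketch below is a minimal version, assuming clockwise triangles in screen space with y pointing down and pixel centers at half-integer positions; it resolves ties on an edge in favor of top and left edges. The names are made up for this example, and it ignores the fixed-point arithmetic a real GPU would use.

```python
import math

def edge_function(a, b, p):
    """Positive when p is on the inner side of edge a->b of a clockwise triangle."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def is_top_or_left(a, b):
    """Top edge: horizontal and running left to right. Left edge: running upwards."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    return (dy == 0 and dx > 0) or dy < 0

def covers(triangle, point):
    """Test one sample point against all three edges, with top-left tie breaking."""
    for i in range(3):
        a, b = triangle[i], triangle[(i + 1) % 3]
        value = edge_function(a, b, point)
        if value < 0 or (value == 0 and not is_top_or_left(a, b)):
            return False
    return True

def rasterize(triangle):
    """Return the set of (x, y) pixels whose centers the triangle covers."""
    xs = [v[0] for v in triangle]
    ys = [v[1] for v in triangle]
    covered = set()
    for y in range(math.floor(min(ys)), math.ceil(max(ys))):
        for x in range(math.floor(min(xs)), math.ceil(max(xs))):
            if covers(triangle, (x + 0.5, y + 0.5)):
                covered.add((x, y))
    return covered

# Two triangles that share a diagonal and together form a 4 x 4 pixel square.
upper = rasterize([(0, 0), (4, 0), (0, 4)])
lower = rasterize([(4, 0), (4, 4), (0, 4)])
```

The four pixel centers that sit exactly on the shared diagonal go to the second triangle only, because that edge is a left edge for it and not for the first.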
If we look carefully at the image from Microsoft’s documentation, we can see that the shapes created by the pixels don’t look very much like the original primitives. This is because the pixels are too big to create a realistic triangle — the raster contains insufficient data about the original objects, leading to an issue called aliasing.
Let’s use UL Benchmarks’ 3DMark03 to see aliasing in action:
In the first image, the raster was set to a very low 720 by 480 pixels in size. Aliasing can be clearly seen on the handrail and the shadow cast by the gun held by the top soldier. Compare this to what you get with a raster that has 24 times more pixels:
Here we can see that the aliasing on the handrail and shadow has completely gone. A bigger raster would seem to be the way to go every time, but the dimensions of the grid have to be supported by the monitor that the frame will be displayed on, and given that those pixels have to be processed after the rasterization process, there is going to be an obvious performance penalty.
This is where multisampling can help and this is how it functions in Direct3D:
Rather than just checking if a pixel center meets the rasterization rules, multiple locations (called sub-pixel samples or subsamples) within each pixel are tested instead, and if any of those are okay, then that whole pixel forms part of the shape. This might seem to have no benefit and possibly even make the aliasing worse, but when multisampling is used, the information about which subsamples are covered by the primitive, and the results of the pixel processing, are stored in a buffer in memory.
This buffer is then used to blend the subsample and pixel data in such a way that the edges of the primitive are less blocky. We’ll look at the whole aliasing situation again in a later article, but for now, this is what multisampling can do when used on a raster with too few pixels:
We can see that the amount of aliasing on the edges of the various shapes has been greatly reduced. A bigger raster is definitely better, but the performance hit can favor the use of multisampling instead.
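The idea of testing several subsamples and then blending can be sketched as follows. The four sample positions form a rotated-grid pattern chosen for illustration, and the blend is a plain average weighted by how many subsamples are covered. Both are assumptions, as are the function names; a real GPU stores per-sample data in a buffer and resolves it later.

```python
# Four subsample positions inside a pixel (an assumed rotated-grid pattern).
SAMPLE_OFFSETS = [(0.375, 0.125), (0.875, 0.375), (0.125, 0.625), (0.625, 0.875)]

def inside(triangle, point):
    """Plain inside test for a clockwise triangle in y-down screen space."""
    for i in range(3):
        a, b = triangle[i], triangle[(i + 1) % 3]
        if (b[0] - a[0]) * (point[1] - a[1]) - (b[1] - a[1]) * (point[0] - a[0]) < 0:
            return False
    return True

def coverage_mask(triangle, pixel_x, pixel_y):
    """One boolean per subsample, saying whether the triangle covers it."""
    return [inside(triangle, (pixel_x + ox, pixel_y + oy)) for ox, oy in SAMPLE_OFFSETS]

def resolve(mask, triangle_color, background_color):
    """Blend the triangle color with the background by the covered fraction."""
    fraction = sum(mask) / len(mask)
    return tuple(fraction * c + (1.0 - fraction) * b
                 for c, b in zip(triangle_color, background_color))

TRIANGLE = [(0, 0), (8, 0), (0, 8)]
edge_pixel = resolve(coverage_mask(TRIANGLE, 3, 4), (1.0, 0.0, 0.0), (0.0, 0.0, 0.0))
```

A pixel on the diagonal edge ends up with half the triangle’s color instead of all or nothing, which is what softens the staircase effect.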
Something else that can get done in the rasterization process is occlusion testing. This has to be done because the viewport will be full of primitives that will be overlapping (occluded). For example, in the above image, the front facing triangles that make up the soldier in the foreground overlap the same triangles in the other soldier. As well as checking if a primitive covers a pixel, the relative depths can be compared, too, and if one is behind the other, then it could be skipped from the rest of the rendering process.
However, if the near primitive is transparent, then the further one would still be visible, even though it has failed the occlusion check. This is why nearly all 3D engines do occlusion checks before sending anything to the GPU and, rather than relying on the rasterizer alone, create something called a z-buffer as part of the rendering process. This is where the frame is created as normal but instead of storing the final pixel colors in memory, the GPU stores just the depth values. This can then be used in shaders to check visibility with more control and precision over aspects involving object overlapping.
In the above image, the darker the color of the pixel, the closer that object is to the camera. The frame gets rendered once, to make the z buffer, then is rendered again but this time when the pixels get processed, a shader is run to check them against the values in the z buffer. If it isn’t visible, then that pixel color isn’t put into the final frame buffer.
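The two passes can be sketched with plain lists standing in for the buffers. Each ‘fragment’ here is a hypothetical tuple of pixel position, depth and color, with depth running from 0 (near) to 1 (far); the names and the data layout are assumptions for illustration.

```python
def depth_prepass(fragments, width, height):
    """First pass: keep only the nearest depth value seen at each pixel."""
    z_buffer = [[1.0] * width for _ in range(height)]
    for x, y, depth, _color in fragments:
        if depth < z_buffer[y][x]:
            z_buffer[y][x] = depth
    return z_buffer

def color_pass(fragments, z_buffer, width, height, clear_color=None):
    """Second pass: write a color only if the fragment matches the stored depth."""
    frame = [[clear_color] * width for _ in range(height)]
    for x, y, depth, color in fragments:
        if depth <= z_buffer[y][x]:
            frame[y][x] = color
    return frame

fragments = [
    (0, 0, 0.8, "far soldier"),
    (0, 0, 0.2, "near soldier"),
    (1, 0, 0.5, "handrail"),
]
z_buffer = depth_prepass(fragments, width=2, height=1)
frame = color_pass(fragments, z_buffer, width=2, height=1)
```

The order in which the fragments arrive no longer matters: the far soldier’s fragment is rejected in the second pass because something nearer already owns that pixel.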
For now, the main final step is to do vertex attribute interpolation — in our initial simplified diagram, the primitive was a complete triangle, but don’t forget that the viewport is just filled with the corners of the shapes, not the shape itself. So the system has to work out what the color, depth, and texture of the primitive is like in between the vertices, and this is called interpolation. As you’d imagine this is another calculation, and not a straightforward one either.
Despite the fact that the rasterized screen is 2D, the structures within it are representing a forced 3D perspective. If the lines were truly 2 dimensional, then we could use a simple linear equation to work out the various colors, etc as we go from one vertex to another. But because of the 3D aspect to the scene, the interpolation needs to account for the perspective — have a read of Simon Yeung’s superb blog on the subject to get more information on the process.
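The usual way to account for perspective is to interpolate the attribute divided by w, and 1/w itself, linearly across the screen and then divide one by the other. The short sketch below does this along a single line between two vertices; the function names and the example values are made up for illustration.

```python
def interpolate_linear(a0, a1, t):
    """Plain linear blend, which is only correct for truly 2D data."""
    return (1.0 - t) * a0 + t * a1

def interpolate_perspective(a0, w0, a1, w1, t):
    """Perspective-correct blend of an attribute between two projected vertices."""
    inverse_w = interpolate_linear(1.0 / w0, 1.0 / w1, t)
    attribute_over_w = interpolate_linear(a0 / w0, a1 / w1, t)
    return attribute_over_w / inverse_w

# A texture coordinate running from 0 (near vertex, w = 1) to 1 (far vertex, w = 3).
halfway_naive = interpolate_linear(0.0, 1.0, 0.5)
halfway_correct = interpolate_perspective(0.0, 1.0, 1.0, 3.0, 0.5)
```

Halfway across the screen, the naive version gives 0.5 but the perspective-correct one gives 0.25. The far end of the line is squashed into fewer pixels, so the screen midpoint is still close to the near vertex.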
So there we go — that’s how a 3D world of vertices becomes a 2D grid of colored blocks. We’re not quite done, though.
It’s all back to front (except when it’s not)
Before we finish off our look at rasterization, we need to say something about the order of the rendering sequence. We’re not talking about where, for example, tessellation comes in the sequence; instead, we’re referring to the order that the primitives get processed. Objects are usually processed in the order that they appear in the index buffer (the block of memory that tells the system how the vertices are grouped together) and this can have a significant impact on how transparent objects and effects are handled.
The reason for this is down to the fact that the primitives are handled one at a time and if you render the ones in the front first, any of those behind them won’t be visible (this is where occlusion culling really comes into play) and can get dropped from the process (helping the performance) — this is generally called ‘front-to-back’ rendering and requires the index buffer to be ordered in this way.
However, if some of those primitives right in front of the camera are transparent, then front-to-back rendering would result in the objects behind the transparent one being missed out. One solution is to render everything back-to-front instead, with transparent primitives and effects being done last.
So all modern games do back-to-front rendering, yes? Not if it can be helped. Don’t forget that rendering every single primitive is going to have a much larger performance cost compared to rendering just those that can be seen. There are other ways of handling transparent objects, but generally speaking, there’s no one-size-fits-all solution and every situation needs to be handled uniquely.
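One commonly used compromise, shown here purely as an illustration and not as the approach any specific game takes, is to draw opaque objects front-to-back and then transparent ones back-to-front. The object records and their fields are hypothetical.

```python
def build_draw_order(objects):
    """Opaque objects nearest first, then transparent objects farthest first."""
    opaque = sorted((o for o in objects if not o["transparent"]),
                    key=lambda o: o["depth"])
    transparent = sorted((o for o in objects if o["transparent"]),
                         key=lambda o: o["depth"], reverse=True)
    return [o["name"] for o in opaque + transparent]

scene = [
    {"name": "window", "depth": 2.0, "transparent": True},
    {"name": "mountain", "depth": 90.0, "transparent": False},
    {"name": "smoke", "depth": 15.0, "transparent": True},
    {"name": "crate", "depth": 5.0, "transparent": False},
]
order = build_draw_order(scene)
```

Drawing the opaque objects nearest first lets the depth test reject hidden pixels early, while the transparent ones are layered on top from the back so that each can blend with whatever is behind it.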
This essentially summarises the pros and cons to rasterization — on modern hardware, it’s really fast and effective, but it’s still an approximation of what we see. In the real world, every object will absorb, reflect and maybe refract light, and all of this has an effect on the viewed scene. By splitting the world into primitives and then only rendering some of them, we get a fast but rough result.
If only there was another way…
There is another way: Ray tracing
Almost five decades ago, a computer scientist named Arthur Appel worked out a system for rendering images on a computer, whereby a single ray of light was cast in a straight line from the camera, until it hit an object. From there, the properties of the material (its color, reflectiveness, etc) would then modify the intensity of the light ray. Each pixel in the rendered image would have one ray cast and an algorithm would be performed, going through a sequence of math to work out the color of the pixel. Appel’s process became known as ray casting.
About 10 years later, another scientist called John Whitted developed a mathematical algorithm that did the same as Appel’s approach, but when the ray hit an object, it would then generate additional rays, which would fire off in various directions depending on the object’s material. Because this system would generate new rays for each object interaction, the algorithm was recursive in nature and so was computationally a lot more difficult; however, it had a significant advantage over Appel’s method as it could properly account for reflections, refraction, and shadowing. The name for this procedure was ray tracing (strictly speaking, it’s backwards ray tracing, as we follow the ray from the camera and not from the objects) and it has been the holy grail for computer graphics and movies ever since.
In the above image, we can get a sense of how Whitted’s algorithm works. One ray is cast from the camera, for each pixel in the frame, and travels until it reaches a surface. This particular surface is translucent, so light will reflect off and refract through it. Secondary rays are generated for both cases, and these travel off until they interact with a surface. Additional secondary rays, to account for the color of the light sources and the shadows they make, are also generated.
The recursive part of the process is that secondary rays can be generated every time a newly cast ray intersects with a surface. This could easily get out of control, so the number of secondary rays generated is always limited. Once a ray path is complete, its color at each terminal point is calculated, based on the material properties of that surface. This value is then passed down the ray to the preceding one, adjusting the color for that surface, and so on, until we reach the effective starting point of the primary ray: the pixel in the frame.
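The recursion and its limit can be sketched with a heavily cut-down tracer. This version only handles spheres and mirror-like reflection, with no refraction, lights or shadows, and it assumes ray directions are already unit length. The scene format, the `reflect` factor and `MAX_DEPTH` are all assumptions made for the example.

```python
import math

MAX_DEPTH = 3                  # cap on how many times a ray may bounce
BACKGROUND = (0.0, 0.0, 0.0)

def sub(a, b): return tuple(x - y for x, y in zip(a, b))
def add(a, b): return tuple(x + y for x, y in zip(a, b))
def scale(a, s): return tuple(x * s for x in a)
def dot(a, b): return sum(x * y for x, y in zip(a, b))

def hit_sphere(origin, direction, center, radius):
    """Distance to the nearest hit in front of the ray origin, or None."""
    oc = sub(origin, center)
    b = dot(oc, direction)
    discriminant = b * b - (dot(oc, oc) - radius * radius)
    if discriminant < 0:
        return None
    root = math.sqrt(discriminant)
    for t in (-b - root, -b + root):
        if t > 1e-4:           # ignore hits at or behind the starting point
            return t
    return None

def trace(origin, direction, spheres, depth=0):
    """Return the color seen along a ray, spawning reflection rays recursively."""
    nearest = None
    for sphere in spheres:
        t = hit_sphere(origin, direction, sphere["center"], sphere["radius"])
        if t is not None and (nearest is None or t < nearest[0]):
            nearest = (t, sphere)
    if nearest is None:
        return BACKGROUND
    t, sphere = nearest
    if depth >= MAX_DEPTH or sphere["reflect"] == 0.0:
        return sphere["color"]
    point = add(origin, scale(direction, t))
    normal = scale(sub(point, sphere["center"]), 1.0 / sphere["radius"])
    reflected = sub(direction, scale(normal, 2.0 * dot(direction, normal)))
    bounced = trace(point, reflected, spheres, depth + 1)
    k = sphere["reflect"]
    return tuple((1.0 - k) * c + k * b for c, b in zip(sphere["color"], bounced))

SCENE = [
    {"center": (0.0, 0.0, 5.0), "radius": 1.0, "color": (1.0, 1.0, 1.0), "reflect": 1.0},
    {"center": (0.0, 0.0, -5.0), "radius": 1.0, "color": (1.0, 0.0, 0.0), "reflect": 0.0},
]
seen = trace((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), SCENE)
```

Here the primary ray hits a perfect mirror, and the secondary ray it spawns travels back past the camera and hits a red sphere. The color is then passed back down the chain, so the pixel ends up red.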
This can be hugely complex and even simple scenarios can generate a barrage of calculations to run through. There are, fortunately, some things that can be done to help. One would be to use hardware that is specifically designed to accelerate these particular math operations, just like there is for doing the matrix math in vertex processing (more on this in a moment). Another critical one is to try and speed up the process of working out which object a ray hits and exactly where on the object’s surface the intersection occurs. If the object is made from a lot of triangles, this can be surprisingly hard to do:
Rather than test every single triangle, in every single object, a list of bounding volumes (BV) is generated before ray tracing. These are nothing more than cuboids that surround the object in question, with successively smaller ones generated for the various structures within the object.
For example, the first BV would be for the whole rabbit. The next couple would cover its head, legs, torso, tail, etc; each one of these would then be another collection of volumes for the smaller structures in the head, etc, with the final level of volumes containing a small number of triangles to test. All of these volumes are then arranged in an ordered list (called a BV hierarchy or BVH for short) such that the system checks a relatively small number of BVs each time:
Although the use of a BVH doesn’t technically speed up the actual ray tracing, generating the hierarchy and running the subsequent search algorithm are generally much faster than having to check whether one ray intersects with one out of millions of triangles in a 3D world.
Today, programs such as Blender and POV-Ray utilize ray tracing with additional algorithms (such as photon tracing and radiosity) to generate highly realistic images:
The obvious question to ask is if ray tracing is so good, why don’t we use it everywhere? The answer lies in two areas: first of all, even simple ray tracing generates millions of rays that have to be calculated over and over. The system starts with just one ray per screen pixel, so at a resolution of just 800 x 600, that generates 480,000 primary rays and then each one generates multiple secondary rays. This is seriously hard work for even today’s desktop PCs. The second issue is that basic ray tracing isn’t actually very realistic and that a whole host of extra, very complex equations need to be included to get it right.
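The search through such a hierarchy can be sketched as a recursive walk that only descends into boxes the ray actually passes through. The box test below is the well-known ‘slab’ method for axis-aligned boxes. The dictionary-based node layout and the triangle labels are hypothetical, and the final ray-triangle tests are left out.

```python
def ray_hits_box(origin, direction, box_min, box_max):
    """Slab test: does the ray pass through an axis-aligned box in front of its origin?"""
    t_near, t_far = 0.0, float("inf")
    for axis in range(3):
        o, d = origin[axis], direction[axis]
        low, high = box_min[axis], box_max[axis]
        if d == 0.0:
            if o < low or o > high:
                return False   # parallel to this slab and outside it
            continue
        t0, t1 = (low - o) / d, (high - o) / d
        if t0 > t1:
            t0, t1 = t1, t0
        t_near, t_far = max(t_near, t0), min(t_far, t1)
        if t_near > t_far:
            return False
    return True

def candidate_triangles(node, origin, direction):
    """Collect triangles from every leaf volume the ray passes through."""
    if not ray_hits_box(origin, direction, node["min"], node["max"]):
        return []
    if "triangles" in node:
        return list(node["triangles"])
    found = []
    for child in node["children"]:
        found.extend(candidate_triangles(child, origin, direction))
    return found

BVH = {
    "min": (-2.0, -2.0, 0.0), "max": (2.0, 2.0, 4.0),
    "children": [
        {"min": (-2.0, -2.0, 0.0), "max": (0.0, 2.0, 4.0), "triangles": ["t0", "t1"]},
        {"min": (0.0, -2.0, 0.0), "max": (2.0, 2.0, 4.0), "triangles": ["t2", "t3"]},
    ],
}
```

A ray that misses the outermost box is dismissed with a single test, and one that hits only one child never touches the triangles stored in the other.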
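To get a feel for the numbers, the following back-of-the-envelope sketch counts rays under a deliberately simple assumption: every hit spawns the same number of secondary rays, up to a fixed bounce limit. The function names and the choice of two secondary rays over three bounces are illustrative only, and the result is an upper bound since real rays often miss.

```python
def rays_per_pixel(secondary_per_hit, max_bounces):
    """One primary ray plus every secondary ray it can spawn, level by level."""
    return sum(secondary_per_hit ** level for level in range(max_bounces + 1))

def rays_per_frame(width, height, secondary_per_hit, max_bounces):
    """Upper bound on the rays traced for a single frame."""
    return width * height * rays_per_pixel(secondary_per_hit, max_bounces)

primary_only = rays_per_frame(800, 600, secondary_per_hit=0, max_bounces=0)
with_bounces = rays_per_frame(800, 600, secondary_per_hit=2, max_bounces=3)
```

With one reflection and one refraction ray per hit and three bounces, the 480,000 primary rays grow to as many as 7.2 million rays per frame, before any extra rays for lights and shadows are counted.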
Even with modern PC hardware, the amount of work required is too great to do this in real time for a current 3D game. In our 3D rendering 101 article, we saw in a ray tracing benchmark that it took tens of seconds to produce a single low resolution image.
So how was the original Wolfenstein 3D doing ray casting, way back in 1992, and why do the likes of Battlefield V and Metro Exodus, released in 2018 and 2019 respectively, offer ray tracing capabilities? Are they doing rasterization or ray tracing? The answer is: a bit of both.
The hybrid approach for now and the future
In March 2018, Microsoft announced a new API extension for Direct3D 12, called DXR (DirectX Raytracing). This was a new graphics pipeline, one to complement the standard rasterization and compute pipelines. The additional functionality was provided through the introduction of new shaders, data structures, and so on, but didn’t require any specific hardware support, other than that already required for Direct3D 12.
At the same Game Developers Conference, where Microsoft talked about DXR, Electronic Arts talked about their Pica Pica Project — a 3D engine experiment that utilized DXR. They showed that ray tracing can be used, but not for the full rendering frame. Instead, traditional rasterization and compute shader techniques would be used for the bulk of the work, with DXR employed for specific areas — this means that the number of rays generated is far smaller than it would be for a whole scene.
This hybrid approach had been used in the past, albeit to a lesser extent. For example, Wolfenstein 3D used ray casting to work out how the rendered frame would appear, although it was done with one ray per column of pixels, rather than per pixel. This still might seem to be very impressive, until you realize that the game originally ran at a resolution of 320 x 200, so no more than 320 rays were ever running at the same time.
The graphics cards of early 2018, the likes of AMD’s Radeon RX 580 or Nvidia’s GeForce GTX 1080 Ti, certainly met the hardware requirements for DXR, but even with their compute capabilities, there were some misgivings that they would be powerful enough to actually utilize DXR in any meaningful way.
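The one-ray-per-column idea can be sketched on a tiny grid map. This is not the game’s actual code: it marches each ray forward in small fixed steps instead of using a proper grid-stepping algorithm, skips the usual fisheye correction, and the map, the player position and the column count of 320 are all assumptions for illustration.

```python
import math

GRID = [
    "#####",
    "#...#",
    "#...#",
    "#####",
]

def cast_ray(player_x, player_y, angle, step=0.01, max_distance=20.0):
    """March along one ray until it enters a wall cell and return the distance."""
    dx, dy = math.cos(angle), math.sin(angle)
    for i in range(int(max_distance / step)):
        distance = i * step
        x, y = player_x + dx * distance, player_y + dy * distance
        if GRID[int(y)][int(x)] == "#":
            return distance
    return max_distance

def cast_columns(player_x, player_y, facing, fov, columns):
    """Cast one ray per screen column, spread evenly across the field of view."""
    distances = []
    for column in range(columns):
        angle = facing - fov / 2.0 + fov * (column + 0.5) / columns
        distances.append(cast_ray(player_x, player_y, angle))
    return distances

COLUMNS = 320  # assumed screen width in pixels
depths = cast_columns(2.5, 1.5, facing=0.0, fov=math.radians(60.0), columns=COLUMNS)
wall_heights = [1.0 / distance for distance in depths]  # nearer walls are drawn taller
```

Each column of the frame needs just one distance value, and the height of the wall slice drawn in that column is scaled by it. That is why so few rays were enough.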
This somewhat changed in August 2018, when Nvidia launched their newest GPU architecture, code-named Turing. The critical feature of this chip was the introduction of so-called RT Cores: dedicated logic units for accelerating ray-triangle intersection and bounding volume hierarchy (BVH) traversal calculations. These two processes are time consuming routines for working out where a light interacts with the triangles that make up various objects within a scene. Given that RT Cores were unique to the Turing processor, access to them could only be done via Nvidia’s proprietary API.
The first game to support this feature was EA’s Battlefield V and when we tested the use of DXR, we were impressed by the improvement to water, glass, and metal reflections in the game, but rather less so with the subsequent performance hit:
To be fair, later patches improved matters somewhat but there was (and still is) a big drop in the speed at which frames were being rendered. By 2019, some other games were appearing that supported this API, performing ray tracing for specific parts within a frame. We tested Metro Exodus and Shadow of the Tomb Raider, and found a similar story — where it was used heavily, DXR would notably affect the frame rate.
Around about the same time, UL Benchmarks announced a DXR feature test for 3DMark:
However, our examination of the DXR-enabled games and the 3DMark feature test proved one thing is certain about ray tracing: in 2019, it’s still seriously hard work for the graphics processor, even for the $1,000+ models. So does that mean that we don’t have any real alternative to rasterization?
Cutting-edge features in consumer 3D graphics technology are often very expensive and the initial support of new API capabilities can be rather patchy or slow (as we found when we tested Max Payne 3 across a range of Direct3D versions circa 2012). The latter is commonly due to game developers trying to include as many of the enhanced features as possible, sometimes with limited experience of them.
But where vertex and pixel shaders, tessellation, HDR rendering, and screen space ambient occlusion were once all highly demanding, suitable for top-end GPUs only, their use is now commonplace in games and supported by a wide range of graphics cards. The same will be true of ray tracing and given time, it will just become another detail setting that becomes enabled by default for most users.
Some closing thoughts
And so we come to the end of our second deep dive, where we’ve taken a deeper look into the world of 3D graphics. We’ve looked at how the vertices of models and worlds are shifted out of 3 dimensions and transformed into a flat, 2D picture. We saw how field of view settings have to be accounted for and what effect they produce. The process of making those vertices into pixels was explored, and we finished with a brief look at an alternative process to rasterization.
As before, we couldn’t possibly have covered everything and have glossed over a few details here and there. After all, this isn’t a textbook! But we hope you’ve gained a bit more knowledge along the way and have a newfound admiration for the programmers and engineers who have truly mastered the math and science required to make all of this happen in your favorite 3D titles.
We’ll be more than happy to answer any questions you have, so feel free to send them our way in the comments section. Until the next one.
Masthead credit: Monochrome printing raster abstract by Aleksei Derin