
clustered-shading's Introduction

Clustered Shading

Clustered shading is a technique that allows for efficient rendering of thousands of dynamic lights in 3D perspective games. It can be integrated into both forward and deferred rendering pipelines with minimal intrusion.

Showcase

Images: sponza_demo (top, 512 lights) and flat_demo (bottom, 1024 lights). Both scenes are rendered using clustered deferred shading on an Intel CometLake-H GT2 iGPU at 60 fps.

Overview

The "traditional" method for dynamic lighting is to loop over every light in the scene to shade a single fragment. But this is a huge limitation as there can be millions of fragments to shade on a modern display.

What if we could just loop over the lights we know will affect a given fragment? Enter clustered shading.

Important

Clustered shading divides the view frustum into 3D blocks (clusters) and assigns lights to each based on the light's influence. If a light is too far away, it is not visible to the cluster. Then in the shading step, a fragment retrieves the light list for the cluster it's in. This increases efficiency by only considering lights that are very likely to affect the fragment.

Clustered shading can be thought of as the "natural evolution" to traditional dynamic lighting. It's not a super well known technique. Since its introduction in 2012, clustered shading has mostly stayed in the realm of research papers and behind the doors of big game studios. My goal is to present a simple implementation with clear reasoning behind every decision. Something that might fit in a LearnOpenGL article.

We'll be using OpenGL 4.3 and C++. I'll assume you have working knowledge of both.

Tip

If you are viewing this in dark mode on GitHub, I recommend trying out light mode high contrast for easier reading. Text can appear blurry in dark mode.

Step 1: Splitting the view frustum into clusters

The view frustum and camera position (center of projection) form a pyramid like shape

The definition of the view frustum is the space between the zNear and zFar planes. This is the part the camera can "see". Shading is only done on fragments that are in the frustum.

Our goal is to divide this volume into a 3D grid of clusters. We'll define the clusters in view space so they are always relative to where the camera is.

Division Scheme

uniform division (left) and exponential division (right)

There are two main ways to distribute the frustum along the depth: uniform and exponential division.

Exponential division lets us cover the same depth range with fewer slices. It does make the far clusters large, so a lot of lights can end up assigned to them, but we generally don't care: under perspective projection, less of an object appears on screen the further out it is, so there are fewer fragments to shade out there.

So exponential division it is. We'll use the equation below, which closely matches the image on the right.

$$Z=\text{Near}_z\left(\frac{\text{Far}_z}{\text{Near}_z}\right)^{\frac{\text{slice}}{\text{numslices}}}$$
  • $\text{Near}_z$ and $\text{Far}_z$ represent the near and far planes
  • $\text{slice}$ is the current slice index
  • $\text{numslices}$ is the total number of slices to divide with.

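This equation gives us the positive Z depth from the camera at which a slice begins, where $Z$ is some value between the near and far planes. Slice 0 starts at the near plane, and plugging in $\text{slice} = \text{numslices}$ gives the far plane.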
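To get a feel for the numbers, here is a minimal C++ sketch that evaluates the equation for every slice boundary. The helper names `sliceDepth` and `printSliceDepths` are made up for this illustration and are not part of the tutorial code.

```cpp
#include <cassert>
#include <cmath>
#include <cstdio>

// Positive view-space depth of a slice boundary:
//   Z = near * (far / near)^(slice / numSlices)
// sliceDepth is a hypothetical helper name used only in this sketch.
float sliceDepth(float zNear, float zFar, unsigned int slice, unsigned int numSlices)
{
    assert(zNear > 0.0f && zFar > zNear && numSlices > 0);
    const float exponent = static_cast<float>(slice) / static_cast<float>(numSlices);
    return zNear * std::pow(zFar / zNear, exponent);
}

// Prints the near and far depth of every slice.
void printSliceDepths(float zNear, float zFar, unsigned int numSlices)
{
    for (unsigned int slice = 0; slice < numSlices; ++slice)
    {
        std::printf("slice %2u: [%9.3f, %9.3f]\n", slice,
                    sliceDepth(zNear, zFar, slice, numSlices),
                    sliceDepth(zNear, zFar, slice + 1, numSlices));
    }
}
```

Calling `printSliceDepths(0.1f, 400.0f, 24)` should show the first slice starting at the near plane, the last slice ending at the far plane, and every slice thicker than the one before it by a constant ratio.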

Cluster Dimensions

In addition to slicing the frustum along the depth, we also need to divide each slice along the x and y axes. Which subdivision scheme to use is up to you. If your near and far planes are very far apart, you'll want more depth slices.

A good place to start is 16x9x24 (x, y, z-depth) which is what DOOM 2016 uses. I personally use 12x12x24 to show the division scheme can be anything you choose.

Cluster shape

The simplest way to represent the shape of the clusters is an AABB (Axis Aligned Bounding Box). Unfortunately, a side effect is that the AABBs must overlap to cover the frustum shape. This image shows that.

The points used to create the AABB cause overlapping boundaries

This still gives good results performance wise. You could choose a more accurate shape and improve shading time as lights are better assigned to their clusters. But what you're ultimately doing is trading faster shading for slower culling.

In fact, I'll make a bold claim: This algorithm does not benefit from more accurate cluster shapes or distributions of clusters. The reason being, there is not a lot of room for optimization without complicating and slowing down the cluster creation or culling step.

Implementation

We use a compute shader to build the cluster grid. This is all fully functioning code, taken straight from my OpenGL playground project.

GLSL
#version 430 core
layout(local_size_x = 1, local_size_y = 1, local_size_z = 1) in;

struct Cluster
{
    vec4 minPoint;
    vec4 maxPoint;
    uint count;
    uint lightIndices[100];
};

layout(std430, binding = 1) restrict buffer clusterSSBO {
    Cluster clusters[];
};

uniform float zNear;
uniform float zFar;

uniform mat4 inverseProjection;
uniform uvec3 gridSize;
uniform uvec2 screenDimensions;

vec3 screenToView(vec2 screenCoord);
vec3 lineIntersectionWithZPlane(vec3 startPoint, vec3 endPoint, float zDistance);

void main()
{
    // Eye position is zero in view space
    const vec3 eyePos = vec3(0.0);

    uint tileIndex = gl_WorkGroupID.x + (gl_WorkGroupID.y * gridSize.x) +
            (gl_WorkGroupID.z * gridSize.x * gridSize.y);
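    // convert to float before dividing so the tile size is not truncated
    // when the screen size is not a multiple of the grid size
    vec2 tileSize = vec2(screenDimensions) / vec2(gridSize.xy);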

    // calculate the min and max points of a tile in screen space
    vec2 minPoint_screenSpace = gl_WorkGroupID.xy * tileSize;
    vec2 maxPoint_screenSpace = (gl_WorkGroupID.xy + 1) * tileSize;

    // convert them to view space sitting on the near plane
    vec3 minPoint_viewSpace = screenToView(minPoint_screenSpace);
    vec3 maxPoint_viewSpace = screenToView(maxPoint_screenSpace);

    float tileNear =
        zNear * pow(zFar / zNear, gl_WorkGroupID.z / float(gridSize.z));
    float tileFar =
        zNear * pow(zFar / zNear, (gl_WorkGroupID.z + 1) / float(gridSize.z));

    // Find the 4 intersection points from a tile's min/max points to this cluster's
    // near and far planes
    vec3 minPointNear =
        lineIntersectionWithZPlane(eyePos, minPoint_viewSpace, tileNear);
    vec3 minPointFar =
        lineIntersectionWithZPlane(eyePos, minPoint_viewSpace, tileFar);
    vec3 maxPointNear =
        lineIntersectionWithZPlane(eyePos, maxPoint_viewSpace, tileNear);
    vec3 maxPointFar =
        lineIntersectionWithZPlane(eyePos, maxPoint_viewSpace, tileFar);

    vec3 minPointAABB = min(minPointNear, minPointFar);
    vec3 maxPointAABB = max(maxPointNear, maxPointFar);

    clusters[tileIndex].minPoint = vec4(minPointAABB, 0.0);
    clusters[tileIndex].maxPoint = vec4(maxPointAABB, 0.0);
}

// Returns the intersection point of an infinite line and a
// plane perpendicular to the Z-axis
vec3 lineIntersectionWithZPlane(vec3 startPoint, vec3 endPoint, float zDistance)
{
    vec3 direction = endPoint - startPoint;
    vec3 normal = vec3(0.0, 0.0, -1.0); // plane normal

    // skip check if the line is parallel to the plane.

    float t = (zDistance - dot(normal, startPoint)) / dot(normal, direction);
    return startPoint + t * direction; // the parametric form of the line equation
}

vec3 screenToView(vec2 screenCoord)
{
    // normalize screenCoord to [-1, 1] and
    // set the NDC depth of the coordinate to be on the near plane. This is -1 by
    // default in OpenGL
    vec4 ndc = vec4(screenCoord / screenDimensions * 2.0 - 1.0, -1.0, 1.0);

    vec4 viewCoord = inverseProjection * ndc;
    viewCoord /= viewCoord.w;
    return viewCoord.xyz;
}
C++
namespace Compute
{
constexpr unsigned int gridSizeX = 12;
constexpr unsigned int gridSizeY = 12;
constexpr unsigned int gridSizeZ = 24;
constexpr unsigned int numClusters = gridSizeX * gridSizeY * gridSizeZ;

struct alignas(16) Cluster
{
  glm::vec4 minPoint;
  glm::vec4 maxPoint;
  unsigned int count;
  unsigned int lightIndices[100];
};

unsigned int clusterGridSSBO;

void init_ssbos()
{
  // clusterGridSSBO
  {
    glGenBuffers(1, &clusterGridSSBO);
    glBindBuffer(GL_SHADER_STORAGE_BUFFER, clusterGridSSBO);

    // NOTE: we only need to allocate memory. No need for initialization because
    // comp shader builds the AABBs.
    glBufferData(GL_SHADER_STORAGE_BUFFER, sizeof(Cluster) * numClusters,
                 nullptr, GL_STATIC_COPY);
    glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 1, clusterGridSSBO);
  }
}

Shader clusterComp;

void cull_lights_compute(const Camera &camera)
{
  auto [width, height] = Core::get_framebuffer_size();

  // build AABBs every frame
  clusterComp.use();
  clusterComp.set_float("zNear", camera.near);
  clusterComp.set_float("zFar", camera.far);
  clusterComp.set_mat4("inverseProjection", glm::inverse(camera.projection));
  clusterComp.set_uvec3("gridSize", {gridSizeX, gridSizeY, gridSizeZ});
  clusterComp.set_uvec2("screenDimensions", {width, height});

  glDispatchCompute(gridSizeX, gridSizeY, gridSizeZ);
  glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);
}

void init()
{
  init_ssbos();
  // load shaders
  clusterComp = Shader("clusterShader.comp");
}

I encourage you to stop here and adapt this code into your game or engine and study it! See if you can spot the exponential formula from earlier.

We divide the screen into tiles and convert the tile corner points to view space, sitting on the camera near plane. This essentially leaves us with a divided near plane. Then for each min and max point of a tile on the near plane, we draw a line from the origin through that point and intersect it with the cluster's near and far planes. The intersection points together form the bounds of the AABB.

Note

screenDimensions is more accurately thought of as the dimensions of the glViewport in which lighting will be done.

And a few notes on the C++ side:

  1. glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT); ensures the writes to the SSBO (Shader Storage Buffer Object) by the compute shader are visible to the next shader.

  2. alignas(16) is used to correctly match the C++ struct memory layout with how the SSBO expects it.

    According to the OpenGL Spec, the std430 memory layout requires the base alignment of an array of structures to be the base alignment of the largest member in the structure.

    struct alignas(16) Cluster
    {
      glm::vec4 minPoint; // 16 bytes
      glm::vec4 maxPoint; // 16 bytes
      unsigned int count; // 4 bytes
      unsigned int lightIndices[100]; // 400 bytes
    };

    The largest element in this struct is a vec4 of 16 bytes. Since we are storing an array of Cluster, the struct should have its memory aligned to 16 bytes. If you don't know what memory alignment is, don't worry. We basically need to add padding bytes to make the total struct size a multiple of 16 bytes. We can manually add some dummy variables or let the compiler handle it with alignas.

    If you are not storing an array of structures in std430, and as long as you stay away from vec3, you probably don't need to worry about alignment.
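The padding described above can be checked at compile time. The sketch below is only an illustration: `Vec4` is a stand-in for `glm::vec4` so the snippet compiles without GLM, `PackedCluster` is a hypothetical struct that shows the size without `alignas`, and it assumes 4-byte `float` and `unsigned int`.

```cpp
#include <cstddef>

// Stand-in for glm::vec4 (assumption: four tightly packed floats, 16 bytes).
struct Vec4 { float x, y, z, w; };

// Same members as Cluster but without alignas; hypothetical, for comparison only.
struct PackedCluster
{
    Vec4 minPoint;
    Vec4 maxPoint;
    unsigned int count;
    unsigned int lightIndices[100];
};

struct alignas(16) Cluster
{
    Vec4 minPoint;
    Vec4 maxPoint;
    unsigned int count;
    unsigned int lightIndices[100];
};

static_assert(sizeof(PackedCluster) == 436, "16 + 16 + 4 + 400 bytes, no tail padding");
static_assert(sizeof(Cluster) == 448, "rounded up to the next multiple of 16");
static_assert(alignof(Cluster) == 16, "matches the 16-byte base alignment of vec4");
static_assert(offsetof(Cluster, count) == 32, "count follows the two vec4 members");
static_assert(offsetof(Cluster, lightIndices) == 36, "uint arrays are tightly packed in std430");
```

If one of these assertions fails on your platform, the C++ struct and the GLSL struct no longer describe the same bytes, and the array stride seen by the shader will be off.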

Step 2: Assigning Lights to Clusters (Culling)

Our goal now is to cull the lights by assigning them to clusters based on the light influence. The most common type of light used in games is the point light. A point light has a position and radius which define a sphere of influence.

We brute force test every light against every cluster. If there is an intersection between the sphere and AABB, the light is visible to the cluster, and it is appended to the cluster's local list.

Let's look at the cluster struct.

  struct Cluster
  {
    vec4 minPoint; // min point of AABB in view space
    vec4 maxPoint; // max point of AABB in view space
    uint count;
    uint lightIndices[100]; // elements point directly to global light list
  };
  • The min and max define the AABB of this cluster like before.

  • lightIndices contains the lights visible to this cluster. We hardcode a max of 100 lights visible to a cluster at any time. You'll see this number used a few times elsewhere. If you want to increase the number, make sure to change it everywhere.

  • count keeps track of how many lights are visible. It tells us how many entries to read from the lightIndices array.

We'll use another compute shader to cull the lights. Compute shaders are just so awesome because they are general purpose, which makes them great for parallel tasks. In our case, that task is testing thousands of lights against thousands of clusters.

Let's have each compute shader thread process a single cluster.

Is it really called a thread?
  • Strictly speaking, no, compute shaders don't use the term "thread" like CPUs. Compute shaders have workgroups and invocations. Each workgroup has its own invocations (called workgroup size or local_size). Each invocation is an independent execution of the main function. But it's helpful to think of invocations as threads.

    Read more about compute shaders on the OpenGL Wiki.

Implementation

GLSL
#version 430 core

#define LOCAL_SIZE 128
layout(local_size_x = LOCAL_SIZE, local_size_y = 1, local_size_z = 1) in;

struct PointLight
{
    vec4 position;
    vec4 color;
    float intensity;
    float radius;
};

struct Cluster
{
    vec4 minPoint;
    vec4 maxPoint;
    uint count;
    uint lightIndices[100];
};

layout(std430, binding = 1) restrict buffer clusterSSBO
{
    Cluster clusters[];
};

layout(std430, binding = 2) restrict buffer lightSSBO
{
    PointLight pointLight[];
};

uniform mat4 viewMatrix;

bool testSphereAABB(uint i, Cluster c);

// each invocation of main() is a thread processing a cluster
void main()
{
    uint lightCount = pointLight.length();
    uint index = gl_WorkGroupID.x * LOCAL_SIZE + gl_LocalInvocationID.x;
    Cluster cluster = clusters[index];

    // we need to reset count because culling runs every frame.
    // otherwise it would accumulate.
    cluster.count = 0;

    for (uint i = 0; i < lightCount; ++i)
    {
        if (testSphereAABB(i, cluster) && cluster.count < 100)
        {
            cluster.lightIndices[cluster.count] = i;
            cluster.count++;
        }
    }
    clusters[index] = cluster;
}

bool sphereAABBIntersection(vec3 center, float radius, vec3 aabbMin, vec3 aabbMax)
{
    // closest point on the AABB to the sphere center
    vec3 closestPoint = clamp(center, aabbMin, aabbMax);
    // squared distance between the sphere center and closest point
    float distanceSquared = dot(closestPoint - center, closestPoint - center);
    return distanceSquared <= radius * radius;
}

// this just unpacks data for sphereAABBIntersection
bool testSphereAABB(uint i, Cluster cluster)
{
    vec3 center = vec3(viewMatrix * pointLight[i].position);
    float radius = pointLight[i].radius;

    vec3 aabbMin = cluster.minPoint.xyz;
    vec3 aabbMax = cluster.maxPoint.xyz;

    return sphereAABBIntersection(center, radius, aabbMin, aabbMax);
}

Now let's update the C++ code, mainly to create the lights. How exactly this is done is different for everyone, but the following suits the basic purpose.

C++

The important part is to create and fill the light SSBO. Note the use of alignas in the PointLight struct definition.

struct alignas(16) PointLight
{
  glm::vec4 position;
  glm::vec4 color;
  float intensity;
  float radius;
};

int main()
{

  std::mt19937 rng{std::random_device{}()};

  constexpr int numLights = 512;
  std::uniform_real_distribution<float> distXZ(-100.0f, 100.0f);
  std::uniform_real_distribution<float> distY(0.0f, 55.0f);

  std::vector<PointLight> lightList;
  lightList.reserve(numLights);
  for (int i = 0; i < numLights; i++)
  {
    PointLight light{};
    float x = distXZ(rng);
    float y = distY(rng);
    float z = distXZ(rng);

    glm::vec4 position(x, y, z, 1.0f);

    light.position = position;
    light.color = {1.0, 1.0, 1.0, 1.0};
    light.intensity = 1;
    light.radius = 5.0f;

    lightList.push_back(light);
  }

  unsigned int lightSSBO = 0; // buffer handle, declared here so the snippet is complete
  glGenBuffers(1, &lightSSBO);
  glBindBuffer(GL_SHADER_STORAGE_BUFFER, lightSSBO);

  glBufferData(GL_SHADER_STORAGE_BUFFER, lightList.size() * sizeof(PointLight),
               lightList.data(), GL_DYNAMIC_DRAW);

  glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 2, lightSSBO);
}

namespace Compute
{
constexpr unsigned int gridSizeX = 12;
constexpr unsigned int gridSizeY = 12;
constexpr unsigned int gridSizeZ = 24;
constexpr unsigned int numClusters = gridSizeX * gridSizeY * gridSizeZ;

struct alignas(16) Cluster
{
  glm::vec4 minPoint;
  glm::vec4 maxPoint;
  unsigned int count;
  unsigned int lightIndices[100];
};

unsigned int clusterGridSSBO;

void init_ssbos()
{
  // clusterGridSSBO
  {
    glGenBuffers(1, &clusterGridSSBO);
    glBindBuffer(GL_SHADER_STORAGE_BUFFER, clusterGridSSBO);

    // NOTE: we only need to allocate memory. No need for initialization because
    // comp shader builds the AABBs.
    glBufferData(GL_SHADER_STORAGE_BUFFER, sizeof(Cluster) * numClusters,
                 nullptr, GL_STATIC_COPY);
    glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 1, clusterGridSSBO);
  }
}

Shader clusterComp;
Shader cullLightComp;

void cull_lights_compute(const Camera &camera)
{
  auto [width, height] = Core::get_framebuffer_size();

  // build AABBs, doesn't need to run every frame but fast
  clusterComp.use();
  clusterComp.set_float("zNear", camera.near);
  clusterComp.set_float("zFar", camera.far);
  clusterComp.set_mat4("inverseProjection", glm::inverse(camera.projection));
  clusterComp.set_uvec3("gridSize", {gridSizeX, gridSizeY, gridSizeZ});
  clusterComp.set_uvec2("screenDimensions", {width, height});

  glDispatchCompute(gridSizeX, gridSizeY, gridSizeZ);
  glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);

  // cull lights
  cullLightComp.use();
  cullLightComp.set_mat4("viewMatrix", camera.view);

  glDispatchCompute(27, 1, 1);
  glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);
}

void init()
{
  init_ssbos();
  // load shaders
  clusterComp = Shader("clusterShader.comp");
  cullLightComp = Shader("clusterCullLightShader.comp");
}

The compute shader has 128 "threads" per workgroup. We dispatch 27 workgroups for a total of 3456 threads. This is to fit the design of each thread processing a single cluster. Remember we have 12x12x24 = 3456 clusters. If you change anything, make sure to change your dispatch to match the total thread count with the number of clusters.

Also, since each thread processes its own cluster and writes to its own part of the SSBO memory, we don't need to use any shared memory or atomic operations! This keeps the compute shader as parallel as possible.
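Rather than hardcoding 27, the workgroup count can be derived from the grid size. This is a small sketch under the tutorial's numbers; `localSize` and `workgroupCount` are names introduced here for illustration, and `localSize` has to be kept in sync with `LOCAL_SIZE` in the shader by hand.

```cpp
constexpr unsigned int gridSizeX = 12;
constexpr unsigned int gridSizeY = 12;
constexpr unsigned int gridSizeZ = 24;
constexpr unsigned int numClusters = gridSizeX * gridSizeY * gridSizeZ;

// Must match LOCAL_SIZE in the light culling compute shader.
constexpr unsigned int localSize = 128;

// Number of workgroups needed so that every cluster gets one invocation.
// Rounds up, so a cluster count that is not a multiple of groupSize is still covered.
constexpr unsigned int workgroupCount(unsigned int clusterCount, unsigned int groupSize)
{
    return (clusterCount + groupSize - 1) / groupSize;
}

static_assert(workgroupCount(numClusters, localSize) == 27, "3456 clusters / 128 invocations");
```

The dispatch then becomes `glDispatchCompute(workgroupCount(numClusters, localSize), 1, 1);`. Note that when the division does not come out even, rounding up launches a few extra invocations, so the shader would also need a bounds check on its cluster index before touching the SSBO. That check is not part of the shader shown above.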

Step 3: Consumption in fragment shader

Now that we have built our cluster grid and assigned lights to clusters, we can finally consume this data.

We calculate the cluster a fragment is in, retrieve the light list for that cluster, and do cool lighting. This is basically reversing the calculations in the cluster compute shader to solve for the xyz indexes of a cluster.

Your lighting shader should look something like this

#version 430 core

// PointLight and Cluster struct definitions
//...
// bind to light and cluster ssbo
// same as cull compute shader

uniform float zNear;
uniform float zFar;
uniform uvec3 gridSize;
uniform uvec2 screenDimensions;

out vec4 FragColor;

void main()
{
    //view space position of a fragment. Replace with your implementation
    vec3 FragPos = texture(gPosition, TexCoords).rgb;

    // Locating which cluster this fragment is part of
    uint zTile = uint((log(abs(FragPos.z) / zNear) * gridSize.z) / log(zFar / zNear));
    // convert to float before dividing so the tile size is not truncated
    vec2 tileSize = vec2(screenDimensions) / vec2(gridSize.xy);
    uvec3 tile = uvec3(gl_FragCoord.xy / tileSize, zTile);
    uint tileIndex =
        tile.x + (tile.y * gridSize.x) + (tile.z * gridSize.x * gridSize.y);

    uint lightCount = clusters[tileIndex].count;

    for (int i = 0; i < lightCount; ++i)
    {
        uint lightIndex = clusters[tileIndex].lightIndices[i];
        PointLight light = pointLight[lightIndex];
        // do cool lighting
    }
}

Here FragPos is the view space position of the fragment. The absolute value of FragPos.z gives us the positive Z depth of the fragment from the camera. Remember, that's exactly the left hand side of the exponential equation from earlier.

Solving that earlier equation for the slice results in the z index of the cluster.

uint zTile = uint((log(abs(FragPos.z) / zNear) * gridSize.z) / log(zFar / zNear));

Finding the xy index of the cluster is very simple. We have the screen coordinates of the fragment from gl_FragCoord.xy, so we just need to divide by the tileSize. Again, this is the reverse of what the cluster compute shader does.
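When debugging, it can help to reproduce the lookup on the CPU and compare it with what the shader picks. The following is a rough C++ mirror of the index calculation, not code from the project: `GridSize` and `clusterIndex` are hypothetical names, and the clamping to the last tile is an extra safety net that the shader above does not have.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical helper type; the shaders pass the grid size as a uvec3 uniform.
struct GridSize { unsigned int x, y, z; };

// fragX / fragY are window coordinates (like gl_FragCoord.xy),
// viewZ is the view-space z of the fragment (negative in front of the camera).
unsigned int clusterIndex(float fragX, float fragY, float viewZ,
                          float zNear, float zFar,
                          float screenWidth, float screenHeight,
                          GridSize grid)
{
    assert(zNear > 0.0f && zFar > zNear);
    assert(fragX >= 0.0f && fragY >= 0.0f);
    assert(grid.x > 0 && grid.y > 0 && grid.z > 0);

    // slice = log(Z / near) * numSlices / log(far / near)
    const float depth = std::fabs(viewZ);
    const float slice = std::log(depth / zNear) * static_cast<float>(grid.z)
                        / std::log(zFar / zNear);
    // Clamp into [0, grid.z - 1]; this guard is an addition for the sketch.
    const unsigned int zTile =
        std::min(static_cast<unsigned int>(std::max(slice, 0.0f)), grid.z - 1);

    // Screen-space tile size in pixels, computed in floating point.
    const float tileWidth = screenWidth / static_cast<float>(grid.x);
    const float tileHeight = screenHeight / static_cast<float>(grid.y);
    const unsigned int xTile =
        std::min(static_cast<unsigned int>(fragX / tileWidth), grid.x - 1);
    const unsigned int yTile =
        std::min(static_cast<unsigned int>(fragY / tileHeight), grid.y - 1);

    // Same flattening as in the cluster compute shader.
    return xTile + yTile * grid.x + zTile * grid.x * grid.y;
}
```

With a 12x12x24 grid at 1920x1080 and planes at 0.1 and 400, a fragment in the screen center at view-space depth 10 lands in tile (6, 6, 13).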

Common Problems

A common problem is flickering artifacts. This could be either:

  1. Your light is affecting fragments outside its defined radius. This causes uneven lighting. Try adding a range check to your attenuation.

  2. There are too many lights visible to a single cluster. Remember we hardcoded a max of 100 lights per cluster at any time. If this limit is hit, further intersecting lights will be ignored, and their assignment will become unpredictable. This can happen at further out clusters, since the exponential division causes those clusters to be very large.

    Solution: Increase the light limit. The only cost is more GPU memory. You can also add a check in your lighting shader to output a warning color.

     uint lightCount = clusters[tileIndex].count;
     if (lightCount > 95) {
         //getting close to limit. Output red color and dip
         FragColor = vec4(1.0f, 0.0f, 0.0f, 1.0f);
         return;
     }
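For the first cause, a range check can be sketched as follows. This is one possible falloff (inverse square multiplied by a window that reaches zero at the radius), written in C++ so it can be tested on the CPU. It is an assumption for illustration and not necessarily what the demo uses, and `attenuate` is a hypothetical name.

```cpp
#include <cassert>

// Attenuation that is guaranteed to be zero at and beyond the light radius,
// so a light never contributes outside the sphere used for culling.
float attenuate(float distance, float radius)
{
    assert(radius > 0.0f && distance >= 0.0f);

    if (distance >= radius)
        return 0.0f; // range check: outside the sphere of influence

    const float falloff = 1.0f / (1.0f + distance * distance);
    const float ratio = distance / radius;
    const float window = 1.0f - ratio * ratio; // 1 at the light, 0 at the radius
    return falloff * window * window;
}
```

The same few lines port directly to GLSL. The important property is only that the result is exactly zero once the distance reaches the radius.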

Benchmarks

The following benchmarks were measured using glFinish() and regular C++ clocks on my Linux machine using an Intel CometLake-H GT2. I found the integrated GPU results were more in line with what I expected. They also show the competitiveness of the algorithm on low-end hardware.

The scene uses clustered shading with deferred rendering, without any optimizations like frustum culling.

  • 12x12x24 cluster grid
  • Camera near and far planes (0.1, 400)
  • Light XZ positions allowed to range (-100, 100) and vertical Y (0, 55)
  • 1920x1080 resolution
  • Sponza model
| | Building Clusters | Light Assignment | Shading |
|---|---|---|---|
| 512 lights (13.0f) | 0.28 ms | 0.95 ms | 5.23 ms |
| 1,024 lights (7.0f) | 0.27 ms | 1.50 ms | 3.71 ms |
| 2,048 lights (3.0f) | 0.42 ms | 2.61 ms | 2.84 ms |
| 4,096 lights (2.0f) | 0.29 ms | 5.15 ms | 3.28 ms |

Optimization

The benchmarks show constructing the cluster grid takes constant time, while shading perf is largely affected by light radius. However, a bottleneck starts to appear in assigning lights to clusters. This makes sense since we are brute force testing every light against every cluster. We need some way to reduce the number of lights being tested.

One way is to build a BVH (Bounding Volume Hierarchy) over the lights and traverse it in the culling step. This can produce good results, but IMO it overcomplicates things. Clustered shading is already an optimization technique, and I have doubts about spiraling into a rabbit hole of optimizing the optimizers.

The easiest solution here is to frustum cull the lights and update the light SSBO every frame. Thus, we only test lights that are in the view frustum. Frustum culling is fast and already standard in many games.
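A CPU-side version of that culling step might look like the sketch below. All types and names here (`Vec3`, `Plane`, `LightSphere`, `cullLights`) are hypothetical, and the six frustum planes are assumed to be given with unit-length normals pointing into the frustum. How you extract them from your camera is left out.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

// Plane in the form dot(normal, p) + d = 0, with a unit-length normal
// pointing towards the inside of the frustum.
struct Plane { Vec3 normal; float d; };

struct LightSphere { Vec3 center; float radius; };

float signedDistance(const Plane& plane, const Vec3& p)
{
    return plane.normal.x * p.x + plane.normal.y * p.y + plane.normal.z * p.z + plane.d;
}

// Returns the indices of all lights whose sphere of influence is not
// completely behind any of the six frustum planes.
std::vector<std::size_t> cullLights(const std::array<Plane, 6>& frustum,
                                    const std::vector<LightSphere>& lights)
{
    std::vector<std::size_t> visible;
    for (std::size_t i = 0; i < lights.size(); ++i)
    {
        assert(lights[i].radius >= 0.0f);
        bool inside = true;
        for (const Plane& plane : frustum)
        {
            // Entire sphere lies on the outer side of this plane.
            if (signedDistance(plane, lights[i].center) < -lights[i].radius)
            {
                inside = false;
                break;
            }
        }
        if (inside)
            visible.push_back(i);
    }
    return visible;
}
```

The surviving lights would then be copied into a tightly packed array and uploaded to the light SSBO each frame, for example with glBufferSubData. The test is conservative: a sphere near a frustum corner can pass even though it does not actually touch the frustum, which only costs a little extra work in the culling shader.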

Further Reading


Questions, typos, or corrections? Feel free to open an issue or pull request!


clustered-shading's Issues

Great tutorial, only one detail

Hi Dave, this is so far one of the best tutorials I've ever read about clustered shading. One clarification (tell me if I am wrong): in the README, where you specify numClusters, you mean gridSizeX * gridSizeY * gridSizeZ, right? If so, the rest looks good, and I've successfully implemented it in my engine.
Thanks, and yes, this really should be integrated into learnopengl.com

Deferred lighting cluster question [LWJGL]

Hello, thank you very much for the tutorial! I am currently looking for a way to implement cheap multiple lights for my LWJGL-based engine and you described everything very well!

However, I faced one issue: lights are distributed only to some clusters and they have very sharp borders. After some testing, I got to know that they do not pass AABB intersection check. If I increase the radius from 5.0 to a value near 50.0, then lights cover the scene properly, but still have a lot of visual glitches. (on a screenshot, the objects are located in z ~ -30, camera is at world origin facing negative z, positive x is to the left)
Screenshot (334)

For the scene, I use 3 point lights with radius 5.0:
pointLights.add(new PointLight(new Vector3f(-10, 5, -60), new Vector3f(1, 0, 0), 0.3f, 5.0f));
pointLights.add(new PointLight(new Vector3f(20, 5, -60), new Vector3f(0, 1, 0), 0.3f, 5.0f));
pointLights.add(new PointLight(new Vector3f(-30, 5, -60), new Vector3f(0, 0, 1), 0.3f, 5.0f));
(first value is position, second is color, third is intensity, fourth is radius).

This is my clustered lighting class, which contains shaders, deferred scene texture and methods to update lights and draw deferred scene:
public class TestCluster {

private DeferredClusterShader shader;
private DeferredLightAABBCullingShader lightCullingShader;
private DeferredTestShader lightingShader;

private Texture deferredSceneTexture;

private int ssbo;
private int lightSSBO;

private int gridSizeX = 12;
private int gridSizeY = 12;
private int gridSizeZ = 24;
private int numClusters = gridSizeX * gridSizeY * gridSizeZ;

private final int clusterAlignedSize = 448; // (436 + 16 - 1) & ~(16 - 1) = 448 bytes
private final int lightAlignedSize = 48; // (16 + 16 + 4 + 4) = 40 bytes, padded to 48

private int width;
private int height;

private Mesh mesh;
private RenderParameter config;

private boolean isFirstPass = true;

public TestCluster(int width,int height) {
	this.width = width;
	this.height = height;
	this.shader = DeferredClusterShader.getInstance();
	this.lightCullingShader = DeferredLightAABBCullingShader.getInstance();
	this.lightingShader = DeferredTestShader.getInstance();
	
	deferredSceneTexture = new Texture2D(width, height, 
			ImageFormat.RGBA16FLOAT, SamplerFilter.Nearest, TextureWrapMode.ClampToEdge);
	
	createSSBO();
	
}

private void createSSBO() {
	this.ssbo = glGenBuffers();
	glBindBuffer(GL_SHADER_STORAGE_BUFFER, ssbo);
	

	glBufferData(GL_SHADER_STORAGE_BUFFER, clusterAlignedSize * numClusters, GL_STATIC_COPY);
	
	glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 1, ssbo);
}

public void cullLightsCompute() {
	shader.bind();
	shader.updateUniforms(width, height, gridSizeX, gridSizeY, gridSizeZ);
	glDispatchCompute(gridSizeX, gridSizeY, gridSizeZ);
	glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);
}

public void updateLightSSBO(List<PointLight> pointLights) {
	ByteBuffer lightData = BufferUtils.createByteBuffer(lightAlignedSize * pointLights.size());
	FloatBuffer fv = lightData.asFloatBuffer();
	
	for(PointLight light: pointLights) {
		Vector3f p = light.getPosition();
		Vector3f c = light.getColor();
		float i = light.getIntensity();
		float r = light.getRadius();
		
		fv.put(p.x).put(p.y).put(p.z).put(1.0f)
		.put(c.x).put(c.y).put(c.z).put(1.0f)
		.put(i)
		.put(r)
		.put(0.0f).put(0.0f); // 4 + 4 bytes padding
	}
	
	fv.flip();
	
	if(isFirstPass) {
		this.lightSSBO = glGenBuffers();
		isFirstPass = false;
	}
	
	glBindBuffer(GL_SHADER_STORAGE_BUFFER, lightSSBO);
	glBufferData(GL_SHADER_STORAGE_BUFFER, lightData, GL_DYNAMIC_DRAW); // in LWJGL we can put bytes directly without specifying capacity
	glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 2, lightSSBO);
	
}

public void lightAABBIntersection() {
	lightCullingShader.bind();
	lightCullingShader.updateUniforms(GLContext.getMainCamera().getViewMatrix());
	glDispatchCompute(27,1,1); // for 12x12x24 work groups of cluster shader we have 3456 threads, same as for 27 work groups
	glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);
}

public void render(Texture albedo, Texture position, Texture normal, Texture specularEmissionDiffuseSSAOBloom, Texture depth) {
	lightingShader.bind();
	glBindImageTexture(0, deferredSceneTexture.getId(), 0, false, 0, GL_WRITE_ONLY, GL_RGBA16F);
	glBindImageTexture(2, albedo.getId(), 0, false, 0, GL_READ_ONLY, GL_RGBA16F);
	glBindImageTexture(3, position.getId(), 0, false, 0, GL_READ_ONLY, GL_RGBA32F);
	glBindImageTexture(4, normal.getId(), 0, false, 0, GL_READ_ONLY, GL_RGBA16F);
	glBindImageTexture(5, specularEmissionDiffuseSSAOBloom.getId(), 0, false, 0, GL_READ_ONLY, GL_RGBA16F);
	
	lightingShader.updateUniforms(width, height, gridSizeX, gridSizeY, gridSizeZ);
	
	glDispatchCompute(width/2, height/2,1);
}

public Texture getDeferredSceneTexture() {
	return deferredSceneTexture;
}

}

In the shader class, nothing is very specific: it just compiles the program and updates the uniforms. Since Java does not have structs like in your C++ example, I use a byte buffer to load light data into the SSBO. Basically, I just take the data from the light object and put it into the buffer with 8 padding bytes. This is tested and gives proper results in the shader.

For lighting I use compute shader as seen in the code and get the fragment position from global invocation ID, which gives precise result with screenSize / 2 work groups and local size of 2:

#version 430 core

layout (local_size_x = 2, local_size_y = 2) in;

layout (binding = 0, rgba16f) uniform writeonly image2D defferedSceneImage;

layout (binding = 2, rgba16f) uniform readonly image2DMS albedoSampler;
layout (binding = 3, rgba32f) uniform readonly image2DMS worldPositionSampler;
layout (binding = 4, rgba16f) uniform readonly image2DMS normalSampler;
layout (binding = 5, rgba16f) uniform readonly image2DMS specular_emission_diffuse_ssao_bloom_Sampler;

struct PointLight {
vec4 position;
vec4 color;
float intensity;
float radius;
};

struct Cluster {
vec4 minPoint;
vec4 maxPoint;
uint count;
uint lightIndices[100];
};

layout(std430, binding = 1) restrict buffer clusterSSBO
{
Cluster clusters[];
};

layout(std430, binding = 2) restrict buffer lightSSBO
{
PointLight pointLight[];
};

uniform float zNear;
uniform float zFar;
uniform uvec3 gridSize;
uniform uvec2 screenDimensions;

uniform mat4 viewMatrix;

vec3 diffuse(vec3 albedo, vec3 normal, vec3 toLightDirection, vec3 lightColor,
	float lightIntensity) {
float diffuseFactor = max(dot(normal, toLightDirection), 0.0);
return albedo * lightColor * diffuseFactor * lightIntensity;
}

vec3 specular(vec3 normal, vec3 position, vec3 toLightDir, vec3 lightColor,
	float reflectance, float emission, float lightIntensity) {
vec3 cameraDirection = normalize(-position);
vec3 fromLightDir = -toLightDir;
vec3 reflectedLight = normalize(reflect(fromLightDir, normal));
float specularFactor = max(dot(cameraDirection, reflectedLight), 0.0);
specularFactor = pow(specularFactor, emission);
return lightIntensity * specularFactor * reflectance * lightColor;
}

vec3 calculateLight(vec3 albedo, vec3 position, vec3 normal, PointLight light,
	float reflectance, float emission) {
vec3 lightDirection = light.position.xyz - position;
vec3 toLightDir = normalize(lightDirection);

vec3 diffuse = diffuse(albedo, normal, toLightDir, light.color.rgb,
		light.intensity);
vec3 specular = specular(normal, position, toLightDir, light.color.rgb,
		reflectance, emission, light.intensity);

// Attenuation
float distance = length(light.position.xyz - position);
float attenuation = 1.0 / (1.0 + (distance / light.radius));

diffuse *= attenuation;
specular *= attenuation;

return diffuse + specular;
}

void main(void) {
    // computeCoord holds this invocation's pixel coordinates in screen space
    ivec2 computeCoord = ivec2(gl_GlobalInvocationID.x,
            gl_GlobalInvocationID.y);

    vec3 finalColor = vec3(0);
    vec3 albedo = vec3(0);
    vec3 position = vec3(0);
    vec4 normal = vec4(0);
    vec4 specular_emission_diffuse_ssao_bloom = vec4(0);
    vec4 depth = vec4(0);

    albedo = imageLoad(albedoSampler, computeCoord, 0).rgb;
    normal = imageLoad(normalSampler, computeCoord, 0).rbga;

    position = imageLoad(worldPositionSampler, computeCoord, 0).rgb;
    specular_emission_diffuse_ssao_bloom = imageLoad(
            specular_emission_diffuse_ssao_bloom_Sampler, computeCoord, 0).rgba;

    // The view-space fragment position is the sampled world position
    // multiplied by the view matrix; its depth selects the z slice
    uint zTile = uint(
            (log(abs(vec3(viewMatrix * vec4(position, 1.0)).z) / zNear)
                    * gridSize.z) / log(zFar / zNear));
    vec2 tileSize = screenDimensions / gridSize.xy;

    uvec3 tile = uvec3(computeCoord.xy / tileSize, zTile);
    uint tileIndex = tile.x + (tile.y * gridSize.x)
            + (tile.z * gridSize.x * gridSize.y);

    uint lightCount = clusters[tileIndex].count;

    for (int i = 0; i < lightCount; ++i) {
        uint lightIndex = clusters[tileIndex].lightIndices[i];
        PointLight light = pointLight[lightIndex];

        // Lighting
        finalColor += calculateLight(albedo, position, normalize(normal.xyz),
                light, specular_emission_diffuse_ssao_bloom.r,
                specular_emission_diffuse_ssao_bloom.g);
    }

    imageStore(defferedSceneImage, computeCoord, vec4(finalColor, 1.0));
}`
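
For readers following the indexing math, here is a minimal CPU-side sketch in C++ of the cluster lookup the shader above performs: a logarithmic depth slice plus a screen-space tile, flattened into one index. The names `GridSize`, `depthSlice`, `clusterIndex`, and `tileXWithIntegerTileSize` are hypothetical helpers, not part of the original code, and the clamping is an addition that the shader does not have.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical type for illustration: number of clusters per axis.
struct GridSize { uint32_t x, y, z; };

// Logarithmic depth slice for a view-space z (negative in front of the camera).
// Clamped to [0, slices - 1]; the clamp is an assumption added for safety.
inline uint32_t depthSlice(float viewZ, float zNear, float zFar, uint32_t slices) {
    float s = std::log(std::fabs(viewZ) / zNear) * float(slices)
            / std::log(zFar / zNear);
    if (!(s > 0.0f)) return 0;                  // also catches NaN and -inf
    if (s >= float(slices)) return slices - 1;
    return uint32_t(s);
}

// Flattened cluster index for pixel (px, py), using a floating-point tile size.
inline uint32_t clusterIndex(uint32_t px, uint32_t py, float viewZ,
                             uint32_t width, uint32_t height, GridSize grid,
                             float zNear, float zFar) {
    float tileW = float(width) / float(grid.x);
    float tileH = float(height) / float(grid.y);
    uint32_t tx = uint32_t(float(px) / tileW);
    uint32_t ty = uint32_t(float(py) / tileH);
    if (tx >= grid.x) tx = grid.x - 1;          // guard against rounding at the edge
    if (ty >= grid.y) ty = grid.y - 1;
    uint32_t tz = depthSlice(viewZ, zNear, zFar, grid.z);
    return tx + ty * grid.x + tz * grid.x * grid.y;
}

// Tile column when the tile width is computed with unsigned integer division,
// which is what dividing two unsigned vectors does in GLSL.
inline uint32_t tileXWithIntegerTileSize(uint32_t px, uint32_t width, uint32_t gridX) {
    uint32_t tileW = width / gridX;             // truncates, e.g. 1280 / 12 = 106
    return uint32_t(float(px) / float(tileW));
}
```

With the specs listed at the end of this post (1280x720, 12x12x24 grid, zNear 0.1, zFar 10000), pixel (650, 360) at view-space z = -1 falls in tile (6, 6, 4), which is index 654. One observation from the sketch: `screenDimensions / gridSize.xy` divides two unsigned vectors, so the tile width becomes 106 rather than 106.67, and `tileXWithIntegerTileSize(1279, 1280, 12)` returns 12, one past the last valid column. Whether that is related to the problem described here is not confirmed.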

The cluster-building and AABB-intersection shaders are the same as the ones in your tutorial.
This screenshot shows the result of rendering tile.xyz (multiplied by 0.01):
[screenshot: tiles]
And this is the depth slice of each tile (zTile * 0.01):
[screenshot: tilez]

The rendering code runs in the following order:

  1. Initialize the clustered renderer class and run the cluster shader to build the grid and fill the cluster SSBO.
  2. Add the lights to the light SSBO and run a first AABB intersection check for them (3 lights for now).
  3. Update loop:
  • update the camera and scene
  • forward-render the scene to the G-buffer
  • test the lights again for AABB intersection using the shader from the tutorial, updating the light indices in the cluster SSBO
  • run the lighting compute shader, writing to a texture
  • run the post-processing stage on the deferred scene texture
  • draw the final texture to the screen

So far, the clusters are defined correctly. In view space they all have negative z positions, which also seems fine (or not?). The issue starts with the sphere-AABB intersection test. After a bit of testing I found that if I do not upload the view matrix (so the uniform is all zeros rather than identity), the light position becomes zero and the check "distance squared <= radius squared" is always true. When I do upload a view matrix (even an identity matrix), the check returns false in most cases. It returns true when the light position is actually inside the cluster (I am not sure about this; I judged it from the colored clusters in the screenshot above). Just in case, this is my view matrix calculation (a very basic view matrix definition):
`
private Matrix4f updateViewMatrix() {

	viewMatrix.identity();
	
	viewMatrix.rotate((float)Math.toRadians(pitch), new Vector3f(1,0,0));
	viewMatrix.rotate((float)Math.toRadians(yaw), new Vector3f(0,1,0));
	viewMatrix.rotate((float)Math.toRadians(roll), new Vector3f(0,0,1));
	
	Vector3f negativePos = new Vector3f(-position.x, -position.y, -position.z);
	viewMatrix.translate(negativePos);
	
	this.invViewMatirx.set(viewMatrix).invert();
	
	return viewMatrix;
}`
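
The intersection check described above can be sketched on the CPU in C++ as follows. This is a sketch under stated assumptions, not the tutorial's shader: `Vec3`, `Mat4`, `transformPoint`, `sqDistPointAABB`, and `lightTouchesCluster` are hypothetical names, the matrix is assumed to be column-major, and the cluster bounds are assumed to be stored in view space (negative z in front of the camera), so the light position is moved into view space before testing.

```cpp
// Hypothetical minimal math types for illustration.
struct Vec3 { float x, y, z; };

// Assumed column-major 4x4 matrix: element (row r, column c) is m[c * 4 + r].
struct Mat4 { float m[16]; };

// Transform a point (w = 1) by the matrix, ignoring the resulting w.
inline Vec3 transformPoint(const Mat4& M, Vec3 p) {
    return {
        M.m[0] * p.x + M.m[4] * p.y + M.m[8]  * p.z + M.m[12],
        M.m[1] * p.x + M.m[5] * p.y + M.m[9]  * p.z + M.m[13],
        M.m[2] * p.x + M.m[6] * p.y + M.m[10] * p.z + M.m[14]
    };
}

// Squared distance from a point to the closest point of an axis-aligned box.
inline float sqDistPointAABB(Vec3 p, Vec3 bmin, Vec3 bmax) {
    auto axis = [](float v, float lo, float hi) {
        float d = 0.0f;
        if (v < lo) d = lo - v;
        else if (v > hi) d = v - hi;
        return d * d;
    };
    return axis(p.x, bmin.x, bmax.x)
         + axis(p.y, bmin.y, bmax.y)
         + axis(p.z, bmin.z, bmax.z);
}

// True when the light's sphere overlaps the cluster's view-space AABB.
inline bool lightTouchesCluster(const Mat4& view, Vec3 lightWorldPos, float radius,
                                Vec3 clusterMin, Vec3 clusterMax) {
    Vec3 center = transformPoint(view, lightWorldPos);  // world -> view space
    return sqDistPointAABB(center, clusterMin, clusterMax) <= radius * radius;
}
```

As a usage example, with a camera at (0, 0, 10) looking down negative z, the view matrix is a translation by (0, 0, -10). A light at world position (0, 0, 5) with radius 1 then sits at view-space z = -5 and overlaps a cluster spanning z from -6 to -4, while testing the untransformed world position against the same cluster fails.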

So at the moment I don't know exactly where the mistake is, and I'm still trying to fix it.
I know that your example is not written with LWJGL, though the structure is very similar. I would really appreciate any suggestions, because I don't know where else to look for information (I've asked many people in the GPU programming department at my university, but clustered lighting seems to be a fairly niche concept that not many people are aware of).

Thank you for the tutorial and any possible help! :)

Specs:
width: 1280
height: 720
zNear: 0.1
zFar: 10000
number of lights: 3
grid size: 12x12x24
OpenGL: 4.5

EDIT:
For better understanding, I output the albedo texture for all pixels where the light count is 0:
[screenshot: Screenshot (337)]

I then tried doing the lighting in a forward pass to check whether deriving the screen-space position from the global invocation ID is wrong. Apparently it is not: the same issue occurs when the lighting is done in the fragment shader of a forward pass.
