
wgpu-profiler


Simple profiler scopes for wgpu using timer queries

Features

  • Easy to use profiler scopes
    • Allows nesting!
    • Can be disabled by runtime flag
    • Additionally generates debug markers
  • Thread-safe - can profile several command encoders/buffers in parallel
  • Internally creates pools of timer queries automatically
    • Does not need to know in advance how many queries/profiling scopes are needed
    • Caches profiler frames until results are available
      • No stalling of the device at any time!
  • Many profiler instances can live side by side
  • Chrome trace (flamegraph) JSON export
  • Tracy integration (behind tracy feature flag)

How to use

Create a new profiler object:

use wgpu_profiler::{wgpu_profiler, GpuProfiler, GpuProfilerSettings};
// ...
let mut profiler = GpuProfiler::new(GpuProfilerSettings::default());

Now you can start creating profiler scopes:

// You can now open profiling scopes on any encoder or pass:
let mut scope = profiler.scope("name of your scope", &mut encoder, &device);

// Scopes can be nested arbitrarily!
let mut nested_scope = scope.scope("nested!", &device);

// Scopes on encoders can be used to easily create profiled passes!
let mut compute_pass = nested_scope.scoped_compute_pass("profiled compute", &device);

// Scopes expose the underlying encoder or pass they wrap:
compute_pass.set_pipeline(&pipeline);
// ...

// Scopes created this way are automatically closed when dropped.

GpuProfiler reads the device features on first use:

  • wgpu::Features::TIMESTAMP_QUERY is required to emit any timer queries.
    • Alone, this allows you to use timestamp writes on pass definition as done by Scope::scoped_compute_pass/Scope::scoped_render_pass
  • wgpu::Features::TIMESTAMP_QUERY_INSIDE_ENCODERS is required to issue queries at any point within encoders.
  • wgpu::Features::TIMESTAMP_QUERY_INSIDE_PASSES is required to issue queries at any point within passes.

wgpu-profiler needs to insert buffer copy commands, so when you're done with an encoder and won't open any more profiling scopes on it, you need to resolve the queries:

profiler.resolve_queries(&mut encoder);

And finally, to end a profiling frame, call end_frame. This does a few checks and will let you know if something is off!

profiler.end_frame().unwrap();

Retrieve the oldest available frame and write it out to a Chrome trace file:

if let Some(profiling_data) = profiler.process_finished_frame(queue.get_timestamp_period()) {
    wgpu_profiler::chrometrace::write_chrometrace(std::path::Path::new("mytrace.json"), &profiling_data);
}

To see it in action, check out the example project!

License

Licensed under either of

  • Apache License, Version 2.0
  • MIT license

at your option.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.

wgpu-profiler's People

Contributors

cwfitzgerald, dasetwas, davidster, icandivideby0, imberflur, jcapucho, vini-fda, waywardmonkeys, wumpf, xstrom, zoxc


wgpu-profiler's Issues

Scope guard types

The way the project I work on handles state transitions (e.g. between pipelines) while drawing with wgpu makes it infeasible to put things inside a scope macro. So I made the scope types below, which wrap the CommandEncoder/RenderPass and utilize Drop. Would something like these be useful in this crate, make more sense in a separate crate, or simply be created by users as needed?

pub struct Scope<'a, W: ProfilerCommandRecorder> {
    profiler: &'a mut GpuProfiler,
    wgpu_thing: &'a mut W,
}

pub struct OwningScope<'a, W: ProfilerCommandRecorder> {
    profiler: &'a mut GpuProfiler,
    wgpu_thing: W,
}

// Separate type since we can't destructure types that impl Drop :/
pub struct ManualOwningScope<'a, W: ProfilerCommandRecorder> {
    profiler: &'a mut GpuProfiler,
    wgpu_thing: W,
}

impl<'a, W: ProfilerCommandRecorder> Scope<'a, W> {
    pub fn start(
        profiler: &'a mut GpuProfiler,
        wgpu_thing: &'a mut W,
        device: &wgpu::Device,
        label: &str,
    ) -> Self {
        profiler.begin_scope(label, wgpu_thing, device);
        Self {
            profiler,
            wgpu_thing,
        }
    }

    /// Starts a scope nested within this one
    pub fn scope(&mut self, device: &wgpu::Device, label: &str) -> Scope<'_, W> {
        Scope::start(self.profiler, self.wgpu_thing, device, label)
    }
}

impl<'a, W: ProfilerCommandRecorder> OwningScope<'a, W> {
    pub fn start(
        profiler: &'a mut GpuProfiler,
        mut wgpu_thing: W,
        device: &wgpu::Device,
        label: &str,
    ) -> Self {
        profiler.begin_scope(label, &mut wgpu_thing, device);
        Self {
            profiler,
            wgpu_thing,
        }
    }

    /// Starts a scope nested within this one
    pub fn scope(&mut self, device: &wgpu::Device, label: &str) -> Scope<'_, W> {
        Scope::start(self.profiler, &mut self.wgpu_thing, device, label)
    }
}

impl<'a, W: ProfilerCommandRecorder> ManualOwningScope<'a, W> {
    pub fn start(
        profiler: &'a mut GpuProfiler,
        mut wgpu_thing: W,
        device: &wgpu::Device,
        label: &str,
    ) -> Self {
        profiler.begin_scope(label, &mut wgpu_thing, device);
        Self {
            profiler,
            wgpu_thing,
        }
    }

    /// Starts a scope nested within this one
    pub fn scope(&mut self, device: &wgpu::Device, label: &str) -> Scope<'_, W> {
        Scope::start(self.profiler, &mut self.wgpu_thing, device, label)
    }

    /// Ends the scope, allowing extraction of the owned wgpu thing
    /// and the mutable reference to the GpuProfiler
    pub fn end_scope(mut self) -> (W, &'a mut GpuProfiler) {
        self.profiler.end_scope(&mut self.wgpu_thing);
        (self.wgpu_thing, self.profiler)
    }
}
impl<'a> Scope<'a, wgpu::CommandEncoder> {
    /// Start a render pass wrapped in an OwnedScope
    pub fn scoped_render_pass<'b>(
        &'b mut self,
        device: &wgpu::Device,
        label: &str,
        pass_descriptor: &wgpu::RenderPassDescriptor<'b, '_>,
    ) -> OwningScope<'b, wgpu::RenderPass> {
        let render_pass = self.wgpu_thing.begin_render_pass(pass_descriptor);
        OwningScope::start(self.profiler, render_pass, device, label)
    }
}

impl<'a> OwningScope<'a, wgpu::CommandEncoder> {
    /// Start a render pass wrapped in an OwnedScope
    pub fn scoped_render_pass<'b>(
        &'b mut self,
        device: &wgpu::Device,
        label: &str,
        pass_descriptor: &wgpu::RenderPassDescriptor<'b, '_>,
    ) -> OwningScope<'b, wgpu::RenderPass> {
        let render_pass = self.wgpu_thing.begin_render_pass(pass_descriptor);
        OwningScope::start(self.profiler, render_pass, device, label)
    }
}

impl<'a> ManualOwningScope<'a, wgpu::CommandEncoder> {
    /// Start a render pass wrapped in an OwnedScope
    pub fn scoped_render_pass<'b>(
        &'b mut self,
        device: &wgpu::Device,
        label: &str,
        pass_descriptor: &wgpu::RenderPassDescriptor<'b, '_>,
    ) -> OwningScope<'b, wgpu::RenderPass> {
        let render_pass = self.wgpu_thing.begin_render_pass(pass_descriptor);
        OwningScope::start(self.profiler, render_pass, device, label)
    }
}

// Scope
impl<'a, W: ProfilerCommandRecorder> std::ops::Deref for Scope<'a, W> {
    type Target = W;

    fn deref(&self) -> &Self::Target { self.wgpu_thing }
}

impl<'a, W: ProfilerCommandRecorder> std::ops::DerefMut for Scope<'a, W> {
    fn deref_mut(&mut self) -> &mut Self::Target { self.wgpu_thing }
}

impl<'a, W: ProfilerCommandRecorder> Drop for Scope<'a, W> {
    fn drop(&mut self) { self.profiler.end_scope(self.wgpu_thing); }
}

// OwningScope
impl<'a, W: ProfilerCommandRecorder> std::ops::Deref for OwningScope<'a, W> {
    type Target = W;

    fn deref(&self) -> &Self::Target { &self.wgpu_thing }
}

impl<'a, W: ProfilerCommandRecorder> std::ops::DerefMut for OwningScope<'a, W> {
    fn deref_mut(&mut self) -> &mut Self::Target { &mut self.wgpu_thing }
}

impl<'a, W: ProfilerCommandRecorder> Drop for OwningScope<'a, W> {
    fn drop(&mut self) { self.profiler.end_scope(&mut self.wgpu_thing); }
}

// ManualOwningScope
impl<'a, W: ProfilerCommandRecorder> std::ops::Deref for ManualOwningScope<'a, W> {
    type Target = W;

    fn deref(&self) -> &Self::Target { &self.wgpu_thing }
}

impl<'a, W: ProfilerCommandRecorder> std::ops::DerefMut for ManualOwningScope<'a, W> {
    fn deref_mut(&mut self) -> &mut Self::Target { &mut self.wgpu_thing }
}

Add an easy way to print a profiler frame

Users have to do that manually right now if they don't use chrome trace.

Also, the readme could really use some of that; people coming in have no idea what data to expect!

Race condition when dropping frames

There's currently a problem with the code for dropping frames that can cause wgpu to crash. It happens when the user application submits a frame that stalls long enough for a new frame to also be queued; if the new frame uses more pools than the previous one, the previous frame's pools get dropped.

wgpu-profiler/src/lib.rs

Lines 322 to 332 in 637877c

fn reset_and_cache_unused_query_pools(&mut self, mut query_pools: Vec<QueryPool>) {
    // If a pool was less than half of the size of the max frame, then we don't keep it.
    // This way we're going to need less pools in upcoming frames and thus have less overhead in the long run.
    let capacity_threshold = self.size_for_new_query_pools / 2;
    for mut pool in query_pools.drain(..) {
        pool.reset();
        if pool.capacity >= capacity_threshold {
            self.unused_pools.push(pool);
        }
    }
}
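To make the failure mode concrete, here is a minimal model of that caching policy (all names hypothetical, not the crate's actual code), showing how a raised threshold evicts pools that a still-in-flight frame may reference:

```rust
/// Simplified model of a query pool; `in_flight` marks pools that a
/// submitted-but-unfinished frame still references on the GPU.
struct Pool {
    capacity: u32,
    in_flight: bool,
}

/// Mirrors the policy quoted above: keep pools whose capacity is at least
/// half the current new-pool size, destroy the rest -- regardless of
/// whether the GPU is still using them.
fn cache_unused_pools(pools: Vec<Pool>, size_for_new_query_pools: u32) -> (Vec<Pool>, Vec<Pool>) {
    let capacity_threshold = size_for_new_query_pools / 2;
    pools.into_iter().partition(|p| p.capacity >= capacity_threshold)
}

fn main() {
    // 1st frame's pools were sized when the threshold was 24...
    let frame1_pools = vec![Pool { capacity: 24, in_flight: true }];
    // ...but the 2nd frame raised size_for_new_query_pools, so the
    // threshold is now 38 and the in-flight pool is destroyed.
    let (kept, destroyed) = cache_unused_pools(frame1_pools, 76);
    assert!(kept.is_empty());
    assert!(destroyed[0].in_flight); // destroyed while the GPU still uses it
}
```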

So the timeline is something like:

1st frame submitted
2nd frame submitted
1st frame pools dropped
1st frame tries to finish

this results in the following error from wgpu:

PanicInfo: panicked at 'assertion failed: `(left == right)`
  left: `2`,
 right: `3`: Buffer[6][2] is no longer alive (ID: (6, 2, Vulkan))', wgpu/wgpu-core/src/hub.rs:371:9

I've also added some prints to the profiler code to better show the problem:

frame finished # The frame before the problematic frame finished without a problem
# We process the finished frame which causes the cache code to be invoked
# but the threshold isn't high enough to drop pools
Threshold: 24
# Queuing the new frame
# Buffer of pool 1 which has problems
Resolving to: Buffer { context: Context { type: "Native" }, id: ObjectId { id: Some(2305843017803628550) }, data: Any { .. }, map_context: Mutex { data: MapContext { total_size: 256, initial_range: 0..0, sub_ranges: [] } }, size: 256, usage: MAP_READ | COPY_DST }
# Buffer of pool 2 which has problems
Resolving to: Buffer { context: Context { type: "Native" }, id: ObjectId { id: Some(2305843013508661271) }, data: Any { .. }, map_context: Mutex { data: MapContext { total_size: 256, initial_range: 0..0, sub_ranges: [] } }, size: 256, usage: MAP_READ | COPY_DST }
Ending frame # The 1st frame is submitted and `end_frame` is called
# We start the 2nd frame
Resolving to: Buffer { context: Context { type: "Native" }, id: ObjectId { id: Some(2305843017803628574) }, data: Any { .. }, map_context: Mutex { data: MapContext { total_size: 384, initial_range: 0..0, sub_ranges: [] } }, size: 384, usage: MAP_READ | COPY_DST }
Resolving to: Buffer { context: Context { type: "Native" }, id: ObjectId { id: Some(2305843017803628577) }, data: Any { .. }, map_context: Mutex { data: MapContext { total_size: 608, initial_range: 0..0, sub_ranges: [] } }, size: 608, usage: MAP_READ | COPY_DST }
Ending frame # The 2nd frame is submitted
Dropping frame # Since the profiler was configured with only 1 max pending frame and the 1st frame hasn't finished the 1st frame is dropped
# This causes the cache code to run
Threshold: 38 # The 2nd frame uses more query pools causing `capacity_threshold` to increase
Destroying: Buffer { context: Context { type: "Native" }, id: ObjectId { id: Some(2305843017803628550) }, data: Any { .. }, map_context: Mutex { data: MapContext { total_size: 256, initial_range: 0..0, sub_ranges: [] } }, size: 256, usage: MAP_READ | COPY_DST } # Buffer of pool 1 doesn't pass the threshold
Destroying: Buffer { context: Context { type: "Native" }, id: ObjectId { id: Some(2305843013508661271) }, data: Any { .. }, map_context: Mutex { data: MapContext { total_size: 256, initial_range: 0..0, sub_ranges: [] } }, size: 256, usage: MAP_READ | COPY_DST } # Buffer of pool 2 doesn't pass the threshold
# Somewhere after the 1st frame finishes which causes wgpu to try transition the buffers

I also tried to come up with a small reproduction case based on the other dropped-frames test, but I couldn't find a way to simulate a stalled frame:

async fn handle_dropped_frames_pool_increase_gracefully_async() {
    let instance = wgpu::Instance::new(wgpu::InstanceDescriptor::default());
    let adapter = instance.request_adapter(&wgpu::RequestAdapterOptions::default()).await.unwrap();
    let (device, queue) = adapter
        .request_device(
            &wgpu::DeviceDescriptor {
                features: wgpu::Features::TIMESTAMP_QUERY,
                ..Default::default()
            },
            None,
        )
        .await
        .unwrap();

    // max_num_pending_frames is one!
    let mut profiler = wgpu_profiler::GpuProfiler::new(1, queue.get_timestamp_period(), device.features());

    // Two frames without device poll, causing the profiler to drop a frame on the second round.
    {
        let mut encoder = device.create_command_encoder(&wgpu::CommandEncoderDescriptor::default());
        {
            let _ = wgpu_profiler::scope::Scope::start("testscope", &mut profiler, &mut encoder, &device);
        }
        profiler.resolve_queries(&mut encoder);

        queue.submit(std::iter::once(encoder.finish()));
        queue.on_submitted_work_done(|| {
            println!("done1");
        });

        profiler.end_frame().unwrap();

        // We haven't done a device poll, so there can't be a result!
        assert!(profiler.process_finished_frame().is_none());
    }
    {
        let mut encoder = device.create_command_encoder(&wgpu::CommandEncoderDescriptor::default());
        {
            let mut root = wgpu_profiler::scope::Scope::start("rootscope", &mut profiler, &mut encoder, &device);
            for _ in 0..32 {
                let _ = root.scope("nestedscope", &device);
            }
        }
        profiler.resolve_queries(&mut encoder);

        queue.submit(std::iter::once(encoder.finish()));
        queue.on_submitted_work_done(|| {
            println!("done2");
        });

        profiler.end_frame().unwrap();

        // We haven't done a device poll, so there can't be a result!
        assert!(profiler.process_finished_frame().is_none());
    }

    // Poll to explicitly trigger mapping callbacks.
    device.poll(wgpu::Maintain::Wait);

    // A single (!) frame should now be available.
    assert!(profiler.process_finished_frame().is_some());
    assert!(profiler.process_finished_frame().is_none());
}

Make use of pass timer queries

wgpu 0.18 supports timer queries directly on passes which don't need the INSIDE_PASSES timer feature. wgpu-profiler should have first class support for this!

scoped_render_pass and scoped_compute_pass don't need `#[must_use]`

It's hard to forget to use a render pass if you need it for something, and some passes (e.g. for clearing a texture) can be immediately dropped. Since these functions don't wrap a borrow of an existing value, it isn't possible to accidentally drop them and use the original wrapped value.

Broken wasm build due to std::process::id function call, is this a bug?

The WASM build is broken in my project since upgrading wgpu-profiler. I guess it's been 'broken' since #30.

I'm not sure how you would like to proceed with this. On the one hand, I could probably find a way to never call wgpu-profiler functions from wasm builds. On the other hand, it was pretty nice being able to compile to web with no extra work required.

What do you think?

Show current thread id and process id

Currently, the PID and TID are hardcoded to the constant "1":

fn write_results_recursive(file: &mut File, result: &GpuTimerScopeResult, last: bool) -> std::io::Result<()> {
    write!(
        file,
        r#"{{ "pid":1, "tid":1, "ts":{}, "dur":{}, "ph":"X", "name":"{}" }}{}"#,
        result.time.start * 1000.0 * 1000.0,
        (result.time.end - result.time.start) * 1000.0 * 1000.0,
        result.label,
        if last && result.nested_scopes.is_empty() { "\n" } else { ",\n" }
    )?;
    if result.nested_scopes.is_empty() {
        return Ok(());
    }

    for child in result.nested_scopes.iter().take(result.nested_scopes.len() - 1) {
        write_results_recursive(file, child, false)?;
    }
    write_results_recursive(file, result.nested_scopes.last().unwrap(), last)?;

    Ok(())
    // { "pid":1, "tid":1, "ts":546867, "dur":121564, "ph":"X", "name":"DoThings"
}

They should be changed to output the proper process and thread id.

Suggestion: use std::process::id() for the PID and std::thread::current().id() for the TID. The only problem is that the ThreadId type in the standard library cannot be trivially converted to a primitive type (int/uint/usize) without hacks. Also, ThreadIds are under the control of Rust's standard library, and there may not be any relationship between a ThreadId and the underlying platform's notion of a thread identifier (see the official docs). There's a proposal for stabilizing a u64 conversion, though (see this tracking issue).
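As a hedged sketch of that suggestion (the helper name is hypothetical), the PID can come straight from std::process::id(), while the TID can be derived by hashing the ThreadId until a stable integer conversion lands. The hashed value is stable within a run but has no OS-level meaning:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical helper: numeric (pid, tid) for chrome trace events.
/// Hashing `ThreadId` is a workaround until `ThreadId::as_u64` stabilizes.
fn trace_ids() -> (u32, u64) {
    let pid = std::process::id();
    let mut hasher = DefaultHasher::new();
    std::thread::current().id().hash(&mut hasher);
    (pid, hasher.finish())
}

fn main() {
    let (pid, tid) = trace_ids();
    // Same thread, same run: the derived tid is stable.
    let (_, tid_again) = trace_ids();
    assert!(pid > 0);
    assert_eq!(tid, tid_again);
}
```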

Add puffin integration

It should be possible to have both CPU & GPU traces be shown in sync in Puffin 🤔

(Add an example screenshot of that to readme if it works out!)

Integration with tracy_client

Now that Tracy has proper C API support for GPU timestamps, it would be a great extension to this library to offer a tracy feature where the timestamps collected by wgpu-profiler are reported to Tracy.

I am planning on working on this in a while but I wanted to clear it with you before I did.

Profiler can't be used across threads or with interleaved scopes

By design, the user is currently forced to use one profiler per thread. In fact, even working with two command encoders on a single thread in an interleaved fashion is not possible right now, since every call to end_scope needs its corresponding call to begin_scope to be in the correct order.

begin_scope needs to return a handle that end_scope works with instead. Also, both functions need to work with interior mutability, each requiring &GpuProfiler instead of &mut GpuProfiler.
I believe forcing &mut GpuProfiler for all other methods (query resolve, frame processing, etc.) still makes sense, though, since this greatly simplifies things for both the implementation and potential error cases.
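A minimal sketch of what such a handle-based design could look like (all names hypothetical, not the crate's actual API), using a Mutex for interior mutability so scopes on different encoders can interleave freely:

```rust
use std::sync::Mutex;

/// Handle returned by `begin_scope`, consumed by `end_scope`.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct ScopeHandle(usize);

#[derive(Default)]
pub struct Profiler {
    // (label, closed) per opened scope; guarded so `&self` suffices.
    scopes: Mutex<Vec<(String, bool)>>,
}

impl Profiler {
    /// Note `&self`: interior mutability instead of `&mut self`.
    pub fn begin_scope(&self, label: &str) -> ScopeHandle {
        let mut scopes = self.scopes.lock().unwrap();
        scopes.push((label.to_string(), false));
        ScopeHandle(scopes.len() - 1)
    }

    /// Closing by handle means scopes may end in any order.
    pub fn end_scope(&self, handle: ScopeHandle) {
        self.scopes.lock().unwrap()[handle.0].1 = true;
    }

    pub fn all_closed(&self) -> bool {
        self.scopes.lock().unwrap().iter().all(|(_, closed)| *closed)
    }
}

fn main() {
    let profiler = Profiler::default();
    let a = profiler.begin_scope("encoder_a");
    let b = profiler.begin_scope("encoder_b");
    profiler.end_scope(a); // interleaved: a closes before b, no ordering constraint
    profiler.end_scope(b);
    assert!(profiler.all_closed());
}
```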

Allow custom payloads for Profiler Frame and Scopes

See also #49

Making the profiler generic over two payload types doesn't seem too outrageous (with both defaulting to ()). Need to figure out how to keep things ergonomic for users that don't want to supply payloads to their scopes & frames, though!
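A rough sketch of the generics involved (hypothetical and heavily simplified): default type parameters keep the type unchanged for users who don't care about payloads:

```rust
/// Hypothetical shape: both payload types default to `()`, so existing
/// users keep writing `GpuProfiler` with no extra annotations.
#[derive(Default)]
struct GpuProfiler<FramePayload = (), ScopePayload = ()> {
    finished_frames: Vec<(FramePayload, Vec<ScopePayload>)>,
}

impl<F, S> GpuProfiler<F, S> {
    fn push_frame(&mut self, frame: F, scopes: Vec<S>) {
        self.finished_frames.push((frame, scopes));
    }
}

fn main() {
    // Existing users: no payloads, defaults kick in.
    let mut plain: GpuProfiler = GpuProfiler::default();
    plain.push_frame((), vec![(), ()]);

    // Opt-in payloads, e.g. a frame index and per-scope byte counts.
    let mut rich: GpuProfiler<u64, usize> = GpuProfiler::default();
    rich.push_frame(42, vec![128, 256]);
    assert_eq!(rich.finished_frames[0].0, 42);
}
```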

Negative profile measurements on MacOS

I am seeing that sometimes query_result.time.end - query_result.time.start gives me a negative value on mac. Have you seen this before? I could create a mini repo that reproduces the problem if you'd like :).
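As a hedged workaround sketch on the consumer side (not a fix for the underlying driver behavior), durations can be clamped so a non-monotonic timestamp pair doesn't produce a negative value:

```rust
/// Hypothetical guard: treat a non-monotonic timestamp pair as zero
/// duration instead of reporting a negative time.
fn scope_duration(start: f64, end: f64) -> f64 {
    (end - start).max(0.0)
}

fn main() {
    assert_eq!(scope_duration(3.0, 5.0), 2.0); // normal case
    assert_eq!(scope_duration(5.0, 3.0), 0.0); // clamped, not negative
}
```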

wgpu 14?

What's the time frame for this getting updated to wgpu 14?
