
collective_profiler's People

Contributors

cjlegg, gvallee


collective_profiler's Issues

Call selection for comparison

Enable the capability to select a call and compare others to it. For instance, be able to select 1->N individual calls and highlight the calls that differ in nature. At first, do not focus on actual values, just patterns.

Write a test/validation script

We now have a good set of tests and we know exactly what output is expected. We should therefore have a tool that checks that everything is correct while executing these tests with a pre-defined number of ranks.

Create a rank file with profiling data

We need to track which rank runs where. I personally believe it would be beneficial to track PIDs to correlate processes to ranks across different communicators.
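A minimal sketch of what one rank-file entry could look like. The field names and file layout here are assumptions, not the project's actual format; in the MPI wrapper one would pass the values obtained from MPI_Comm_rank() and getpid().

```c
#include <stdio.h>
#include <string.h>
#include <assert.h>

/* Hypothetical helper: format one entry of a rank file mapping a
 * communicator-local rank to its world rank, PID and host. All names
 * and the layout are illustrative; the real format is not defined yet. */
static int format_rank_entry(char *buf, size_t len, int comm_id,
                             int comm_rank, int world_rank,
                             long pid, const char *host)
{
    return snprintf(buf, len,
                    "comm=%d rank=%d world_rank=%d pid=%ld host=%s\n",
                    comm_id, comm_rank, world_rank, pid, host);
}
```

Recording the PID alongside both the communicator-local and world rank is what would let the post-processing tools correlate the same process across communicators.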

Add pattern-level bin capability

Right now we save the bins' data separately because we do not have a good way at the moment to mix bins and patterns (bins are specific to a count file, not a call; we could change that, but it would take time).
We could then update the analysis of sub-communicators' results to precisely detail the bins for given patterns.

User process for using the post processing tools

The steps a user must take to process the captured data are not documented. There is a rendered graph (.png) in the doc directory showing the data sets and programs, with arrows showing which data sets are inputs and outputs of each program. However, the middle row has some files that are supposedly the output of two different programs. If that is indeed correct, the user will be puzzled as to how that can be so and will ask which program they have to run first to ensure that the correct final output is generated.

Save data in the context of the .so destructor

Since we cannot make any assumption that MPI_Finalize() is called by the application, it might be better to rely on the shared library destructor:

/* Runs when the shared library is unloaded, even if the application
   never calls MPI_Finalize(). */
void __attribute__((destructor)) exit_handler(void);

void exit_handler(void)
{
	log_data(...);
}

Profiling tool - step 5 is long running

This step (step 5) generates one graph per call that was captured, so this can be O(10,000) graphs and takes hours. It should therefore be parallelised, or consideration should be given to whether all the graphs need to be precalculated; they could instead be generated on the fly when a user wishes to view one. (A detail: generation may happen in two steps, a data-calculation step followed by a render step; I do not know which one takes the time.)

'profile' fails when executed against 'alltoallv_f'

$ ../tools/cmd/profile/profile -dir .
* Step 1/5: analyzing counts...
Reading count files: 2/2
Analyzing alltoallv calls: 2/2
Bin creation: 2/2
Step completed in 3.433329ms

* Step 2/5: analyzing MPI communicator data...
Step completed in 110.164µs

* Step 3/5: create maps...
Gathering map data: 2/2
Step completed in 1.439437ms

* Step 4/5: analyzing timing files...
Step completed in 308.865µs

* Step 5/5: generating plots...
Plotting data for alltoallv calls: 1/1
panic: runtime error: index out of range [0] with length 0

goroutine 1 [running]:
github.com/gvallee/alltoallv_profiling/tools/internal/pkg/plot.write(0xc00000e158, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x529167, ...)
	/home/gvallee/src/alltoall_profiling/tools/internal/pkg/plot/plot.go:350 +0x11a6
github.com/gvallee/alltoallv_profiling/tools/internal/pkg/plot.generateCallPlotScript(0x7fff1b387650, 0x1, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/home/gvallee/src/alltoall_profiling/tools/internal/pkg/plot/plot.go:388 +0x4aa
github.com/gvallee/alltoallv_profiling/tools/internal/pkg/plot.generateCallDataFiles(0x7fff1b387650, 0x1, 0x7fff1b387650, 0x1, 0x0, 0x0, 0xc000010c30, 0xc000010d50, 0xc000010e40, 0xc0000112c0, ...)
	/home/gvallee/src/alltoall_profiling/tools/internal/pkg/plot/plot.go:311 +0x566
github.com/gvallee/alltoallv_profiling/tools/internal/pkg/plot.CallsData(0x7fff1b387650, 0x1, 0x7fff1b387650, 0x1, 0x0, 0x0, 0xc000010c30, 0xc000010d50, 0xc000010e40, 0xc0000112c0, ...)
	/home/gvallee/src/alltoall_profiling/tools/internal/pkg/plot/plot.go:444 +0xb4
main.plotCallsData(0x7fff1b387650, 0x1, 0xc000046810, 0x1, 0x1, 0xc000010ae0, 0xc000010ab0, 0xc000011230, 0xc000011260, 0x0, ...)
	/home/gvallee/src/alltoall_profiling/tools/cmd/profile/profile.go:37 +0x2a2
main.main()
	/home/gvallee/src/alltoall_profiling/tools/cmd/profile/profile.go:134 +0x13b9

Write a parser for receive counts

We only have a parser for send counts; we need the same type of parser for the receive counts to have a fully featured validation tool. Required by #5.

Compress the send/recv arrays for a specific call

Right now, we can compare whether the send/recv counts for two calls are the same and, if so, we track which calls are associated with the counts instead of duplicating them. The same should be done within a call: when two ranks have the same counts, save which ranks have that specific count instead of duplicating it.
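The within-call deduplication could be sketched as follows; names are illustrative, not the profiler's actual API. Each rank gets a group ID such that ranks with byte-identical count arrays share a group, so the profiler would store one counts array per group plus the list of member ranks.

```c
#include <string.h>
#include <assert.h>

/* Illustrative sketch: counts is an nranks x ncounts matrix (row per
 * rank). On return, group_of_rank[r] is the group of rank r; ranks with
 * identical rows share a group. Returns the number of distinct groups. */
static int group_identical_counts(const int *counts, int nranks, int ncounts,
                                  int *group_of_rank)
{
    int ngroups = 0;
    for (int r = 0; r < nranks; r++) {
        int g = -1;
        for (int p = 0; p < r; p++) {
            if (memcmp(&counts[p * ncounts], &counts[r * ncounts],
                       ncounts * sizeof(int)) == 0) {
                g = group_of_rank[p];   /* reuse the earlier rank's group */
                break;
            }
        }
        group_of_rank[r] = (g >= 0) ? g : ngroups++;
    }
    return ngroups;
}
```

The quadratic scan is fine for illustration; a hash of each row would scale better for large communicators.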

Profiling tool - step 5 - clarity of text output

The progress message during step 5 reads "* Step 5/5: generating plots... Plotting data for alltoallv calls: 2110/1", but it is not clear what "/1" means. (2110 is the number of the captured call being rendered and is updated as each one is rendered.)

Find a way to get all the data when MPI_Finalize is never called

I am dealing with an app that does not call MPI_Finalize(). As a result, the profiling data is never written and the profile files end up empty. We need a way to force a dump of all the data during a specific alltoallv call, and to document how users can check whether MPI_Finalize() is actually called; if it is not, they can find the number of alltoallv calls and set the library to dump the data at the end of the last one.
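One possible escape hatch, sketched below: an environment variable telling the library at which alltoallv call to dump. The variable name is hypothetical; the wrapper would check this at the end of every intercepted call.

```c
#include <stdlib.h>
#include <assert.h>

/* Sketch (env variable name is an assumption): if A2A_PROFILER_DUMP_AFTER
 * is set to N, the wrapper dumps all collected data at the end of the
 * N-th alltoallv call instead of waiting for MPI_Finalize().
 * Returns 1 when a dump should happen now. */
static int should_dump_now(int call_count)
{
    const char *s = getenv("A2A_PROFILER_DUMP_AFTER");
    if (s == NULL)
        return 0;                 /* default: dump from MPI_Finalize() */
    long n = strtol(s, NULL, 10);
    return n > 0 && call_count == n;
}
```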

Validation tool: run OSU alltoallv test

The alltoallv test from the OSU benchmarks is a good step up from the examples, so we should also use it for validation.
This is the high-level issue for such an integration; sub-issues may come later once work starts.

Extend scaling to set the scale upfront

We need to support the following workflow:

  1. Calculate the scaled amount of data sent and received.

  2. From there, figure out the scale of the bandwidth.

  3. Update the bandwidth data with the scale.

In other words, we need to be able to set the scale and then update the data, which the scale package does not currently support.
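The scale package itself is Go, but the upfront-scale idea can be sketched language-agnostically in C: pick one unit from the maximum value of the series, then apply the same divisor to every data point so the whole series shares a single scale. Unit table and function name are illustrative.

```c
#include <string.h>
#include <assert.h>

/* Sketch: choose a bandwidth unit from the maximum value and report the
 * divisor to apply to every point in the series. */
static const char *pick_scale(double max_bytes_per_sec, double *divisor)
{
    static const char *units[] = { "B/s", "KB/s", "MB/s", "GB/s" };
    int i = 0;
    double d = 1.0;
    while (i < 3 && max_bytes_per_sec / d >= 1000.0) {
        d *= 1000.0;
        i++;
    }
    *divisor = d;
    return units[i];
}
```

Applying one divisor chosen from the maximum, rather than scaling each point independently, is what keeps the rescaled series comparable.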

Write tool that allows me to merge traces

I now rely on the capability to split the tracing into chunks: profile calls 0-999, then 1000-1999, and so on. So now I need a tool that merges all these traces back together. Aside from the trace size, it should be easy: start with the first file (0-999) as is and add the other files one by one. When parsing a file, extract the counters and do a string comparison. If the pattern already exists, increment the call counter and add the call ID to the list of calls associated with the pattern.
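The merge step described above could look like the sketch below. The struct and function names are hypothetical (the real tooling is Go), and the file parsing is left out; this only shows the string-comparison merge of one entry into the table built from the first chunk.

```c
#include <string.h>
#include <stdlib.h>
#include <assert.h>

#define MAX_CALL_IDS 1024   /* illustrative bound */

/* Hypothetical in-memory form of one compressed-trace entry: the counter
 * block as a string, plus the list of call IDs that matched it. */
struct pattern {
    char *counters;
    int ncalls;
    int call_ids[MAX_CALL_IDS];
};

/* Merge one entry from a later chunk (e.g. calls 1000-1999): if the
 * counter string already exists, extend its call list; otherwise append
 * a new pattern. Returns the index of the pattern used. */
static int merge_pattern(struct pattern *table, int *npatterns,
                         const char *counters, int call_id)
{
    for (int i = 0; i < *npatterns; i++) {
        if (strcmp(table[i].counters, counters) == 0) {
            table[i].call_ids[table[i].ncalls++] = call_id;
            return i;
        }
    }
    int i = (*npatterns)++;
    size_t n = strlen(counters) + 1;
    table[i].counters = malloc(n);
    memcpy(table[i].counters, counters, n);
    table[i].ncalls = 1;
    table[i].call_ids[0] = call_id;
    return i;
}
```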

Add the format version to the metadata of the files generated by the profiler

A typical issue when looking at data after the fact is knowing which format was used. Without it, it is difficult to know what version of the tool is required to do post-mortem analysis. We already have a file that tracks the version of the data format, but we do not use it when generating data. We need to include it in the generated files; it would be okay to just include it in the file name for now.
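Embedding the version in the file name could be as simple as the sketch below. The version macro, file-name pattern, and helper name are all assumptions for illustration, not the profiler's actual naming scheme.

```c
#include <stdio.h>
#include <string.h>
#include <assert.h>

#define PROFILER_FORMAT_VERSION 3   /* hypothetical version number */

/* Sketch: bake the data-format version into the output file name so a
 * reader knows which tool version can parse the file post-mortem. */
static int profile_filename(char *buf, size_t len, const char *prefix,
                            int jobid, int rank)
{
    return snprintf(buf, len, "%s_fmt%d_job%d_rank%d.txt",
                    prefix, PROFILER_FORMAT_VERSION, jobid, rank);
}
```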

Enable loading maps from files

Maps are saved to files but we cannot load them from files, resulting in a lot of computation when we bring the WebUI up.

Update getbins to deal with sub-communicators

Right now, the assumption is that we do not deal with alltoallv calls on sub-communicators; otherwise we would need to deal with the jobid and rank when calling SaveBins(). For now we set them to -1. Once this is fixed, we can simplify the code of SaveBins() by removing the code specific to the case where the jobid and rank are not provided.

Do not re-create bins if not necessary

At the moment, every time we start the server, the bins are recreated, which takes time with a large dataset. Avoid this when the resulting files already exist.

Fix the management of unique ID

Right now, we have a Bcast to send a unique ID to all ranks to make it easier to identify files. Use SLURM_JOB_ID instead, and document that even when Slurm is not used, the user is responsible for setting it in order to identify files.
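A minimal sketch of the replacement: every rank derives the ID from the environment instead of receiving it via Bcast. SLURM_JOB_ID is real; the fallback behaviour and function name are assumptions.

```c
#include <stdlib.h>
#include <assert.h>

/* Sketch: derive the unique job ID from the environment instead of
 * broadcasting one. Returns -1 when SLURM_JOB_ID is unset, in which
 * case the caller should warn that the user must set it themselves
 * (as the documentation would state for non-Slurm runs). */
static long get_job_id(void)
{
    const char *s = getenv("SLURM_JOB_ID");
    if (s == NULL)
        return -1;
    return strtol(s, NULL, 10);
}
```

Since the environment is identical on every rank of a Slurm job, all ranks derive the same ID without any communication.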

Dynamically increase the array to track calls

Right now we have a static size of MAX_TRACKED_CALLS when tracking calls in the context of counts. If we reach the max, we display a message, but we should switch to dynamic arrays that grow automatically.
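The usual pattern for this is capacity doubling via realloc, sketched below; the function name and initial capacity are illustrative, not the profiler's code.

```c
#include <stdlib.h>
#include <assert.h>

/* Sketch of replacing the fixed MAX_TRACKED_CALLS array with a
 * grow-on-demand buffer: double the capacity whenever it is full.
 * Returns the (possibly moved) array, or NULL on allocation failure,
 * in which case the original array is left intact. */
static int *ensure_capacity(int *calls, int used, int *capacity)
{
    if (used < *capacity)
        return calls;                       /* room left, nothing to do */
    int newcap = (*capacity == 0) ? 64 : *capacity * 2;
    int *p = realloc(calls, (size_t)newcap * sizeof(int));
    if (p != NULL)
        *capacity = newcap;
    return p;
}
```

Doubling keeps the amortized cost of appends constant, so the existing warning message can be dropped entirely once this is in place.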
