gvallee / collective_profiler
License: BSD 3-Clause "New" or "Revised" License
Enable the capability to select a call and compare others to it. For instance, be able to select 1->N individual calls and highlight the calls that differ in nature. At first, do not focus on actual values, just patterns.
We now have a good set of tests and we know exactly what output is expected. We should therefore have a tool that checks that everything is correct while executing these tests with a pre-defined number of ranks.
Data size: add the total, avg, min, and max. Right now we only have counts.
Check that the number of calls in the file matches the number of calls specified in the header.
We need timing data for late arrival and time spent in alltoallv calls. When a rank is late, extract counters. Requires #1
For random ranks/calls, extract counters and later check whether the counter files provide the exact same data.
Requires #1
We need to track which rank runs where. I personally believe it would be beneficial to track PIDs to correlate processes to ranks across different communicators.
Right now we save the bins' data separately because we do not have a good way at the moment to mix bins and patterns (bins are specific to a count file, not a call; we could change that but it would take time).
We could then update the analysis of sub-communicators' results to precisely detail bins for given patterns.
The steps the user must take to process the captured data are not documented. There is a rendered graph (.png) in the doc directory showing the data sets and programs, with arrows showing which data sets are inputs and outputs of each program. However, the middle row has some files that are supposedly the output of two different programs. If that is indeed correct, the user will be puzzled how that can be and will ask which program they have to run first to ensure that the correct final output is generated.
Since we cannot assume that MPI_Finalize() is called by the application, it might be better to rely on the shared library destructor:

void __attribute__((destructor)) exit_handler(void);

void exit_handler(void)
{
    log_data(...);
}
This step (step 5) generates one graph per call that was captured, so this can be O(10,000) graphs and takes hours to generate. It should therefore be parallelised, or consideration should be given to whether all the graphs need to be precalculated; they could instead be generated on the fly when a user wishes to view one. (A detail: the generation may be in two steps; if there is a data calculation step followed by a render step, I do not know which takes the time.)
All bin files seem to contain only 0.
Create a validation test.
Fix the problems.
Write a tool that extracts, from the counter file(s), the counters for a specific rank and alltoallv call.
This will be used for validation.
Requires #70
Once the format version is added to the meta-data, we can check if the current version of the post-mortem tool can handle it.
The code that handles counts and tracks counts across many calls seems to create problems. I need a good test to track this down. I label it as a bug because it is needed to track down a bug.
$ ../tools/cmd/profile/profile -dir .
* Step 1/5: analyzing counts...
Reading count files: 2/2
Analyzing alltoallv calls: 2/2
Bin creation: 2/2
Step completed in 3.433329ms
* Step 2/5: analyzing MPI communicator data...
Step completed in 110.164µs
* Step 3/5: create maps...
Gathering map data: 2/2
Step completed in 1.439437ms
* Step 4/5: analyzing timing files...
Step completed in 308.865µs
* Step 5/5: generating plots...
Plotting data for alltoallv calls: 1/1
panic: runtime error: index out of range [0] with length 0
goroutine 1 [running]:
github.com/gvallee/alltoallv_profiling/tools/internal/pkg/plot.write(0xc00000e158, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x529167, ...)
/home/gvallee/src/alltoall_profiling/tools/internal/pkg/plot/plot.go:350 +0x11a6
github.com/gvallee/alltoallv_profiling/tools/internal/pkg/plot.generateCallPlotScript(0x7fff1b387650, 0x1, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/home/gvallee/src/alltoall_profiling/tools/internal/pkg/plot/plot.go:388 +0x4aa
github.com/gvallee/alltoallv_profiling/tools/internal/pkg/plot.generateCallDataFiles(0x7fff1b387650, 0x1, 0x7fff1b387650, 0x1, 0x0, 0x0, 0xc000010c30, 0xc000010d50, 0xc000010e40, 0xc0000112c0, ...)
/home/gvallee/src/alltoall_profiling/tools/internal/pkg/plot/plot.go:311 +0x566
github.com/gvallee/alltoallv_profiling/tools/internal/pkg/plot.CallsData(0x7fff1b387650, 0x1, 0x7fff1b387650, 0x1, 0x0, 0x0, 0xc000010c30, 0xc000010d50, 0xc000010e40, 0xc0000112c0, ...)
/home/gvallee/src/alltoall_profiling/tools/internal/pkg/plot/plot.go:444 +0xb4
main.plotCallsData(0x7fff1b387650, 0x1, 0xc000046810, 0x1, 0x1, 0xc000010ae0, 0xc000010ab0, 0xc000011230, 0xc000011260, 0x0, ...)
/home/gvallee/src/alltoall_profiling/tools/cmd/profile/profile.go:37 +0x2a2
main.main()
/home/gvallee/src/alltoall_profiling/tools/cmd/profile/profile.go:134 +0x13b9
We only have a parser for send counts and we need the same type of parser for the receive counts to have a fully featured validation tool. Required by #5
Right now, we can compare whether the send/recv counts for two calls are the same and, if so, we track which calls are associated with the counts instead of duplicating them. The same should be done within a call: when two ranks have the same counts, save which ranks have that specific count instead of duplicating it.
The progress message during step 5 reads "* Step 5/5: generating plots... Plotting data for alltoallv calls: 2110/1", but it is not clear what "/1" means. (The 2110 is the number of the captured call being rendered and is updated as each one is rendered.)
I am dealing with an app that does not call MPI_Finalize(). As a result, the profiling data is never written to the profile files, which therefore end up being empty. We need a way to force the dump of all the data during a specific alltoallv call, and to document how users can check whether MPI_Finalize() is actually called and, if not, find the number of alltoallv calls and set the library to dump the data at the end of the last alltoallv call.
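A minimal sketch of the forced-dump mechanism described above, assuming the library counts its own alltoallv interceptions. The environment variable name (A2A_DUMP_AFTER) and the function name are illustrative assumptions, not the library's actual API:

```c
#include <stdlib.h>

/* Hypothetical sketch: count intercepted alltoallv calls and decide
 * when to dump. A2A_DUMP_AFTER (name is an assumption) holds the
 * 1-based index of the last call, after which the wrapper would write
 * out all profiling data instead of waiting for MPI_Finalize(). */
static int a2a_call_count = 0;

static int should_dump_now(void)
{
    const char *s = getenv("A2A_DUMP_AFTER");
    if (s == NULL)
        return 0;               /* not set: rely on MPI_Finalize as usual */
    a2a_call_count++;
    return a2a_call_count == atoi(s);
}
```

The wrapper would call this at the end of every intercepted alltoallv and trigger the dump when it returns non-zero.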
The Alltoallv test from the OSU benchmarks is a good step up from the examples, so we should also use it for validation.
This is the high-level issue for that integration; sub-issues may come later when the work starts.
We need to support the following workflow:
Calculate the scaled amount of data sent and received.
From there, figure out the scale of the bandwidth.
Update the bandwidth data with the scale.
In other terms, we need to be able to set the scale and update the data, which the scale package does not currently support.
All send and receive counts are equal to zero except for one rank, the rank itself.
I now rely on the capability to split the tracing into chunks: profile calls 0-999, then 1000-1999, and so on. So now I need a tool that merges all these traces back together. Aside from the trace size, it should be easy: use the first file (0-999) as-is and add the other files one by one. When parsing a file, extract the counters and do a string comparison. If the pattern already exists, increment the call counter and add the call ID to the list of calls associated with the pattern.
A typical issue when looking at data after the fact is knowing what format was used. Without it, it is difficult to know which version of the tool is required to do post-mortem analysis. We already have a file to track the version of the data format, but we do not use it when generating data. We need to include it in the generated files. It would be okay to just include it in the file name for now.
Switch all sprintf calls to snprintf and handle potential errors and truncated results.
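A minimal sketch of the intended snprintf pattern. The helper name and file-name format are assumptions for illustration; the key points are checking the negative return (encoding error) and a return value >= the buffer size (truncation):

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: format a per-rank file path into buf.
 * Returns 0 on success, -1 on encoding error or truncation. */
static int safe_format_path(char *buf, size_t size, const char *dir, int rank)
{
    int rc = snprintf(buf, size, "%s/counts.rank%d.txt", dir, rank);
    if (rc < 0)
        return -1;              /* encoding error */
    if ((size_t)rc >= size)
        return -1;              /* output was truncated */
    return 0;
}
```

Unlike sprintf, snprintf never writes past `size` bytes, and its return value tells the caller how long the full output would have been, which is what makes the truncation check possible.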
Right now the validation tests check that we can generate the profiles. We need to extend it to check if we can do post-mortem analysis and that we get the expected output.
Save patterns in a separate file.
Maps are saved to files but we cannot load them back from those files, resulting in a lot of recomputation when we bring up the WebUI.
Find a way to group patterns in 3 different groups:
The logging of data is based on the assumption that rank 0 on COMM_WORLD has all the data. We cannot make that assumption, or we may miss all the calls where world rank 0 is not rank 0 of the communicator used for the alltoallv calls.
Right now, the assumption here is that we do not deal with alltoallv calls on sub-communicators; otherwise we would need to deal with jobid and rank when calling SaveBins(). For now we set them to -1. Once this is fixed, we can simplify the code of SaveBins by removing the code specific to the case where the jobid and rank are not provided.
At the moment, every time we start the server, the bins are recreated, which takes time with a large dataset. Avoid this when the resulting files already exist.
Extend the current script to be able to extract all data about specific call(s).
Right now the metadata will say something like "[0-500]", which is actually incorrect because calls are zero-indexed. It should be [0-499].
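The fix amounts to formatting an inclusive, zero-indexed range; a small sketch (the helper name is a hypothetical, not existing code):

```c
#include <stdio.h>
#include <string.h>

/* Format a zero-indexed, inclusive call range for the metadata: the
 * first `count` calls starting at `start` are calls start through
 * start + count - 1, so 500 calls from 0 yield "[0-499]". */
static void format_call_range(char *buf, size_t size, int start, int count)
{
    snprintf(buf, size, "[%d-%d]", start, start + count - 1);
}
```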
Right now, we have a Bcast to send a unique ID to the ranks to make it easier to identify files. Use SLURM_JOB_ID instead, and document that even when Slurm is not used, the user is responsible for setting it in order to identify files.
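A sketch of reading the job ID from the environment instead of broadcasting it. The function name and the -1 fallback are assumptions; the fallback makes a missing SLURM_JOB_ID easy to spot in file names:

```c
#include <stdlib.h>

/* Hypothetical sketch: derive the file-name job ID from SLURM_JOB_ID.
 * When Slurm is not used, the user must set the variable themselves;
 * we fall back to -1 so an unset ID is obvious in the output files. */
static int get_job_id(void)
{
    const char *s = getenv("SLURM_JOB_ID");
    return s ? atoi(s) : -1;
}
```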
Right now we have a static size, MAX_TRACKED_CALLS, when tracking calls in the context of counts. If we reach the max, we display a message, but I need to switch to dynamic arrays that grow automatically.
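A minimal sketch of the growable array that would replace the static MAX_TRACKED_CALLS limit; the structure and function names are illustrative assumptions, and the capacity is doubled on each growth to keep appends amortized O(1):

```c
#include <stdlib.h>

/* Sketch of a growable call list replacing the static array. */
struct call_list {
    int   *calls;     /* tracked call IDs */
    size_t count;     /* number of IDs stored */
    size_t capacity;  /* allocated slots */
};

/* Append call_id, doubling the capacity whenever the array is full.
 * Returns 0 on success, -1 on allocation failure (list left intact). */
static int add_call(struct call_list *l, int call_id)
{
    if (l->count == l->capacity) {
        size_t new_cap = l->capacity ? l->capacity * 2 : 16;
        int *p = realloc(l->calls, new_cap * sizeof(*p));
        if (p == NULL)
            return -1;          /* out of memory: caller decides what to do */
        l->calls = p;
        l->capacity = new_cap;
    }
    l->calls[l->count++] = call_id;
    return 0;
}
```

Keeping the realloc result in a temporary pointer means an allocation failure does not leak or clobber the existing array.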