Comments (9)
Thanks for the info! I'm a tad sick so I'm taking the rest of the day (sorry!), but I can get back to this on Monday.
from ascent.
Thanks for confirming this behavior, we will work to resolve these limits for binning.
Hey @BenWibking sorry for the delay.
A reproducer would be great and I can do my best to try to help you figure out this issue!
Thanks. I've put a reproducer here: parthenon-hpc-lab/athenapk#49. Let me know what you find.
@nicolemarsaglia I've rebuilt Ascent + TPLs with debugging info and I get a more informative backtrace. The segmentation fault happens here:
(cuda-gdb) bt
#0 ascent::runtime::expressions::binning (dataset=..., bin_axes=..., reduction_var="Density", reduction_op="avg", empty_bin_val=0,
component="") at /projects/cvz/bwibking/ascent_debug/ascent/src/libs/ascent/runtimes/expressions/ascent_blueprint_architect.cpp:1649
#1 0x00007fb75d836285 in ascent::runtime::expressions::binning_interface (reduction_var="Density", reduction_op="avg",
n_empty_bin_val=..., n_component=..., n_axis_list=..., dataset=..., n_binning=..., n_output_axes=...)
at /projects/cvz/bwibking/ascent_debug/ascent/src/libs/ascent/runtimes/expressions/ascent_expression_filters.cpp:3158
#2 0x00007fb75d836c1a in ascent::runtime::expressions::Binning::execute (this=0xc40cdc0)
at /projects/cvz/bwibking/ascent_debug/ascent/src/libs/ascent/runtimes/expressions/ascent_expression_filters.cpp:3197
#3 0x00007fb75d0b9ba9 in flow::Workspace::execute (this=0x7ffd1f630098)
at /projects/cvz/bwibking/ascent_debug/ascent/src/libs/flow/flow_workspace.cpp:303
#4 0x00007fb75d7423a7 in ascent::runtime::expressions::ExpressionEval::evaluate (this=0x7ffd1f630030,
expr="binning('Density','avg', [axis('x',[-0.5,0.5]), axis('y', [-0.5,0.5]), axis('z', num_bins=64)])",
expr_name="avg_density_profile")
at /projects/cvz/bwibking/ascent_debug/ascent/src/libs/ascent/runtimes/ascent_expression_eval.cpp:1534
#5 0x00007fb75d927c35 in ascent::runtime::filters::BasicQuery::execute (this=0xa08c460)
at /projects/cvz/bwibking/ascent_debug/ascent/src/libs/ascent/runtimes/flow_filters/ascent_runtime_query_filters.cpp:127
#6 0x00007fb75d0b9ba9 in flow::Workspace::execute (this=0x7df7460)
at /projects/cvz/bwibking/ascent_debug/ascent/src/libs/flow/flow_workspace.cpp:303
#7 0x00007fb75d6fba4f in ascent::AscentRuntime::Execute (this=0x7df6ec0, actions=...)
at /projects/cvz/bwibking/ascent_debug/ascent/src/libs/ascent/runtimes/ascent_main_runtime.cpp:1831
#8 0x00007fb75d6e1915 in ascent::Ascent::execute (this=0x7ffd1f6318d0, actions=...)
at /projects/cvz/bwibking/ascent_debug/ascent/src/libs/ascent/ascent.cpp:410
#9 0x000000000085ba78 in parthenon::AscentOutput::WriteOutputFile(parthenon::Mesh*, parthenon::ParameterInput*, parthenon::SimTime*, parthenon::SignalHandler::OutputSignal) ()
#10 0x0000000000777e7e in parthenon::Outputs::MakeOutputs(parthenon::Mesh*, parthenon::ParameterInput*, parthenon::SimTime*, parthenon::SignalHandler::OutputSignal) ()
#11 0x00000000006cfcb1 in parthenon::EvolutionDriver::Execute() ()
#12 0x00000000004419df in main ()
(cuda-gdb) list
1644 //#endif
1645 for(int i = 0; i < homes_size; ++i)
1646 {
1647 if(homes[i] != -1)
1648 {
1649 update_bin(bins, homes[i], values[i], reduction_op);
1650 }
1651 }
1652 }
1653 }
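For readers who do not have the Ascent source open, here is a minimal sketch of what a loop of this shape might be doing for an `avg` reduction. It is not Ascent's implementation: the body of `update_bin`, the interleaved sum/count layout of `bins`, and the helper `bin_average` are all assumptions made for illustration (the layout is merely consistent with a `bins` array twice as long as the number of bins).

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for update_bin; the real signature and body may differ.
// Assumed layout for "avg": bins[2*b] is the running sum, bins[2*b + 1] the count.
void update_bin(double *bins, int bin, double value, const std::string &op)
{
    if (op == "avg")
    {
        bins[2 * bin] += value;
        bins[2 * bin + 1] += 1.0;
    }
}

// Hypothetical driver mirroring the loop in the listing. Elements whose home
// is -1 fall outside every bin and are skipped. Returns one average per bin,
// with empty_bin_val for bins that received no elements.
std::vector<double> bin_average(const int *homes, const double *values,
                                std::size_t n, int num_bins, double empty_bin_val)
{
    std::vector<double> bins(2 * static_cast<std::size_t>(num_bins), 0.0);
    for (std::size_t i = 0; i < n; ++i)
    {
        if (homes[i] != -1)
        {
            // values[i] is read by the CPU here, so values must be host-readable.
            update_bin(bins.data(), homes[i], values[i], "avg");
        }
    }

    std::vector<double> result(num_bins, empty_bin_val);
    for (int b = 0; b < num_bins; ++b)
    {
        if (bins[2 * b + 1] > 0.0)
        {
            result[b] = bins[2 * b] / bins[2 * b + 1];
        }
    }
    return result;
}
```

The point of the sketch is the read of `values[i]` inside the guarded branch: it is the first place in the loop where the field data itself is touched by host code.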
Here's `info args`:
(cuda-gdb) info args
dataset = @0xc2bec10: {m_parent = 0x0, m_schema = 0xc2bfb20, m_owns_schema = true,
m_children = std::vector of length 44, capacity 64 = {0xc309760, 0xc2bd670, 0xc2bd710, 0xc30a490, 0xc30a270, 0xc2bf2b0, 0xc2bf430,
0xc2c1530, 0xc30feb0, 0xc2c4f60, 0xc2c5f20, 0xc2c2130, 0xc2c5ca0, 0xc2c91c0, 0xc2c9d10, 0xc2c8f90, 0xc2c25e0, 0xc2cbba0, 0xc2cbd10,
0xc2c4bb0, 0xc2ccc90, 0xc2c2c10, 0xc2c7590, 0xc2cabe0, 0xc2c7380, 0xc2d51b0, 0xc2d4000, 0xc2d7020, 0xc355040, 0xc2d8e00, 0xa10ea60,
0xc2dab80, 0xc2d4e00, 0xc2da950, 0xb841190, 0xc2db6e0, 0xc2de550, 0xc2d8930, 0xc2d5ea0, 0xc2e17a0, 0xc2e25d0, 0xc3d98b0, 0xc2e42f0,
0xc2e5290}, m_data = 0x0, m_data_size = 0, m_alloced = false, m_mmaped = false, m_mmap = 0x0, m_allocator_id = 0}
bin_axes = @0x7ffd1f62ea80: {m_parent = 0x0, m_schema = 0xc413880, m_owns_schema = true,
m_children = std::vector of length 3, capacity 4 = {0xc40fb70, 0xc40f570, 0xc40fa70}, m_data = 0x0, m_data_size = 0,
m_alloced = false, m_mmaped = false, m_mmap = 0x0, m_allocator_id = 0}
reduction_var = "Density"
reduction_op = "avg"
empty_bin_val = 0
component = ""
And `info locals`:
(cuda-gdb) info locals
i = 26
values = {m_data = 0x7fb6ee6f9280, m_dtype = {m_id = 12, m_num_ele = 1728, m_offset = 0, m_stride = 8, m_ele_bytes = 8,
m_endianness = 0}}
comp_path = ""
values_path = "fields/Density/values"
dom = @0xc309760: {m_parent = 0xc2bec10, m_schema = 0xc3096f0, m_owns_schema = false,
m_children = std::vector of length 4, capacity 4 = {0xc309910, 0xc30a720, 0xc30ada0, 0xc30c280}, m_data = 0x0, m_data_size = 0,
m_alloced = false, m_mmaped = false, m_mmap = 0x0, m_allocator_id = 0}
n_homes = {m_parent = 0x0, m_schema = 0xc413bc0, m_owns_schema = true, m_children = std::vector of length 0, capacity 0,
m_data = 0xc4141f0, m_data_size = 6912, m_alloced = true, m_mmaped = false, m_mmap = 0x0, m_allocator_id = 0}
homes = 0xc4141f0
homes_size = 1728
dom_index = 0
var_names = std::vector of length 4, capacity 6 = {"x", "y", "z", "Density"}
topo_and_assoc = @0x7ffd1f62d800: {m_parent = 0x0, m_schema = 0xc411040, m_owns_schema = true,
m_children = std::vector of length 2, capacity 2 = {0xc2cc2e0, 0xc410de0}, m_data = 0x0, m_data_size = 0, m_alloced = false,
m_mmaped = false, m_mmap = 0x0, m_allocator_id = 0}
topo_name = "topo"
assoc_str = "element"
bounds = @0x7ffd1f62d760: {m_parent = 0x0, m_schema = 0xc4112a0, m_owns_schema = true,
m_children = std::vector of length 2, capacity 2 = {0xc410fe0, 0xc413820}, m_data = 0x0, m_data_size = 0, m_alloced = false,
m_mmaped = false, m_mmap = 0x0, m_allocator_id = 0}
min_coords = 0xc2c4370
max_coords = 0xc2c0b20
axes = {{"x", "i", "dx"}, {"y", "j", "dy"}, {"z", "k", "dz"}}
num_axes = 3
num_bins = 64
num_bin_vars = 2
bins_size = 128
bins = 0x6e99a90
mpi_comm = 0x7ffd1f62e240
global_bins = 0x10000000c2bec10
res = {m_parent = 0x0, m_schema = 0xc413880, m_owns_schema = false,
m_children = std::vector of length -17553172076876, capacity -17553146376764 = {0x458b48c389481aeb, 0xf528ede8c78948e8,
0xe8c78948d88948ff, 0xf85d8b48fff59272, 0x4853e5894855c3c9, 0x48e87d894818ec83, 0x48e8458b48e07589, 0x48fff4b0ffe8c789,
0x53e8c78948e8458b, 0x48e0558b48fff4a6, 0x8948d68948e8458b, 0x1aebfff58e20e8c7, 0x48e8458b48c38948, 0x48fff5288fe8c789,
0x9214e8c78948d889, 0xc3c9f85d8b48fff5, 0xec834853e5894855, 0x758948e87d894818, 0xc78948e8458b48e0, 0x458b48fff4b0a1e8,
0xf4a5f5e8c78948e8, 0x458b48e0558b48ff, 0xe8c78948d68948e8, 0x89481aebfff49e32, 0xc78948e8458b48c3, 0xd88948fff52831e8,
0xfff591b6e8c78948, 0x4855c3c9f85d8b48, 0x4848ec834853e589, 0x48b0758948b87d89, 0x43e8c78948b8458b, 0x48b8458b48fff4b0,
0x48fff4a597e8c789, 0x1be8c78948ef458d, 0x48ef558d48fff592, 0x48c0458d48b04d8b, 0x4d94e8c78948ce89, 0x8b48c0558d48fff5,
0xc78948d68948b845, 0x458d48fff49db1e8, 0xf4d5f5e8c78948c0, 0xc78948ef458d48ff, 0x483cebfff51dd9e8, 0x8948c0458d48c389,
0x3ebfff4d5d8e8c7, 0x48ef458d48c38948, 0xebfff51db7e8c789, 0xb8458b48c3894803, 0xfff52776e8c78948, 0xfbe8c78948d88948,
0xc9f85d8b48fff590, 0x8348e589485590c3, 0x8b48f87d894810ec, 0x5a9ce8c78948f845, 0x8948f8458b48fff5, 0xc990fff52740e8c7,
0x8348e589485590c3, 0x8b48f87d894810ec, 0x5a74e8c78948f845, 0x485590c3c990fff5, 0xec8348535441e589, 0x758948b87d894840,
0xc78948b8458b48b0, 0xef45c6fff483d1e8, 0xc78948b0458b4800, 0x458948fff4fe81e8, 0x5e7501d87d8348d8, 0xe8c78948b8458b48, 0x1ef45c6fff4ab0a, 0xe8c78948b0458b48, 0x48c38948fff4fb4a, 0xdbe8c78948b8458b, 0x8948de8948fff4bf, 0x8b48fff561a0e8c7, 0x94e4e8c78948b045, 0x458b48c38948fff5, 0xf48ae5e8c78948b8, 0xe8c78948de8948ff, 0x83482cebfff5196a, 0x458b48127502d87d, 0xf4a845e8c78948b8, 0x4813eb01ef45c6ff, 0x48b8458b48b0558b, 0x68bce8c78948d689, 0x840f00ef7d80fff5, 0xb8458b48000000af, 0xfff4d756e8c78948, 0xb0458b48d0458948, 0xfff4f0a6e8c78948, 0xe045c748c8458948, 0x8b4856eb00000000, 0x8948c8458b48e055, 0xf4cf75e8c78948d6, 0x40bf208b4cff, 0x8948fff51148e800, 0xe8df8948e6894cc3, 0xc05d8948fff5186a, 0xb8558b48c0458b48, 0xc0558d4838508948, 0x48d68948d0458b48, 0x48fff556b7e8c789, 0xc8458b4801e04583, 0xfff571e6e8c78948, 0x84c0920fe0453948, 0xc4894916eb9375c0, 0xfff50a0ee8df8948, 0x33e8c78948e0894c, 0x40c4834890fff58f, 0x485590c35d5c415b, 0x894810ec8348e589, 0x8b48f0758948f87d, 0x824ce8c78948f845, 0x8948f8458b48fff4, 0x8b48fff48530e8c7, 0x8948f0558b48f845, 0xf5518de8c78948d6, 0xe5894855c3c990ff, 0xf87d894810ec8348, 0xf8458b48f0758948, 0xfff4820ee8c78948, 0xe8c78948f0458b48, 0x1f88348fff4fcc2, 0x480e74c084c0940f, 0x4be8c78948f8458b, 0x458b4823ebfff4a9, 0xf4fc9de8c78948f0, 0xc0940f02f88348ff, 0xf8458b480c74c084, 0xfff4a6c6e8c78948, 0xf0558b48f8458b48, 0x43e8c78948d68948, 0x4855c3c990fff567, 0x894810ec8348e589, 0x8b48f0758948f87d, 0x8194e8c78948f845, 0x8b48f0558b48fff4, 0xc78948d68948f845, 0xc3c990fff4fe71e8, 0x10ec8348e5894855, 0xf0758948f87d8948, 0xf0453b48f8458b48, 0x8b48f0558b481374, 0xc78948d68948f845, 0x458b48fff4d771e8, 0xe589485590c3c9f8, 0xf87d894810ec8348, 0xf0558b48f0758948, 0x48d68948f8458b48, 0x48fff589d7e8c789, 0x485590c3c9f8458b, 0x894810ec8348e589, 0x8b48f0758948f87d, 0x8948f8458b48f055, 0xf49a1de8c78948d6, 0x90c3c9f8458b48ff, 0x40ec8348e5894855, 0xf845c748c87d8948, 0xc8458b4800000000, 0xfff4fb96e8c78948, 0xf07d8348f0458948, 0x2f07d8348077401, 0x8948c8458b487275, 
0x8948fff4ee58e8c7, 0x8948e8458b48e845, 0x8948fff486d8e8c7, 0xd8458d4827ebd845, 0xfff54356e8c78948, 0x95e8c78948008b48, 0x48f8450148ffffff, 0x2be8c78948d8458d, 0x48e8458b48fff519, 0x48fff5754fe8c789, 0x48e0558d48e04589, 0x8948d68948d8458d, 0xc084fff55598e8c7, 0xf07d834817ebb275, 0x48c8458b48107400, 0x48fff573dfe8c789, 0xc9f8458b48f84589, 0x8348e589485590c3, 0xc748c87d894840ec, 0x8b4800000000f845, 0xfad4e8c78948c845, 0x8348f0458948fff4, 0x7d8348077401f07d, 0xc8458b48727502f0, 0xfff4ed96e8c78948, 0xe8458b48e8458948, 0xfff48616e8c78948, 0x8d4827ebd8458948, 0x4294e8c78948d845, 0xc78948008b48fff5, 0x450148ffffff95e8, 0xc78948d8458d48f8, 0x458b48fff51869e8, 0xf5748de8c78948e8, 0x558d48e0458948ff, 0xd68948d8458d48e0, 0xfff554d6e8c78948, 0x834817ebb275c084...}, m_data = 0x7ffd1f630130, m_data_size = 205600896, m_alloced = 48, m_mmaped = 234, m_mmap = 0x7fb754b039c7 <conduit::Node::init_defaults()+93>, m_allocator_id = 140725130029616}
res_bins = 0x26bbb40 <ompi_mpi_comm_world>
I've uploaded the core files here: https://cloudstor.aarnet.edu.au/plus/s/hTgYZQWYDYTPZn9
This is a very strange bug that I cannot reproduce on either Frontier or Summit. Somehow it appears to only happen on A100s.
OK, I've traced the issue: the binning operation runs on the CPU and dereferences a device pointer, because our code passes device-resident data to Ascent via zero-copy. This works on systems with unified memory, such as Summit and Frontier, but fails on systems without it.
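One way to guard against this class of failure is to stage device data into host memory before any CPU loop reads it. The sketch below shows only the pattern, under stated assumptions: `MemSpace`, `CopyToHost`, `host_readable`, and `host_sum` are hypothetical names, not Ascent or Conduit APIs, and the copy hook is a plain callback so the example compiles without CUDA. In a CUDA build the hook would typically wrap `cudaMemcpy` with `cudaMemcpyDeviceToHost`, and the memory space could be queried with `cudaPointerGetAttributes`.

```cpp
#include <cstddef>
#include <cstring>
#include <functional>
#include <vector>

// Hypothetical tag describing where an array lives; not an Ascent or Conduit type.
enum class MemSpace { Host, Device };

// Hypothetical copy hook. In a CUDA build this would wrap
// cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToHost).
using CopyToHost = std::function<void(void *dst, const void *src, std::size_t bytes)>;

// Returns a pointer that is safe to dereference on the CPU. Host data is
// passed through untouched (zero-copy); device data is staged into scratch.
const double *host_readable(const double *ptr, std::size_t n, MemSpace space,
                            std::vector<double> &scratch, const CopyToHost &copy)
{
    if (space == MemSpace::Host)
    {
        return ptr;
    }
    scratch.resize(n);
    copy(scratch.data(), ptr, n * sizeof(double));
    return scratch.data();
}

// CPU-side reduction that only ever reads through the host-readable pointer.
double host_sum(const double *ptr, std::size_t n, MemSpace space,
                const CopyToHost &copy)
{
    std::vector<double> scratch;
    const double *values = host_readable(ptr, n, space, scratch, copy);
    double total = 0.0;
    for (std::size_t i = 0; i < n; ++i)
    {
        total += values[i];
    }
    return total;
}
```

The host path stays zero-copy, so systems where the CPU can already read the data pay nothing extra; only device-resident arrays incur the staging copy.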