Comments (9)

nicolemarsaglia avatar nicolemarsaglia commented on June 9, 2024 1

Thanks for the info! I'm a tad sick so I'm taking the rest of the day (sorry!), but I can get back to this on Monday.

from ascent.

cyrush avatar cyrush commented on June 9, 2024 1

Thanks for confirming this behavior. We will work to resolve these limits for binning.

nicolemarsaglia avatar nicolemarsaglia commented on June 9, 2024

Hey @BenWibking, sorry for the delay.

A reproducer would be great, and I'll do my best to help you figure out this issue!

BenWibking avatar BenWibking commented on June 9, 2024

Thanks. I've put a reproducer here: parthenon-hpc-lab/athenapk#49. Let me know what you find.

BenWibking avatar BenWibking commented on June 9, 2024

@nicolemarsaglia I've rebuilt Ascent + TPLs with debugging info and I get a more informative backtrace. The segmentation fault happens here:

(cuda-gdb) bt
#0  ascent::runtime::expressions::binning (dataset=..., bin_axes=..., reduction_var="Density", reduction_op="avg", empty_bin_val=0,
    component="") at /projects/cvz/bwibking/ascent_debug/ascent/src/libs/ascent/runtimes/expressions/ascent_blueprint_architect.cpp:1649
#1  0x00007fb75d836285 in ascent::runtime::expressions::binning_interface (reduction_var="Density", reduction_op="avg",
    n_empty_bin_val=..., n_component=..., n_axis_list=..., dataset=..., n_binning=..., n_output_axes=...)
    at /projects/cvz/bwibking/ascent_debug/ascent/src/libs/ascent/runtimes/expressions/ascent_expression_filters.cpp:3158
#2  0x00007fb75d836c1a in ascent::runtime::expressions::Binning::execute (this=0xc40cdc0)
    at /projects/cvz/bwibking/ascent_debug/ascent/src/libs/ascent/runtimes/expressions/ascent_expression_filters.cpp:3197
#3  0x00007fb75d0b9ba9 in flow::Workspace::execute (this=0x7ffd1f630098)
    at /projects/cvz/bwibking/ascent_debug/ascent/src/libs/flow/flow_workspace.cpp:303
#4  0x00007fb75d7423a7 in ascent::runtime::expressions::ExpressionEval::evaluate (this=0x7ffd1f630030,
    expr="binning('Density','avg', [axis('x',[-0.5,0.5]), axis('y', [-0.5,0.5]), axis('z', num_bins=64)])",
    expr_name="avg_density_profile")
    at /projects/cvz/bwibking/ascent_debug/ascent/src/libs/ascent/runtimes/ascent_expression_eval.cpp:1534
#5  0x00007fb75d927c35 in ascent::runtime::filters::BasicQuery::execute (this=0xa08c460)
    at /projects/cvz/bwibking/ascent_debug/ascent/src/libs/ascent/runtimes/flow_filters/ascent_runtime_query_filters.cpp:127
#6  0x00007fb75d0b9ba9 in flow::Workspace::execute (this=0x7df7460)
    at /projects/cvz/bwibking/ascent_debug/ascent/src/libs/flow/flow_workspace.cpp:303
#7  0x00007fb75d6fba4f in ascent::AscentRuntime::Execute (this=0x7df6ec0, actions=...)
    at /projects/cvz/bwibking/ascent_debug/ascent/src/libs/ascent/runtimes/ascent_main_runtime.cpp:1831
#8  0x00007fb75d6e1915 in ascent::Ascent::execute (this=0x7ffd1f6318d0, actions=...)
    at /projects/cvz/bwibking/ascent_debug/ascent/src/libs/ascent/ascent.cpp:410
#9  0x000000000085ba78 in parthenon::AscentOutput::WriteOutputFile(parthenon::Mesh*, parthenon::ParameterInput*, parthenon::SimTime*, parthenon::SignalHandler::OutputSignal) ()
#10 0x0000000000777e7e in parthenon::Outputs::MakeOutputs(parthenon::Mesh*, parthenon::ParameterInput*, parthenon::SimTime*, parthenon::SignalHandler::OutputSignal) ()
#11 0x00000000006cfcb1 in parthenon::EvolutionDriver::Execute() ()
#12 0x00000000004419df in main ()
(cuda-gdb) list
1644	//#endif
1645	        for(int i = 0; i < homes_size; ++i)
1646	        {
1647	          if(homes[i] != -1)
1648	          {
1649	            update_bin(bins, homes[i], values[i], reduction_op);
1650	          }
1651	        }
1652	      }
1653	    }
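
For readers following along, here is a minimal host-only sketch of what the loop at line 1649 appears to be doing. It is not Ascent's implementation: the `[sum, count]` layout per bin is an assumption inferred from `num_bin_vars = 2` and `bins_size = 128` for 64 bins, and `update_bin`, `accumulate`, and `finalize_avg` below are hypothetical stand-ins. The point is that `values[i]` is read by the CPU on every iteration, so `values` must point at host-readable memory.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for update_bin: each bin is assumed to hold
// two values, laid out as [sum, count].
void update_bin(std::vector<double> &bins, int bin, double value,
                const std::string &op)
{
  if(op == "avg" || op == "sum")
  {
    bins[2 * bin] += value;    // running sum
    bins[2 * bin + 1] += 1.0;  // number of samples in this bin
  }
}

// Accumulate one domain's values into the bins. homes[i] == -1 marks
// an element that falls outside every bin.
void accumulate(std::vector<double> &bins, const std::vector<int> &homes,
                const double *values, const std::string &op)
{
  for(std::size_t i = 0; i < homes.size(); ++i)
  {
    if(homes[i] != -1)
    {
      // values[i] is dereferenced on the CPU here.
      update_bin(bins, homes[i], values[i], op);
    }
  }
}

// Finish an "avg" reduction. Empty bins receive empty_bin_val.
std::vector<double> finalize_avg(const std::vector<double> &bins,
                                 double empty_bin_val)
{
  assert(bins.size() % 2 == 0);
  std::vector<double> out(bins.size() / 2, empty_bin_val);
  for(std::size_t b = 0; b < out.size(); ++b)
  {
    if(bins[2 * b + 1] > 0.0)
    {
      out[b] = bins[2 * b] / bins[2 * b + 1];
    }
  }
  return out;
}
```

With this layout, a crash on the `update_bin` line is consistent with either a bad `homes[i]` index or an unreadable `values` pointer, which is why the locals are worth inspecting.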

BenWibking avatar BenWibking commented on June 9, 2024

Here's info args:

(cuda-gdb) info args
dataset = @0xc2bec10: {m_parent = 0x0, m_schema = 0xc2bfb20, m_owns_schema = true,
  m_children = std::vector of length 44, capacity 64 = {0xc309760, 0xc2bd670, 0xc2bd710, 0xc30a490, 0xc30a270, 0xc2bf2b0, 0xc2bf430,
    0xc2c1530, 0xc30feb0, 0xc2c4f60, 0xc2c5f20, 0xc2c2130, 0xc2c5ca0, 0xc2c91c0, 0xc2c9d10, 0xc2c8f90, 0xc2c25e0, 0xc2cbba0, 0xc2cbd10,
    0xc2c4bb0, 0xc2ccc90, 0xc2c2c10, 0xc2c7590, 0xc2cabe0, 0xc2c7380, 0xc2d51b0, 0xc2d4000, 0xc2d7020, 0xc355040, 0xc2d8e00, 0xa10ea60,
    0xc2dab80, 0xc2d4e00, 0xc2da950, 0xb841190, 0xc2db6e0, 0xc2de550, 0xc2d8930, 0xc2d5ea0, 0xc2e17a0, 0xc2e25d0, 0xc3d98b0, 0xc2e42f0,
    0xc2e5290}, m_data = 0x0, m_data_size = 0, m_alloced = false, m_mmaped = false, m_mmap = 0x0, m_allocator_id = 0}
bin_axes = @0x7ffd1f62ea80: {m_parent = 0x0, m_schema = 0xc413880, m_owns_schema = true,
  m_children = std::vector of length 3, capacity 4 = {0xc40fb70, 0xc40f570, 0xc40fa70}, m_data = 0x0, m_data_size = 0,
  m_alloced = false, m_mmaped = false, m_mmap = 0x0, m_allocator_id = 0}
reduction_var = "Density"
reduction_op = "avg"
empty_bin_val = 0
component = ""

And info locals:

(cuda-gdb) info locals
i = 26
values = {m_data = 0x7fb6ee6f9280, m_dtype = {m_id = 12, m_num_ele = 1728, m_offset = 0, m_stride = 8, m_ele_bytes = 8,
    m_endianness = 0}}
comp_path = ""
values_path = "fields/Density/values"
dom = @0xc309760: {m_parent = 0xc2bec10, m_schema = 0xc3096f0, m_owns_schema = false,
  m_children = std::vector of length 4, capacity 4 = {0xc309910, 0xc30a720, 0xc30ada0, 0xc30c280}, m_data = 0x0, m_data_size = 0,
  m_alloced = false, m_mmaped = false, m_mmap = 0x0, m_allocator_id = 0}
n_homes = {m_parent = 0x0, m_schema = 0xc413bc0, m_owns_schema = true, m_children = std::vector of length 0, capacity 0,
  m_data = 0xc4141f0, m_data_size = 6912, m_alloced = true, m_mmaped = false, m_mmap = 0x0, m_allocator_id = 0}
homes = 0xc4141f0
homes_size = 1728
dom_index = 0
var_names = std::vector of length 4, capacity 6 = {"x", "y", "z", "Density"}
topo_and_assoc = @0x7ffd1f62d800: {m_parent = 0x0, m_schema = 0xc411040, m_owns_schema = true,
  m_children = std::vector of length 2, capacity 2 = {0xc2cc2e0, 0xc410de0}, m_data = 0x0, m_data_size = 0, m_alloced = false,
  m_mmaped = false, m_mmap = 0x0, m_allocator_id = 0}
topo_name = "topo"
assoc_str = "element"
bounds = @0x7ffd1f62d760: {m_parent = 0x0, m_schema = 0xc4112a0, m_owns_schema = true,
  m_children = std::vector of length 2, capacity 2 = {0xc410fe0, 0xc413820}, m_data = 0x0, m_data_size = 0, m_alloced = false,
  m_mmaped = false, m_mmap = 0x0, m_allocator_id = 0}
min_coords = 0xc2c4370
max_coords = 0xc2c0b20
axes = {{"x", "i", "dx"}, {"y", "j", "dy"}, {"z", "k", "dz"}}
num_axes = 3
num_bins = 64
num_bin_vars = 2
bins_size = 128
bins = 0x6e99a90
mpi_comm = 0x7ffd1f62e240
global_bins = 0x10000000c2bec10
res = {m_parent = 0x0, m_schema = 0xc413880, m_owns_schema = false,
  m_children = std::vector of length -17553172076876, capacity -17553146376764 = {0x458b48c389481aeb, 0xf528ede8c78948e8,
    0xe8c78948d88948ff, 0xf85d8b48fff59272, 0x4853e5894855c3c9, 0x48e87d894818ec83, 0x48e8458b48e07589, 0x48fff4b0ffe8c789,
    0x53e8c78948e8458b, 0x48e0558b48fff4a6, 0x8948d68948e8458b, 0x1aebfff58e20e8c7, 0x48e8458b48c38948, 0x48fff5288fe8c789,
    0x9214e8c78948d889, 0xc3c9f85d8b48fff5, 0xec834853e5894855, 0x758948e87d894818, 0xc78948e8458b48e0, 0x458b48fff4b0a1e8,
    0xf4a5f5e8c78948e8, 0x458b48e0558b48ff, 0xe8c78948d68948e8, 0x89481aebfff49e32, 0xc78948e8458b48c3, 0xd88948fff52831e8,
    0xfff591b6e8c78948, 0x4855c3c9f85d8b48, 0x4848ec834853e589, 0x48b0758948b87d89, 0x43e8c78948b8458b, 0x48b8458b48fff4b0,
    0x48fff4a597e8c789, 0x1be8c78948ef458d, 0x48ef558d48fff592, 0x48c0458d48b04d8b, 0x4d94e8c78948ce89, 0x8b48c0558d48fff5,
    0xc78948d68948b845, 0x458d48fff49db1e8, 0xf4d5f5e8c78948c0, 0xc78948ef458d48ff, 0x483cebfff51dd9e8, 0x8948c0458d48c389,
    0x3ebfff4d5d8e8c7, 0x48ef458d48c38948, 0xebfff51db7e8c789, 0xb8458b48c3894803, 0xfff52776e8c78948, 0xfbe8c78948d88948,
    0xc9f85d8b48fff590, 0x8348e589485590c3, 0x8b48f87d894810ec, 0x5a9ce8c78948f845, 0x8948f8458b48fff5, 0xc990fff52740e8c7,
    0x8348e589485590c3, 0x8b48f87d894810ec, 0x5a74e8c78948f845, 0x485590c3c990fff5, 0xec8348535441e589, 0x758948b87d894840,
    0xc78948b8458b48b0, 0xef45c6fff483d1e8, 0xc78948b0458b4800, 0x458948fff4fe81e8, 0x5e7501d87d8348d8, 0xe8c78948b8458b48, 0x1ef45c6fff4ab0a, 0xe8c78948b0458b48, 0x48c38948fff4fb4a, 0xdbe8c78948b8458b, 0x8948de8948fff4bf, 0x8b48fff561a0e8c7, 0x94e4e8c78948b045, 0x458b48c38948fff5, 0xf48ae5e8c78948b8, 0xe8c78948de8948ff, 0x83482cebfff5196a, 0x458b48127502d87d, 0xf4a845e8c78948b8, 0x4813eb01ef45c6ff, 0x48b8458b48b0558b, 0x68bce8c78948d689, 0x840f00ef7d80fff5, 0xb8458b48000000af, 0xfff4d756e8c78948, 0xb0458b48d0458948, 0xfff4f0a6e8c78948, 0xe045c748c8458948, 0x8b4856eb00000000, 0x8948c8458b48e055, 0xf4cf75e8c78948d6, 0x40bf208b4cff, 0x8948fff51148e800, 0xe8df8948e6894cc3, 0xc05d8948fff5186a, 0xb8558b48c0458b48, 0xc0558d4838508948, 0x48d68948d0458b48, 0x48fff556b7e8c789, 0xc8458b4801e04583, 0xfff571e6e8c78948, 0x84c0920fe0453948, 0xc4894916eb9375c0, 0xfff50a0ee8df8948, 0x33e8c78948e0894c, 0x40c4834890fff58f, 0x485590c35d5c415b, 0x894810ec8348e589, 0x8b48f0758948f87d, 0x824ce8c78948f845, 0x8948f8458b48fff4, 0x8b48fff48530e8c7, 0x8948f0558b48f845, 0xf5518de8c78948d6, 0xe5894855c3c990ff, 0xf87d894810ec8348, 0xf8458b48f0758948, 0xfff4820ee8c78948, 0xe8c78948f0458b48, 0x1f88348fff4fcc2, 0x480e74c084c0940f, 0x4be8c78948f8458b, 0x458b4823ebfff4a9, 0xf4fc9de8c78948f0, 0xc0940f02f88348ff, 0xf8458b480c74c084, 0xfff4a6c6e8c78948, 0xf0558b48f8458b48, 0x43e8c78948d68948, 0x4855c3c990fff567, 0x894810ec8348e589, 0x8b48f0758948f87d, 0x8194e8c78948f845, 0x8b48f0558b48fff4, 0xc78948d68948f845, 0xc3c990fff4fe71e8, 0x10ec8348e5894855, 0xf0758948f87d8948, 0xf0453b48f8458b48, 0x8b48f0558b481374, 0xc78948d68948f845, 0x458b48fff4d771e8, 0xe589485590c3c9f8, 0xf87d894810ec8348, 0xf0558b48f0758948, 0x48d68948f8458b48, 0x48fff589d7e8c789, 0x485590c3c9f8458b, 0x894810ec8348e589, 0x8b48f0758948f87d, 0x8948f8458b48f055, 0xf49a1de8c78948d6, 0x90c3c9f8458b48ff, 0x40ec8348e5894855, 0xf845c748c87d8948, 0xc8458b4800000000, 0xfff4fb96e8c78948, 0xf07d8348f0458948, 0x2f07d8348077401, 0x8948c8458b487275, 
0x8948fff4ee58e8c7, 0x8948e8458b48e845, 0x8948fff486d8e8c7, 0xd8458d4827ebd845, 0xfff54356e8c78948, 0x95e8c78948008b48, 0x48f8450148ffffff, 0x2be8c78948d8458d, 0x48e8458b48fff519, 0x48fff5754fe8c789, 0x48e0558d48e04589, 0x8948d68948d8458d, 0xc084fff55598e8c7, 0xf07d834817ebb275, 0x48c8458b48107400, 0x48fff573dfe8c789, 0xc9f8458b48f84589, 0x8348e589485590c3, 0xc748c87d894840ec, 0x8b4800000000f845, 0xfad4e8c78948c845, 0x8348f0458948fff4, 0x7d8348077401f07d, 0xc8458b48727502f0, 0xfff4ed96e8c78948, 0xe8458b48e8458948, 0xfff48616e8c78948, 0x8d4827ebd8458948, 0x4294e8c78948d845, 0xc78948008b48fff5, 0x450148ffffff95e8, 0xc78948d8458d48f8, 0x458b48fff51869e8, 0xf5748de8c78948e8, 0x558d48e0458948ff, 0xd68948d8458d48e0, 0xfff554d6e8c78948, 0x834817ebb275c084...}, m_data = 0x7ffd1f630130, m_data_size = 205600896, m_alloced = 48, m_mmaped = 234, m_mmap = 0x7fb754b039c7 <conduit::Node::init_defaults()+93>, m_allocator_id = 140725130029616}
res_bins = 0x26bbb40 <ompi_mpi_comm_world>

BenWibking avatar BenWibking commented on June 9, 2024

I've uploaded the core files here: https://cloudstor.aarnet.edu.au/plus/s/hTgYZQWYDYTPZn9

BenWibking avatar BenWibking commented on June 9, 2024

This is a very strange bug that I cannot reproduce on either Frontier or Summit. Somehow it appears to only happen on A100s.

BenWibking avatar BenWibking commented on June 9, 2024

OK, I've traced the issue: the binning operation runs on the CPU and attempts to dereference a device pointer, because our code passes the device-resident data to Ascent via zero-copy. This works on systems with unified memory, such as Summit and Frontier, but fails on systems without it.
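
One possible workaround on the caller's side is to stage device buffers into host memory before any CPU code reads them. The sketch below only illustrates that guard and is not how Ascent or Parthenon handle it: `MemSpace`, `FieldView`, `CopyToHost`, and `host_readable` are hypothetical names, and the copy routine is injected so the example compiles without CUDA. In a CUDA build, the callback would presumably wrap `cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToHost)`, and the memory space could be queried with `cudaPointerGetAttributes` instead of being tagged by hand.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <functional>
#include <vector>

// Hypothetical tag describing where a buffer lives.
enum class MemSpace { Host, Device };

// Hypothetical non-owning view of a field published via zero-copy.
struct FieldView
{
  const double *ptr;
  std::size_t   count;
  MemSpace      space;
};

// Copy routine used for device buffers: (dst, src, bytes).
using CopyToHost = std::function<void(void *, const void *, std::size_t)>;

// Return a pointer that is safe to read on the CPU. Host buffers are
// returned unchanged; device buffers are first staged into `scratch`,
// which must outlive the returned pointer.
const double *host_readable(const FieldView &field,
                            std::vector<double> &scratch,
                            const CopyToHost &copy_to_host)
{
  if(field.space == MemSpace::Host)
  {
    return field.ptr;
  }
  assert(copy_to_host);  // a device buffer needs a staging routine
  scratch.resize(field.count);
  copy_to_host(scratch.data(), field.ptr, field.count * sizeof(double));
  return scratch.data();
}
```

The trade-off is an extra copy per field on systems without unified memory, in exchange for never handing a raw device pointer to host-side code such as the binning loop.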
