mit-crpg / openmoc
A method of characteristics code for nuclear reactor physics calculations.
Home Page: https://mit-crpg.github.io/OpenMOC/
License: Other
Add a "void alignTrackSegments()" class method to the Track class. This method will be called for all tracks following segmentation, and will allocate arrays in which all track segments can be adjacently stored in memory. Finally, the method will cast the array as a std::vector<segment*> and replace the "_segments" Track class attribute with the new array.
Add an option to enable PAPI at compile time,
e.g. a --with-papi flag passed to the build system.
Use Cython to speed up the Python plotting routines.
All the setCrossSectionByGroup type routines in Material.cpp use 0-based array indexing for the energy group. In other spots in the code, groups are indexed with 1-based indexing.
I can add -1's all over the place myself no problem, but I'm not sure if there are sample inputs (or real inputs?) that would break if we made this consistent.
The segmentation fault occurs with the default relaxation factor and mesh level. Using GDB, the segmentation fault appears to occur in the Mesh::splitCorners() method.
This may be a decent way to ramp up the BEAVRS UROP team to creating the BEAVRS 2D benchmark for OpenMOC.
Possible benchmarks include:
-How to write "input" files
-Brief description of each Solver type
-Reference OpenMOC API
-FAQ
Pin power computation was a hack a year ago to validate the C5G7 benchmark. It needs to be updated and the code made more readable.
In particular, the Solver classes should compute the power (fission rate) in each FSR after the source is converged. The Geometry::computePinPowers(...) and Geometry::computePinPowersInUniverse(...) methods should be extracted into the openmoc.process Python module.
The recursive algorithm used to collect FSR powers into pin powers can be retained, but it should be made more transparent to new developers. In addition, the routine should store the pin powers in an HDF5 file with tags specifying the lattice ID, lattice cell coordinates, etc. for easy data retrieval. Finally, the openmoc.plotter module should be extended with a routine to plot pin powers.
Write an h5py-enabled Python module to read and write performance data, including keff/iteration, residual/iteration, timer data, # segments, # tracks, # FSRs, # angles, # energy groups, etc. Add support for ASCII datafiles for users without h5py.
This bug showed up after switching from matplotlib's pcolor routine to the imshow routine for speedy plotting.
The VectorizedSolver class (and its child VectorizedPrivateSolver) are inordinately slower than the non-vectorized solvers. This class should be faster than the corresponding CPUSolver and ThreadPrivateSolver classes, but it is not. We need to use PAPI profiling to figure out the percentage of vector instructions to total instructions issued, and determine what the bottleneck on performance is.
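As a starting point, a minimal sketch using PAPI's classic high-level counter API and the preset events PAPI_VEC_DP and PAPI_TOT_INS (assuming both presets are available on the target CPU) could bracket a single transport sweep and report the vector instruction fraction:
#include <papi.h>
#include <cstdio>
void profileVectorFraction() {
  int events[2] = {PAPI_VEC_DP, PAPI_TOT_INS};
  long long counts[2];
  /* Begin counting double-precision vector instructions and total instructions */
  PAPI_start_counters(events, 2);
  /* ... run one transport sweep with the VectorizedSolver here ... */
  /* Stop counting and report the fraction of vector instructions issued */
  PAPI_stop_counters(counts, 2);
  printf("vector / total instructions = %f\n",
         (double) counts[0] / (double) counts[1]);
}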
When trying to install OpenMOC on a mac, I ran into some issues with multiple versions of python being installed on my computer. All macs come with a "system" version of python that has been built with gcc 4.2 (as of early 2013). gcc 4.2 does not support the C++ 2011 standard, which is needed to install OpenMOC, so I first downloaded gcc 4.8 with MacPorts. Subsequently, I downloaded the newest version of python with MacPorts and all the required modules (setuptools, matplotlib, numpy, etc.) and linked them with the MacPorts python. From there I ran into several issues installing OpenMOC:
ld: unknown option: -soname
collect2: error: ld returned 1 exit status
error: command 'g++' failed with exit status 1
I followed the directions from various forums and replaced -soname with -dylib_install_name and -shared with -dynamiclib. Also, as suggested, I added the -lpython flag to the linker flags.
Fatal Python error: PyThreadState_Get: no current thread
Abort trap: 6
A good description of what is occurring is given here [http://stackoverflow.com/questions/15678153/homebrew-python-on-mac-os-x-10-8-fatal-python-error-pythreadstate-get-no-cu]. The macports version of python is being used to execute setup.py, but g++ is finding the python library associated with the system version of python while linking. This was remedied by adding a path to the macports python library file. Also, this library file was named libpython2.7.dylib, so the -lpython flag was changed to -lpython2.7.
Lesson to be learned: Compiling OpenMOC to a python module on a mac requires different compiler flags. Also, be cautious of which python installation is being used to compile OpenMOC. While one version of python might be used to call setup.py, the modules or libraries associated with another version can be included, which will result in the code erring out either during compilation or execution. If you get a fatal python error, go into OpenMOC/build/lib and enter the terminal command:
otool -L _openmoc.so
This will tell you which python binary is associated with the openmoc module. If this binary is not the same as the python binary being used to run OpenMOC, the code will error out. Also be aware that python binaries typically begin with a capital "P" while the python executable begins with a lowercase "p".
-Overall code structure
-Coding style conventions
-Configuration management and how to add new source code or compilation options
-Wrapping code with SWIG and SWIG configuration files
Figure out whether to support nested counting of PAPI events (or even if this is viable)
If not possible (or too cumbersome/tricky), add error handling so that only non-nested counting occurs.
Here's my modified version of computeFSRSources. It was originally intended to access only the nonzeros in the scattering matrix for each material but this proved to be too time consuming, oddly enough. This was the reason behind using index, n_mat and nnzs. It is now simplified to only calculate scattering below a cutoff representing the highest incoming group which is scattered. This cutoff is stored in nnzs as the code loops through all outgoing groups.
Also, four different index variables are introduced to reduce the total number of multiplications that have to be done within each loop to retrieve information for various quantities.
Let me know if you have any questions, or see any possible problems with this code. Thanks!
FP_PRECISION CPUSolver::computeFSRSourcesNew() {
int tid;
int flx_ind;
int sig_ind;
int fg_ind;
int t_ind;
int n_mat;
int nnzs;
FP_PRECISION* nu_sigma_f;
FP_PRECISION* sigma_s;
FP_PRECISION* sigma_t;
FP_PRECISION* chi;
Material* material;
FP_PRECISION scatter_source;
FP_PRECISION fission_source;
FP_PRECISION temp;
FP_PRECISION source_residual = 0.0;
FP_PRECISION inv_k_eff = (1.0 / _k_eff); // Precalculate inverse of k
std::vector< std::vector<int> > index;
index = _geometry->getEnergyBounds_fg();
FP_PRECISION* cell_res = new FP_PRECISION[_num_FSRs];
FP_PRECISION* group_res = new FP_PRECISION[_num_threads*_num_groups];
/* _scatter_sources was changed from [_num_FSRs*_num_groups] to
[_num_threads*_num_groups] since storing for each cell is unnecessary */
/* For all regions, find the source */
#pragma omp parallel for private(tid, flx_ind, sig_ind, fg_ind, \
t_ind, n_mat, nnzs, material, nu_sigma_f, chi, sigma_s, sigma_t,\
fission_source, scatter_source, temp) schedule(guided)
for (int r=0; r < _num_FSRs; r++) {
tid = omp_get_thread_num();
material = _FSR_materials[r];
nu_sigma_f = material->getNuSigmaF();
chi = material->getChi();
sigma_s = material->getSigmaS();
sigma_t = material->getSigmaT();
n_mat = material->getUid();
/* Precalculate indexes only dependent on cell, thread or material */
flx_ind = r*_num_groups; // Index to access current cell
fg_ind = n_mat*_num_groups; // Index to access material
t_ind = tid*_num_groups; // Index to access threaded _scatter_sources and group_res
/* Compute fission source */
if (!(material->isFissionable())) {
fission_source = 0.0;
}
else {
for (int g=0; g < _num_groups; g++)
_fission_sources[flx_ind + g] = _scalar_flux[flx_ind + g] * nu_sigma_f[g];
fission_source = inv_k_eff * pairwise_sum<FP_PRECISION>(&_fission_sources[flx_ind],
_num_groups);
}
/* Compute total scattering source for group g */
for (int g=0; g < _num_groups; g++) {
sig_ind = g*_num_groups; // Index to access cross sections for outgoing group g
nnzs = index[fg_ind+g].back()+1;
for (int gp = 0; gp < nnzs; gp++) {
_scatter_sources[t_ind + gp] = sigma_s[sig_ind + gp]*_scalar_flux[flx_ind + gp];
}
scatter_source = pairwise_sum<FP_PRECISION>(&_scatter_sources[t_ind],
nnzs);
/* Set the total source for region r in group G */
_source[flx_ind + g] = (chi[g]*fission_source + scatter_source) * ONE_OVER_FOUR_PI;
_reduced_source[flx_ind + g] = _source[flx_ind + g] / sigma_t[g];
/* Compute the norm of residual of the source in the region, group */
if ( fabs(_source[flx_ind + g]) > (7E-10/_num_groups) ){
temp = (_source[flx_ind + g] - _old_source[flx_ind + g])/_source[flx_ind + g];
group_res[t_ind + g] = temp*temp;
}
else {
/* Zero the entry so a stale value from a previous FSR handled by this thread
does not contaminate the residual sum */
group_res[t_ind + g] = 0.0;
}
/* Update the old source */
_old_source[flx_ind + g] = _source[flx_ind + g];
}
/* Sum all residuals stored in group_res in current thread */
cell_res[r] = pairwise_sum<FP_PRECISION>(&group_res[t_ind],_num_groups);
}
/* Sum up the residuals from each group and in each region */
source_residual = pairwise_sum<FP_PRECISION>(cell_res,_num_FSRs);
source_residual = sqrt(source_residual / _num_FSRs);
delete [] cell_res;
delete [] group_res;
log_printf(DEBUG,"ALL SOURCES AND RESIDUALS BUILT");
return source_residual;
}
Create input files for the BEAVRS benchmark geometry at the axial-midplane. Use 2-group and 8-group cross-section data generated from CASMO for the materials.
This issue will require fixing some ray tracing issues - namely, making the outermost boundary a circle (i.e., the core vessel).
Since this is a public class method, the name of the method should better describe what is going on. From the name, users have no idea that this is related to CMFD. In addition, a "find" method which is exposed to the user (as a public method) would typically return something to the user. This doesn't appear to be the case for this method.
Either this method should be made private so that users don't unexpectedly call it from Python expecting one thing but getting another, or it should be renamed to something more descriptive.
Found this issue when trying to install with CUDA on Noah's machine.
Trying to allocate memory for the A and M CMFD matrices for 10,000 cells and 361 groups caused the code to crash. Instead of allocating it all at once, memory can be allocated one cell at a time. First, the types for A and M must be changed in CMFD.h:
double** _M;
double** _A;
Then, in the constructor for CMFD, the memory allocation section can be changed to the following:
/* Memory allocated for each CMFD cell */
int size_M = _num_groups*_num_groups;
int size_A = _num_groups*(_num_groups+4);
int _num_cells = _cells_x*_cells_y;
/* Allocate pointers for each cell */
_M = new double* [_num_cells];
_A = new double* [_num_cells];
for ( int i = 0; i < _num_cells; i++ ) {
log_printf(DEBUG,"Allocating matrix memory for cell %d",i);
_M[i] = new double[size_M]; //Allocate M memory for one cell
_A[i] = new double[size_A]; //Allocate A memory for one cell
}
Then, in all the necessary functions, any calls to _M or _A must be changed accordingly. Here's an example:
In Cmfd::constructMatrices -
Before: _A[row*(_num_groups+4)+e+2]
After: _A[cell][e*(_num_groups+4)+e+2]
This fixed the issue for me.
Connect auto-generated Doxygen documentation to Sphinx using Breathe, and post to the gh-pages branch.
The CMFD module currently allows users to specify the number of groups to use in the CMFD solver. This can be anywhere from 1 to the number of groups in the MOC solver. If the number of CMFD groups differs from the number of MOC groups, the MOC group structure is cut up into coarse groups with each coarse group containing the same number of fine groups. If (num_MOC_groups % num_CMFD_groups != 0), the lowest group is assigned the extra groups and thus grows by (num_MOC_groups % num_CMFD_groups) groups. This issue is to add functionality allowing the user to specify the coarse mesh group structure and thereby achieve better spectral resolution in a coarse-group CMFD solve.
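For reference, a minimal sketch of the uniform condensation described above (a hypothetical helper, not the existing Cmfd interface):
#include <vector>
/* Return the number of MOC fine groups assigned to each CMFD coarse group,
 * with the remainder lumped into the lowest-energy (last) coarse group */
std::vector<int> uniformGroupStructure(int num_moc_groups, int num_cmfd_groups) {
  std::vector<int> fine_per_coarse(num_cmfd_groups,
                                   num_moc_groups / num_cmfd_groups);
  fine_per_coarse.back() += num_moc_groups % num_cmfd_groups;
  return fine_per_coarse;
}
The new functionality would instead accept a user-specified vector of coarse group boundaries in place of this uniform split.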
Changing the config.py for each machine leads to merge conflicts when pushing/pulling from different machines. We should make the most commonly changed options command line options to reduce this issue.
Several of the LRA sample input files end in errors of the following type:
[ ERROR ] Unable to set diffusion coefficient for group 0 for Material 2
[ ERROR ] ... which contains 6 energy groups
[ ERROR ] Unable to set diffusion coefficient for group 0 for Material 2
[ ERROR ] ... which contains 6936 energy groups
Also, we should probably consolidate all of the different LRA example input files into a single file. All of the variations are probably only useful for advanced users interested in that specific problem.
The Timer class may produce inconsistent results for runs > 5 seconds. In particular, the time reported for the transport sweep is greater than the total runtime. This seems to be due to compounded roundoff error in adding up the runtime for each transport sweep.
The following changes to how _thread_flux is implemented make it so that the memory for each thread is allocated one at a time instead of all at once.
In ThreadPrivateSolver.h, the definition of _thread_flux should be changed to:
FP_PRECISION** _thread_flux;
In ThreadPrivateSolver.cpp, the following changes should be made:
void ThreadPrivateSolver::initializeFluxArrays() {
CPUSolver::initializeFluxArrays();
/* Delete old flux arrays if they exist */
if (_thread_flux != NULL) {
for (int thread = 0; thread < _num_threads; thread++)
delete [] _thread_flux[thread];
delete [] _thread_flux;
}
int size;
/* Allocate memory for the flux and leakage arrays */
try{
size = _num_FSRs * _num_groups;
/* Allocate pointers for each thread scalar flux */
_thread_flux = new FP_PRECISION* [_num_threads];
for ( int thread = 0; thread < _num_threads; thread++ ) {
log_printf(DEBUG,"Allocating flux memory for thread %d",thread);
_thread_flux[thread] = new FP_PRECISION[size]; //Allocate fluxes for one thread
log_printf(DEBUG,"Location of allocated memory is %X",_thread_flux[thread]);
}
/* Allocate a thread local array of FSR scalar fluxes */
//size = _num_threads * _num_FSRs * _num_groups;
//_thread_flux = new FP_PRECISION[size];
}
catch(std::exception &e) {
log_printf(ERROR, "Could not allocate memory for the solver's fluxes. "
"Backtrace:%s", e.what());
}
log_printf(NORMAL,"Size of thread scalar flux is %d", _num_threads*size);
}
This means that all functions in ThreadPrivateSolver using _thread_flux must be modified too. Here is an example:
void ThreadPrivateSolver::flattenFSRFluxes(FP_PRECISION value) {
CPUSolver::flattenFSRFluxes(value);
int temp_ind;
/* Flatten the thread private flat source region scalar flux array */
#pragma omp parallel for private(temp_ind) schedule(guided)
for (int tid=0; tid < _num_threads; tid++) {
for (int r=0; r < _num_FSRs; r++) {
temp_ind = (r)*_num_groups;
for (int e=0; e < _num_groups; e++) {
_thread_flux[tid][temp_ind + e] = 0.0;
}
}
}
return;
}
The python-dev package appears to be a required dependency.
Also, here are the modifications I made to computeKeff. I created separate loops for fission and absorption so I could reuse cell_rates and group_rates without creating two separate sets, but doubling the memory may be worth it if you want it to go faster. My changes are a bit more straightforward than for computeFSRSources, but let me know if you have any questions or you see an issue.
void CPUSolver::computeKeff() {
Material* material;
FP_PRECISION* sigma_a;
FP_PRECISION* nu_sigma_f;
FP_PRECISION volume;
FP_PRECISION ScaleLeak;
FP_PRECISION tot_abs = 0.0;
FP_PRECISION tot_fission = 0.0;
int tid;
int t_ind;
int flx_ind;
FP_PRECISION* cell_rates = new FP_PRECISION[_num_FSRs];
FP_PRECISION* group_rates = new FP_PRECISION[_num_threads*_num_groups];
/* Loop over all flat source regions and compute the volume-weighted
* absorption rates */
#pragma omp parallel for private(tid, t_ind, flx_ind, volume, material, \
sigma_a, nu_sigma_f) schedule(guided)
for (int r=0; r < _num_FSRs; r++) {
tid = omp_get_thread_num();
volume = _FSR_volumes[r];
material = _FSR_materials[r];
sigma_a = material->getSigmaA();
nu_sigma_f = material->getNuSigmaF();
flx_ind = r*_num_groups;
t_ind = tid*_num_groups;
for (int e=0; e < _num_groups; e++) {
group_rates[t_ind + e] = sigma_a[e] * _scalar_flux[flx_ind + e];
}
/* Reduce absorption across all groups for current cell */
cell_rates[r] = pairwise_sum<FP_PRECISION>(&group_rates[t_ind], _num_groups) * volume;
}
/* Reduce absorption across FSRs */
tot_abs = pairwise_sum<FP_PRECISION>(cell_rates, _num_FSRs);
/* Loop over all flat source regions and compute the volume-weighted
* fission rates */
#pragma omp parallel for private(tid, t_ind, flx_ind, volume, material, \
sigma_a, nu_sigma_f) schedule(guided)
for (int r=0; r < _num_FSRs; r++) {
tid = omp_get_thread_num();
volume = _FSR_volumes[r];
material = _FSR_materials[r];
sigma_a = material->getSigmaA();
nu_sigma_f = material->getNuSigmaF();
flx_ind = r*_num_groups;
t_ind = tid*_num_groups;
for (int e=0; e < _num_groups; e++) {
group_rates[t_ind + e] = nu_sigma_f[e] * _scalar_flux[flx_ind + e];
}
/* Reduce fission across all groups for current cell */
cell_rates[r] = pairwise_sum<FP_PRECISION>(&group_rates[t_ind], _num_groups) * volume;
}
/* Reduce fission across all FSRs */
tot_fission = pairwise_sum<FP_PRECISION>(cell_rates, _num_FSRs);
/* This is my stuff, don't worry about it ;) */
if (_mgtype == 0) {
ScaleLeak = 1.0;
}
else {
ScaleLeak = _geometry->getLeakageScalingFactor();
}
/* Reduce leakage array across tracks, energy groups, polar angles */
int size = 2 * _tot_num_tracks * _polar_times_groups;
_leakage = pairwise_sum<FP_PRECISION>(_boundary_leakage, size) * 0.5;
_leakage *= ScaleLeak;
_k_eff = tot_fission / (tot_abs + _leakage);
log_printf(DEBUG, "abs = %f, fission = %f, leakage = %f, "
"k_eff = %f", tot_abs, tot_fission, _leakage, _k_eff);
delete [] group_rates;
delete [] cell_rates;
return;
}
We should examine the scaling to determine the effect of hardware prefetching. Hardware prefetch can be set at compile time on Intel and IBM BGQ machines using the following compiler flags, where n is the number of cache lines (default setting is 2):
Intel: -opt-prefetch[=n]
IBM: -qprefetch=aggressive, -qprefetch=noaggressive, -qnoprefetch
The end result would be weak and strong scaling plots using the exp intrinsic vs. the table lookup for different levels of prefetch aggressiveness (1, 2, 3, 4 cache lines). This may help us explain the superlinear weak scaling, akin to arguments made in the Simon and McGalliard paper "Multi-Core Processor Memory Contention Benchmark Analysis Case Study."
The main transport sweep loop may see a performance boost from static scheduling where the chunk size is the total number of tracks in an azimuthal halfspace divided by the number of threads.
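A sketch of what that scheduling could look like (the loop body and names here are illustrative, not the actual sweep code):
#include <omp.h>
void sweepHalfspace(int num_halfspace_tracks) {
  /* One contiguous chunk of tracks per thread */
  int chunk = num_halfspace_tracks / omp_get_max_threads();
  if (chunk == 0)
    chunk = 1;
  #pragma omp parallel for schedule(static, chunk)
  for (int t=0; t < num_halfspace_tracks; t++) {
    /* ... trace segments along track t and tally the fluxes ... */
  }
}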
This would be a mini-app for OpenMOC akin to XSBench. The mini-app would isolate the different kernels in OpenMOC and allow users to measure the performance of each one. This would be useful for measuring the power profile of OpenMOC's kernels with finer code granularity than simply measuring the power profile of the entire code.
What do we need:
Runtime Options
I ran into some issues comparing the residuals of runs using different numbers of groups. The source residual for a 361-group problem starts at 19.0, i.e. sqrt(_num_groups), so setting the tolerance to 1E-5 actually results in converging the eigenvalue effectively to 1E-5/19.0. This seemed to be resolved by changing the source residual definition in CPUSolver::computeFSRSources() as follows:
source_residual = sqrt(source_residual / (_num_FSRs * _num_groups) );
We can write a Python routine for a new "openmoc.perf" module which will set environment variables at runtime. This will allow us to set the OMP affinity and to explore different mappings across processors/sockets and to investigate NUMA and resource contention effects.
Develop a performance model of the solver based on the number of segments, groups, and flat source regions. Implement a Python module with routines to estimate the number of flops, loads, and the runtime given instances of a Geometry and a TrackGenerator object. This will be useful for comparing predicted values with measured PAPI values.
In addition, it may be useful to write some C/C++ code and wrap it as a C extension for Python to measure and report key hardware characteristics, such as cache size and latency.
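For the latency piece, a rough sketch of the kind of pointer-chasing microbenchmark that could be wrapped as a C extension (standalone C++ here; the names are illustrative):
#include <algorithm>
#include <chrono>
#include <numeric>
#include <random>
#include <vector>
/* Walk a single random pointer-chasing cycle of n entries and return the
 * average latency per access (in ns) for that working-set size */
double measureLatencyNs(size_t n, size_t num_accesses) {
  std::vector<size_t> order(n), next(n);
  std::iota(order.begin(), order.end(), 0);
  std::shuffle(order.begin(), order.end(), std::mt19937(42));
  for (size_t i=0; i < n; i++)
    next[order[i]] = order[(i + 1) % n];
  volatile size_t idx = 0;
  auto start = std::chrono::steady_clock::now();
  for (size_t i=0; i < num_accesses; i++)
    idx = next[idx];
  auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::nano>(stop - start).count()
         / num_accesses;
}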
The code breaks with older versions of the gcc, icpc, and bgxlc compilers due to the OpenMP parallel regions in Cmfd.cpp. One or more variable-length C++ containers (std::vector or std::map) are declared as private variables for these regions. The bug can be fixed by either using a fixed-length container, type casting the container as an array pointer, or initializing a variable private to each thread to point to the container from within the parallel region.
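A minimal sketch of the last fix, where the container is declared inside the parallel region so each thread constructs its own instance rather than listing a class-type object in a private(...) clause (the names here are illustrative):
#include <vector>
void tallyExample(int num_cells, int num_groups) {
  #pragma omp parallel
  {
    /* Each thread builds its own container inside the parallel region,
     * avoiding the std::vector in private(...) that trips up older compilers */
    std::vector<double> local_tally(num_groups, 0.0);
    #pragma omp for
    for (int cell=0; cell < num_cells; cell++) {
      for (int g=0; g < num_groups; g++)
        local_tally[g] += 1.0; /* ... per-thread accumulation ... */
    }
  }
}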
CPUSolver.cpp has quite a few tabs rather than spaces, which leads to unexpected indentations in different text editors. I suggest tabs be eliminated in favor of spaces.
Also, after talking to Will, it sounds like the desired code convention moving forward is to use two spaces (rather than four spaces) for indenting.
The suite should test all classes and routines, and run through a series of benchmark problems using all different solver types and parameters:
Parameters:
Benchmarks:
An interesting study for looking at our superlinear weak scaling would be to localize points of contention on shared memory systems (L2, L3 caches, FSB, etc.) by running the code in parallel with different thread-to-core mappings. This can be done using OpenMP's affinity mechanisms - Intel's KMP_AFFINITY environment variable, or IBM's BG_THREADLAYOUT environment variable for Blue Gene.
A good reference for study is R. Hood's talk titled "Performance Impact of Resource Contention in Multicore Systems."
Solver class:
Attributes:
Methods:
setPAPIevent(char* event)
getPAPIcounts(char* event, int code_section)
clearPAPIevent(char* event)
Code Sections to Profile:
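To make the proposal concrete, here is a rough sketch of how the three methods listed above might wrap PAPI; the member variables, return type, and the handling of code_section are assumptions, since the issue does not specify them:
#include <papi.h>
#include <string>
#include <vector>
class Solver {
private:
  int _papi_event_set = PAPI_NULL;
  std::vector<std::string> _papi_events;
public:
  void setPAPIevent(char* event) {
    /* Lazily initialize the PAPI library and the event set */
    if (_papi_event_set == PAPI_NULL) {
      PAPI_library_init(PAPI_VER_CURRENT);
      PAPI_create_eventset(&_papi_event_set);
    }
    PAPI_add_named_event(_papi_event_set, event);
    _papi_events.push_back(event);
  }
  long long getPAPIcounts(char* event, int code_section) {
    /* Assumes PAPI_start(_papi_event_set) was called before the profiled
     * section; per-section bookkeeping is omitted in this sketch */
    (void) code_section;
    std::vector<long long> counts(_papi_events.size(), 0);
    PAPI_read(_papi_event_set, counts.data());
    for (size_t i=0; i < _papi_events.size(); i++) {
      if (_papi_events[i] == event)
        return counts[i];
    }
    return 0;
  }
  void clearPAPIevent(char* event) {
    PAPI_remove_named_event(_papi_event_set, event);
  }
};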
Kazutomo Yoshi at ANL may help us with using an OPA API for inserting assembly-optimized atomic operations in place of the OpenMP atomics used for the scalar flux update in the CPUSolver. This may reduce the parallel scaling performance degradation that we see with this solver as compared to the ThreadPrivateSolver.
ThreadPrivateSolver and CPUSolver have optional Geometry and TrackGenerator arguments in the constructor (default to NULL), but OpenMOC seg faults if Geometry and TrackGenerator objects aren't supplied.
For spatial domain decomposition, we will need modular ray tracing for efficient communication of angular fluxes across domains. To give users the option of using cyclic ray tracing or modular ray tracing, we can subclass the TrackGenerator class into a CyclicTrackGenerator and a ModularTrackGenerator class for each application.
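A bare-bones sketch of that hierarchy (the virtual method and whether the base class becomes abstract are design assumptions, not the existing TrackGenerator interface):
class TrackGenerator {
public:
  virtual ~TrackGenerator() {}
  /* Lay tracks down across the geometry */
  virtual void generateTracks() = 0;
};
class CyclicTrackGenerator : public TrackGenerator {
public:
  /* Tracks cycle around the full geometry, as is done today */
  void generateTracks() { /* ... */ }
};
class ModularTrackGenerator : public TrackGenerator {
public:
  /* Tracks are laid down per spatial domain so angular fluxes can be
   * exchanged efficiently at domain boundaries */
  void generateTracks() { /* ... */ }
};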
Update online installation instructions using Sphinx and post to the gh-pages branch for web hosting by GitHub.
Let users know they need to install python-dev and not only the python binaries if they encounter:
$ openmoc/openmoc_wrap.cpp:149:20: fatal error: Python.h: No such file or directory