madewink / numamma

This project forked from numamma/numamma

Fork of numamma, on which I worked during my internship, fully merged.

License: MIT License

CMake 5.19% Shell 5.09% Python 1.40% R 2.02% C 86.28% Makefile 0.03%

numamma's People

Contributors

kwakwaouaite, madewink, trahay

numamma's Issues

Implement new numap interruption handling

Numap can now prevent data loss when many samples are recorded: at regular intervals (every N samples, where N is adjustable) it raises an interruption and catches it, letting the user supply a handler that has access to the measure.
The goal is to implement this feature in numamma. This is what needs to be done:

  • Write a handler that calls __copy_samples
  • Make __copy_samples_thread thread safe
  • Make sure we don't have duplicates (take data_tail into consideration)
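The "no duplicates" item can be sketched as follows. This is a minimal illustration, not numap's or numamma's actual code: struct sample_ring and copy_new_samples are hypothetical names, and the ring is modelled with two monotonically increasing byte counters (head for the producer, tail for the consumer). Only the bytes between tail and head are copied, and tail is advanced afterwards so a later call cannot return them again.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for one thread's sampling buffer. */
struct sample_ring {
    const uint8_t *data;  /* start of the ring storage */
    size_t size;          /* ring capacity in bytes */
    uint64_t head;        /* total bytes ever written by the producer */
    uint64_t tail;        /* total bytes already copied by the consumer */
};

/* Copy the bytes in [tail, head) into dst, then advance tail.
 * Returns the number of bytes copied, or 0 if there is nothing new,
 * dst is too small, or the producer has overrun the ring. */
static size_t copy_new_samples(struct sample_ring *r, uint8_t *dst, size_t dst_len)
{
    assert(r->size > 0 && r->head >= r->tail);
    size_t pending = (size_t)(r->head - r->tail);
    if (pending == 0 || pending > dst_len || pending > r->size)
        return 0;

    size_t start = (size_t)(r->tail % r->size);
    size_t first = r->size - start;   /* bytes available before the wrap point */
    if (first > pending)
        first = pending;

    memcpy(dst, r->data + start, first);
    memcpy(dst + first, r->data, pending - first);  /* wrapped part, may be 0 bytes */

    r->tail = r->head;                /* everything up to head is now consumed */
    return pending;
}
```

Calling copy_new_samples twice in a row without new data returns 0 the second time, which is the property the handler needs.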

The first idea is to write a handler that just calls __copy_samples and updates data_tail, since most of the work is already done there. However, we have to find a way to pass the access_type argument and read it in the handler, whose only parameters are a numap_sampling measure and a file descriptor.
Furthermore, we may need to create a function __copy_samples_thread, since __copy_samples loops over every tid whereas we only want to copy the samples of the current tid. This second point should not be a problem, though.
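One possible way around the fixed handler signature is to register one thin handler per access type, each forwarding to a shared function with the type hard-coded. The backtrace in the "Invalid header size" issue (numap_read_handler calling numap_generic_handler) suggests a similar shape, but the sketch below is only an illustration: the enum, struct measure and all function names are hypothetical and do not reproduce numap's API.

```c
#include <assert.h>
#include <stddef.h>

enum access_type { ACCESS_READ, ACCESS_WRITE };

/* Hypothetical placeholder for the sampling measure passed by the library. */
struct measure { int nb_handled[2]; };

/* Shared body: does the work and knows which kind of access it handles. */
static void generic_handler(struct measure *m, int fd, enum access_type type)
{
    assert(m != NULL);
    (void)fd;               /* unused in this sketch */
    m->nb_handled[type]++;  /* stand-in for "copy the samples of this thread" */
}

/* Thin wrappers matching the two-parameter signature the library expects. */
static void read_handler(struct measure *m, int fd)
{
    generic_handler(m, fd, ACCESS_READ);
}

static void write_handler(struct measure *m, int fd)
{
    generic_handler(m, fd, ACCESS_WRITE);
}
```

The design choice here is to encode access_type in the choice of callback rather than in global state, so nothing has to be looked up from inside the handler.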

Invalid header size

It seems that, while running, we sometimes hit an "Invalid header size" error. 49b9388 was supposed to fix it, but I ran into it once after applying that commit. Even though it seems really rare, I want to investigate it a bit.
Log from the time it happened:

$ SAMPLING_RATE=1000 mem_intercept ~/home/mad/stage/NPB3.3.1/NPB3.3-OMP/bin/bt.A.x

NumaMMA settings:
-----------------
Sampling rate: 1000
Match samples: 1
Buffer size: 131072
Alarm interval: 0 ms
Memory access analysis: offline
-----------------
[NumaMMA]  This program was compiled with -fPIE. It is mapped at address 0x5645ed066000
Found a global variable: completed.7325 (defined at ). base addr=0x5645ed079100, size=1
Found a global variable: constants_ (defined at ). base addr=0x5645ed079120, size=1272
Found a global variable: fields_ (defined at ). base addr=0x5645ed079640, size=45427200
Found a global variable: global_ (defined at ). base addr=0x5645ed079620, size=24
Found a global variable: sec.2316 (defined at ). base addr=0x5645ed0790e8, size=4
Found a global variable: tt_ (defined at ). base addr=0x5645ed077660, size=1024
Found a global variable: work_1d_ (defined at ). base addr=0x5645ed066000, size=6240
Found a global variable: work_lhs_ (defined at ). base addr=0x5645ed067860, size=65024


 NAS Parallel Benchmarks (NPB3.3-OMP) - BT Benchmark

 No input file inputbt.data. Using compiled defaults
 Size:   64x  64x  64
 Iterations:  200       dt:   0.0008000
 Number of available threads:     4

 Time step    1
 Time step   20
 Time step   40
 Time step   60
 Time step   80
Error: invalid header size = 0

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
#0  0x7fb3440f5b90 in ???
#1  0x7fb3440f4dc5 in ???
#2  0x7fb343d0b83f in ???
        at /build/glibc-vjB4T1/glibc-2.28/signal/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0
#3  0x7fb343d0b7bb in __GI_raise
        at ../sysdeps/unix/sysv/linux/raise.c:51
#4  0x7fb343cf6534 in __GI_abort
        at /build/glibc-vjB4T1/glibc-2.28/stdlib/abort.c:79
#5  0x7fb344391b77 in __analyze_buffer
        at /home/mad/stage/numamma-build/numamma/src/mem_sampling.c:723
#6  0x7fb3443901ec in __copy_samples_thread
        at /home/mad/stage/numamma-build/numamma/src/mem_sampling.c:590
#7  0x7fb34438cf10 in numap_generic_handler
        at /home/mad/stage/numamma-build/numamma/src/mem_sampling.c:218
#8  0x7fb34438cf38 in numap_read_handler
        at /home/mad/stage/numamma-build/numamma/src/mem_sampling.c:222
#9  0x7fb343cc09c4 in ???
#10  0x7fb343ea772f in ???
        at /build/glibc-vjB4T1/glibc-2.28/nptl/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0
#11  0x5645ed070e4c in binvcrhs_
        at /home/mad/stage/NPB3.3.1/NPB3.3-OMP/BT/solve_subs.f:230
#12  0x5645ed070020 in y_solve_._omp_fn.0
        at /home/mad/stage/NPB3.3.1/NPB3.3-OMP/BT/y_solve.f:351
#13  0x7fb343f2ca41 in ???
#14  0x5645ed070238 in y_solve_
        at /home/mad/stage/NPB3.3.1/NPB3.3-OMP/BT/y_solve.f:45
#15  0x5645ed06b577 in adi_
        at /home/mad/stage/NPB3.3.1/NPB3.3-OMP/BT/adi.f:13
#16  0x5645ed067b00 in bt
        at /home/mad/stage/NPB3.3.1/NPB3.3-OMP/BT/bt.f:157
#17  0x5645ed068214 in main
        at /home/mad/stage/NPB3.3.1/NPB3.3-OMP/BT/bt.f:213
../../numamma-build/numamma/build/src/mem_intercept: line 76: 22487 Aborted                 LD_PRELOAD=$ld_preload LD_LIBRARY_PATH=$ld_library_path $prog_name $*
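While investigating, it may help to walk the copied buffer defensively instead of aborting on the first bad header. The sketch below is not numamma's __analyze_buffer: count_records and struct record_header are hypothetical, and the struct only mirrors the field layout of struct perf_event_header from <linux/perf_event.h> so the example compiles on its own. It does not explain why a zero-sized header appears; it only shows how to detect one without reading past the buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same field layout as struct perf_event_header, redeclared locally. */
struct record_header {
    uint32_t type;
    uint16_t misc;
    uint16_t size;  /* total record size in bytes, header included */
};

/* Walk a linear copy of the sample buffer and count complete records.
 * Returns -1 if a header announces a size smaller than the header itself
 * (including 0). Stops early on a truncated trailing record. */
static long count_records(const uint8_t *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    long n = 0;
    size_t off = 0;
    while (len - off >= sizeof(struct record_header)) {
        struct record_header h;
        memcpy(&h, buf + off, sizeof h);  /* avoids unaligned access */
        if (h.size < sizeof h)
            return -1;                    /* invalid header size */
        if (h.size > len - off)
            break;                        /* record not fully in the buffer */
        off += h.size;
        n++;
    }
    return n;
}
```

Returning an error code lets the caller log the offset and the surrounding bytes, which is more useful for a rare bug than an immediate abort.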

Double free or corruption error

Sometimes, when running numamma's mem_intercept on a program (here bt.A.x from NASA's NAS Parallel Benchmarks), a "double free or corruption" error occurs.
It has been quite hard for me to reproduce consistently (decreasing the sampling rate seems to make it happen more often, but it also brings my laptop to its knees).
Here is the error message:

double free or corruption (fasttop)

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
#0  0x7f0060f43b90 in ???
#1  0x7f0060f42dc5 in ???
#2  0x7f0060b5983f in ???
        at /build/glibc-vjB4T1/glibc-2.28/signal/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0
#3  0x7f0060b597bb in __GI_raise
        at ../sysdeps/unix/sysv/linux/raise.c:51
#4  0x7f0060b44534 in __GI_abort
        at /build/glibc-vjB4T1/glibc-2.28/stdlib/abort.c:79
#5  0x7f0060b9b507 in __libc_message
        at ../sysdeps/posix/libc_fatal.c:181
#6  0x7f0060ba1c19 in malloc_printerr
        at /build/glibc-vjB4T1/glibc-2.28/malloc/malloc.c:5341
#7  0x7f0060ba35d6 in _int_free
        at /build/glibc-vjB4T1/glibc-2.28/malloc/malloc.c:4258
#8  0x7f00611dbfa9 in free
        at /home/mad/stage/numamma-build/numamma/src/mem_intercept.c:295
#9  0x7f00611e0410 in mem_sampling_finalize
        at /home/mad/stage/numamma-build/numamma/src/mem_sampling.c:338
#10  0x7f00611e9e62 in ma_finalize
        at /home/mad/stage/numamma-build/numamma/src/mem_analyzer.c:1303
#11  0x7f00611dca3a in __memory_conclude
        at /home/mad/stage/numamma-build/numamma/src/mem_intercept.c:534
#12  0x7f00613086f5 in _dl_fini
        at /build/glibc-vjB4T1/glibc-2.28/elf/dl-fini.c:138
#13  0x7f0060b5bd8b in __run_exit_handlers
        at /build/glibc-vjB4T1/glibc-2.28/stdlib/exit.c:108
#14  0x7f0060b5beb9 in __GI_exit
        at /build/glibc-vjB4T1/glibc-2.28/stdlib/exit.c:139
#15  0x7f0060b460a1 in __libc_start_main
        at ../csu/libc-start.c:342
#16  0x55be8dd511e9 in ???
#17  0xffffffffffffffff in ???
./mem_intercept: line 76:  5866 Aborted                 (core dumped) LD_PRELOAD=$ld_preload LD_LIBRARY_PATH=$ld_library_path $prog_name $*

Here is the backtrace I get in gdb:

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007f0060b44535 in __GI_abort () at abort.c:79
#2  0x00007f0060b9b508 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7f0060ca628d "%s\n") at ../sysdeps/posix/libc_fatal.c:181
#3  0x00007f0060ba1c1a in malloc_printerr (str=str@entry=0x7f0060ca7fb0 "double free or corruption (fasttop)") at malloc.c:5341
#4  0x00007f0060ba35d7 in _int_free (av=0x7f0054000020, p=0x7f00540c8990, have_lock=<optimized out>) at malloc.c:4258
#5  0x00007f00611dbfaa in free (ptr=0x7f00540c89e0) at /home/mad/stage/numamma-build/numamma/src/mem_intercept.c:295
#6  0x00007f00611e0411 in mem_sampling_finalize () at /home/mad/stage/numamma-build/numamma/src/mem_sampling.c:338
#7  0x00007f00611e9e63 in ma_finalize () at /home/mad/stage/numamma-build/numamma/src/mem_analyzer.c:1303
#8  0x00007f00611dca3b in __memory_conclude () at /home/mad/stage/numamma-build/numamma/src/mem_intercept.c:534
#9  0x00007f00613086f6 in _dl_fini () at dl-fini.c:138
#10 0x00007f0060b5bd8c in __run_exit_handlers (status=0, listp=0x7f0060cdd718 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true, 
    run_dtors=run_dtors@entry=true) at exit.c:108
#11 0x00007f0060b5beba in __GI_exit (status=<optimized out>) at exit.c:139
#12 0x00007f0060b460a2 in __libc_start_main (main=0x55be8dd521f6 <main>, argc=1, argv=0x7ffc51d96678, init=<optimized out>, fini=<optimized out>, 
    rtld_fini=<optimized out>, stack_end=0x7ffc51d96668) at ../csu/libc-start.c:342
#13 0x000055be8dd511ea in _start ()

Here is the address of the pointer passed to free in mem_sampling_finalize:

(gdb) frame 6
#6  0x00007f00611e0411 in mem_sampling_finalize () at /home/mad/stage/numamma-build/numamma/src/mem_sampling.c:338
338           free(prev->buffer);
(gdb) p prev->buffer
$3 = (struct perf_event_header *) 0x7f00540c89e0

Here is the address of the pointer passed to libfree in free:

(gdb) frame 5
#5  0x00007f00611dbfaa in free (ptr=0x7f00540c89e0) at /home/mad/stage/numamma-build/numamma/src/mem_intercept.c:295
295       libfree(p_block->p_ptr);
(gdb) p p_block ->p_ptr 
$4 = (void *) 0x7f00540c89a0

Before launching it, I added a print to stdout on each calloc, realloc and free in mem_intercept.c. Here is the end of the output (the full log is far too long):

DEBUG : free 0x7f00580c0f50 (p_ptr of 0x7f00580c0f50 in free(0x7f00580c0f90))
DEBUG : free 0x7f00500d2420 (p_ptr of 0x7f00500d2420 in free(0x7f00500d2460))
DEBUG : free 0x7f00500d23a0 (p_ptr of 0x7f00500d23a0 in free(0x7f00500d23e0))
DEBUG : free 0x7f00580c0a10 (p_ptr of 0x7f00580c0a10 in free(0x7f00580c0a50))
Analyzing sample buffer 380/476 [be96c2a41e2 - be96c2b18f6]. Total samples so far: 25357DEBUG : free 0x7f00540cb1c0 (p_ptr of 0x7f00540cb1c0 in free(0x7f00540cb200))
DEBUG : free 0x7f00500d2180 (p_ptr of 0x7f00500d2180 in free(0x7f00500d21c0))
DEBUG : free 0x7f00540cb140 (p_ptr of 0x7f00540cb140 in free(0x7f00540cb180))
DEBUG : free 0x7f00540caf00 (p_ptr of 0x7f00540caf00 in free(0x7f00540caf40))
DEBUG : free 0x7f00580c0590 (p_ptr of 0x7f00580c0590 in free(0x7f00580c05d0))
DEBUG : free 0x7f00500d1d00 (p_ptr of 0x7f00500d1d00 in free(0x7f00500d1d40))
DEBUG : free 0x7f00580c04b0 (p_ptr of 0x7f00580c04b0 in free(0x7f00580c04f0))
DEBUG : free 0x7f00500d1c20 (p_ptr of 0x7f00500d1c20 in free(0x7f00500d1c60))
DEBUG : free 0x7f00540caa80 (p_ptr of 0x7f00540caa80 in free(0x7f00540caac0))
DEBUG : free 0x7f00580c0030 (p_ptr of 0x7f00580c0030 in free(0x7f00580c0070))
Analyzing sample buffer 390/476 [be96c186a35 - be96c1955e3]. Total samples so far: 25368DEBUG : free 0x7f00580bffb0 (p_ptr of 0x7f00580bffb0 in free(0x7f00580bfff0))
DEBUG : free 0x7f00580bfd90 (p_ptr of 0x7f00580bfd90 in free(0x7f00580bfdd0))
DEBUG : free 0x7f00500d16c0 (p_ptr of 0x7f00500d16c0 in free(0x7f00500d1700))
DEBUG : free 0x7f00540ca5a0 (p_ptr of 0x7f00540ca5a0 in free(0x7f00540ca5e0))
DEBUG : free 0x7f00500d1580 (p_ptr of 0x7f00500d1580 in free(0x7f00500d15c0))
DEBUG : free 0x7f00540ca380 (p_ptr of 0x7f00540ca380 in free(0x7f00540ca3c0))
DEBUG : free 0x7f00580bf910 (p_ptr of 0x7f00580bf910 in free(0x7f00580bf950))
DEBUG : free 0x7f00580bf830 (p_ptr of 0x7f00580bf830 in free(0x7f00580bf870))
DEBUG : free 0x7f00500d1100 (p_ptr of 0x7f00500d1100 in free(0x7f00500d1140))
DEBUG : free 0x7f00540c9f00 (p_ptr of 0x7f00540c9f00 in free(0x7f00540c9f40))
Analyzing sample buffer 400/476 [be96c080bbf - be96c08e879]. Total samples so far: 25378DEBUG : free 0x7f00500d0e80 (p_ptr of 0x7f00500d0e80 in free(0x7f00500d0ec0))
DEBUG : free 0x7f00580bf410 (p_ptr of 0x7f00580bf410 in free(0x7f00580bf450))
DEBUG : free 0x7f00540c9a80 (p_ptr of 0x7f00540c9a80 in free(0x7f00540c9ac0))
DEBUG : free 0x7f00500d0a00 (p_ptr of 0x7f00500d0a00 in free(0x7f00500d0a40))
DEBUG : free 0x7f00540c9800 (p_ptr of 0x7f00540c9800 in free(0x7f00540c9840))
DEBUG : free 0x7f00500d0780 (p_ptr of 0x7f00500d0780 in free(0x7f00500d07c0))
DEBUG : free 0x7f00580bef30 (p_ptr of 0x7f00580bef30 in free(0x7f00580bef70))
DEBUG : free 0x7f00580bedd0 (p_ptr of 0x7f00580bedd0 in free(0x7f00580bee10))
DEBUG : free 0x7f00540c9440 (p_ptr of 0x7f00540c9440 in free(0x7f00540c9480))
DEBUG : free 0x7f00580be890 (p_ptr of 0x7f00580be890 in free(0x7f00580be8d0))

I searched this output for the addresses above and none of them matched.
If the buffer has already been freed somewhere, I have no idea where that happens. It might also be corrupted, in which case I have no idea what is going on or why.
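In the gdb session and in the DEBUG lines, the pointer handed to the underlying allocator (p_ptr) is always 0x40 bytes below the pointer the program passed to free. A wrapper with that layout can be sketched as below, with a marker in the header so that a pointer that was not produced by the wrapper is rejected instead of being forwarded. This is an assumption-laden illustration: struct block_info, BLOCK_MAGIC, wrapped_malloc and wrapped_free are hypothetical, and the constant 0x40 is only taken from the log above, not from mem_intercept.c.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define HEADER_SIZE 0x40u        /* offset seen in the log; assumed constant */
#define BLOCK_MAGIC 0x6e756d61u  /* arbitrary marker, hypothetical */

/* Hypothetical bookkeeping stored just before the pointer given to the user. */
struct block_info {
    uint32_t magic;  /* BLOCK_MAGIC while the block is live */
    void *p_ptr;     /* pointer really returned by the underlying allocator */
};

static void *wrapped_malloc(size_t size)
{
    assert(sizeof(struct block_info) <= HEADER_SIZE);
    void *p_ptr = malloc(HEADER_SIZE + size);
    if (p_ptr == NULL)
        return NULL;
    struct block_info info = { BLOCK_MAGIC, p_ptr };
    memcpy(p_ptr, &info, sizeof info);
    return (char *)p_ptr + HEADER_SIZE;
}

/* Returns 0 on success, -1 if the header in front of u_ptr does not carry
 * the marker (the pointer was not produced by wrapped_malloc). */
static int wrapped_free(void *u_ptr)
{
    if (u_ptr == NULL)
        return 0;
    char *header = (char *)u_ptr - HEADER_SIZE;
    struct block_info info;
    memcpy(&info, header, sizeof info);
    if (info.magic != BLOCK_MAGIC)
        return -1;                       /* refuse to forward the pointer */
    info.magic = 0;                      /* clear the marker before releasing */
    memcpy(header, &info, sizeof info);
    free(info.p_ptr);
    return 0;
}
```

A check like this is only best effort: it catches buffers that bypassed the wrapper, but it cannot reliably detect a second free of the same block, because the header memory is no longer valid to read once it has been released.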

Samples not matching

Since library symbol detection was added (around 7e802b8), the proportion of samples that do not match alternates between two values: approximately 70% (sometimes up to 75%) and approximately 2%.
Examples:

88590 samples (including 2214 samples that do not match a known memory buffer / 2.499153%)
105901 samples (including 2148 samples that do not match a known memory buffer / 2.028309%)
101488 samples (including 71959 samples that do not match a known memory buffer / 70.903946%)
77273 samples (including 54085 samples that do not match a known memory buffer / 69.992104%)

This is something I cannot explain for now. I dumped all unmatched samples to a file, indicating for each one whether it falls in an address range listed in /proc/pid/maps. Almost all of them do (a few don't, but no more than ten). This is what I get (one in the stack, one in the heap, one in a library, one in an anonymous region, and one unmatched):

0x7fff7473c40c located in 7fff7471e000-7fff7473f000 rw-p 00000000 00:00 0                          [stack]
0x564d4d7e4088 located in 564d4d7e4000-564d4dabe000 rw-p 00000000 00:00 0                          [heap]
0x7f8b50961c40 located in 7f8b50961000-7f8b50963000 rw-p 001ba000 08:02 1311555                    /lib/x86_64-linux-gnu/libc-2.28.so
0x7f8b50475430 located in 7f8b4f340000-7f8b5047a000 rw-p 00000000 00:00 0
0xffff977936302b00 matching no address range in /proc/5897/maps
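The range check behind this dump can be sketched in a few lines. maps_line_contains is a hypothetical helper, not the script actually used. It only relies on the fact that each /proc/<pid>/maps line starts with a hexadecimal start-end pair, where the end address is exclusive.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Returns 1 if addr lies in the [start, end) range that begins a
 * /proc/<pid>/maps line, 0 if it does not, -1 if the line cannot be parsed. */
static int maps_line_contains(const char *line, uint64_t addr)
{
    assert(line != NULL);
    uint64_t start, end;
    if (sscanf(line, "%" SCNx64 "-%" SCNx64, &start, &end) != 2)
        return -1;
    return addr >= start && addr < end;
}
```

Running it over every line of the maps file for each unmatched sample, and printing the first line that returns 1, gives output of the form shown above.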

I ran many tests to gather some statistics and wrote a script to count how many samples came from where. Here are two representative results:

98325 samples (including 69545 samples that do not match a known memory buffer / 70.729721%)
all : 21510
stack : 82
heap : 3
lib : 3

105901 samples (including 2148 samples that do not match a known memory buffer / 2.028309%)
all : 22319
stack : 93
heap : 3
lib : 1

The first line is the one printed by numamma's memory analyzer; the following ones count the occurrences of those keywords in a file of unmatched samples that has been passed through sort and uniq, since there are some duplicates. It is interesting to note that the numbers don't match at all (samples are sometimes located twice, for example once in 564d4d7e4000-564d4dabe000 and once in 564d4d7e4000-564d4e408000, but I don't think that explains it; this must be investigated). The samples that are not in the stack, the heap, or a library are mostly located in anonymous regions.

I then wanted to see whether detecting more symbols would do the trick, because I was restricting detection to symbols with type STT_OBJECT and bind STB_GLOBAL. But when I tried to add only the STT_FUNC type (still restricting to STB_GLOBAL), numamma froze and was killed while browsing the list of memory buffers in the final analysis.
I then dumped all unregistered symbols to a file, with entries like this:

getenv
	addr : 0x7f172a0b1000
	size : 0
	type : 2 (STT_FUNC)
	bind : 1 (STB_GLOBAL)
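For reference, the type and bind shown in such an entry are both packed into the st_info byte of an ELF symbol and are extracted with the ELF64_ST_TYPE and ELF64_ST_BIND macros from <elf.h>. The sketch below shows the original filter and a wider one. The function names are hypothetical, and skipping zero-sized symbols in the wider filter is an assumption made for this example (a symbol of size 0, like the getenv entry above, covers no address range), not something numamma is known to do.

```c
#include <assert.h>
#include <elf.h>
#include <stddef.h>

/* Original filter: keep global data objects only. */
static int is_global_object(const Elf64_Sym *sym)
{
    assert(sym != NULL);
    return ELF64_ST_TYPE(sym->st_info) == STT_OBJECT
        && ELF64_ST_BIND(sym->st_info) == STB_GLOBAL;
}

/* Wider filter: also accept global functions, but drop symbols whose
 * size is 0 since no sample address can fall inside them. */
static int is_global_object_or_func(const Elf64_Sym *sym)
{
    assert(sym != NULL);
    unsigned type = ELF64_ST_TYPE(sym->st_info);
    return ELF64_ST_BIND(sym->st_info) == STB_GLOBAL
        && (type == STT_OBJECT || type == STT_FUNC)
        && sym->st_size > 0;
}
```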

I tried to match those symbols with the unmatched samples. I did it only once, so the result may not be representative, but it was disappointing, since only one sample matched:

0x7f172abb37c8 : located in mem_list_lock (0x7f172abb37c0)
mem_list_lock
	addr : 0x7f172abb37c0
	size : 40
	type : 1 (STT_OBJECT)
	bind : 0 (STB_LOCAL)
0x7f172abb37c8 located in 7f172aab4000-7f172abbc000 rw-p 00000000 00:00 0
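The matching itself amounts to testing whether a sample address falls in [addr, addr + size) for some symbol. A minimal sketch, using the two entries shown in this issue as data; struct sym_range and find_symbol are hypothetical names, and the linear search is only meant to illustrate the test, not to be efficient.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record for one dumped symbol. */
struct sym_range {
    const char *name;
    uint64_t addr;
    uint64_t size;
};

/* Return the first symbol whose [addr, addr + size) range contains sample,
 * or NULL if there is none. A symbol of size 0 can never match. */
static const struct sym_range *find_symbol(const struct sym_range *syms,
                                           size_t n, uint64_t sample)
{
    assert(syms != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (sample >= syms[i].addr && sample - syms[i].addr < syms[i].size)
            return &syms[i];
    }
    return NULL;
}
```

With this test, the sample 0x7f172abb37c8 matches mem_list_lock (offset 8 in a 40-byte object), while an entry such as getenv with size 0 cannot match any sample, even one equal to its base address.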

That's all I have for now, I will keep investigating.
