
numap's Introduction

Overview

numap is a Linux library for memory profiling based on the hardware performance monitoring unit (PMU). Its main objective is to provide a high-level abstraction for:

  • Sampling of load requests issued by cores
  • Sampling of store requests issued by cores

Supported processors

Intel processors with family_model information (decimal notation)

  • Nehalem (06_26, 06_30, 06_31, 06_46)
  • Sandy Bridge (06_42, 06_45)
  • Westmere (06_37, 06_44)
  • Ivy Bridge (06_58, 06_62)
  • Haswell (06_60, 06_63, 06_69, 06_70)
  • Broadwell (06_61, 06_71, 06_79, 06_86)
  • Kaby Lake (06_142, 06_158)
  • Sky Lake (06_94, 06_78)
  • Cannon Lake (06_102)
  • Ice Lake (06_126)

Not implemented Intel processors:

  • Knights Ferry (11_00)
  • Knights Corner (11_01)
  • Knights Mill (06_133)
  • Knights Landing (06_87)

AMD processors

  • Ongoing development

Folders Organization

  • examples: contains some examples showing how to use numap.

  • include: contains numap headers

  • src: contains numap implementation files

  • Makefile: is a Makefile building both the library and the examples

Dependencies

  • libpfm4
  • libnuma

Howto: extend numap in order to take your processor model into account

Intro

The goal is to tell numap which read/write events to use on a specific architecture. The get_archi function specifies for each architecture which events to use:

switch(archi_id) {
/* ... */
  case CPU_MODEL(6, 158):
  case CPU_MODEL(6, 142):
    /* Both Kaby Lake models share the same sampling events. */
    snprintf(arch->name, 256, "Kaby Lake micro arch");
    snprintf(arch->sampling_read_event, 256, "MEM_TRANS_RETIRED:LOAD_LATENCY:ldlat=3");
    snprintf(arch->sampling_write_event, 256, "MEM_INST_RETIRED:ALL_STORES");
    break;
/* ... */
}

You can add a new architecture by adding a new case.
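
For illustration, here is a minimal self-contained sketch of how such a switch can sit inside a function, including a default branch for unknown models. The CPU_MODEL encoding, the struct layout, the fill_archi name and the return codes are assumptions made for this example; numap's actual definitions may differ.

```c
#include <assert.h>
#include <stdio.h>

#define NAME_LEN 256

/* Hypothetical encoding of (family, model) into one integer. */
#define CPU_MODEL(family, model) (((family) << 8) | (model))

/* Hypothetical, reduced version of the architecture description. */
struct archi {
  char name[NAME_LEN];
  char sampling_read_event[NAME_LEN];
  char sampling_write_event[NAME_LEN];
};

/* Fills arch for a known model and returns 0; returns -1 otherwise. */
int fill_archi(unsigned int archi_id, struct archi *arch) {
  assert(arch != NULL);
  switch (archi_id) {
  case CPU_MODEL(6, 158):
  case CPU_MODEL(6, 142):
    snprintf(arch->name, sizeof arch->name, "Kaby Lake micro arch");
    snprintf(arch->sampling_read_event, sizeof arch->sampling_read_event,
             "MEM_TRANS_RETIRED:LOAD_LATENCY:ldlat=3");
    snprintf(arch->sampling_write_event, sizeof arch->sampling_write_event,
             "MEM_INST_RETIRED:ALL_STORES");
    return 0;
  default:
    /* Unknown model: leave the event names empty so callers can detect it. */
    snprintf(arch->name, sizeof arch->name, "Unknown micro arch");
    arch->sampling_read_event[0] = '\0';
    arch->sampling_write_event[0] = '\0';
    return -1;
  }
}
```

Using sizeof on the destination array instead of a literal 256 keeps the bound correct if the buffer size ever changes.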

Getting the correct info

On the target machine, type:

less /proc/cpuinfo

This file contains info in the following form:

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 45
model name      : Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz
stepping        : 7
microcode       : 0x710
cpu MHz         : 1339.121
cache size      : 15360 KB
physical id     : 0
siblings        : 12
core id         : 0
cpu cores       : 6
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips        : 4599.76
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

Amongst this info, you are interested in the lines "cpu family" and "model". Using them, you can add a new case:

case CPU_MODEL(cpu_family, model):

In our case, we get

case CPU_MODEL(6, 45):

In the Intel documentation, this is written 06_2DH: the model number is given in hexadecimal (45 = 0x2D) and the trailing H stands for hexadecimal.
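
The lookup and the conversion to Intel's notation can also be done programmatically. The sketch below parses text in the /proc/cpuinfo format and formats the result as FAMILY_MODEL in hexadecimal. The function names cpuinfo_field and format_family_model are hypothetical and not part of numap.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Returns the integer value of the first "key : value" line in text,
 * or -1 if no such line exists. "model" does not match "model name"
 * because the key must be followed only by blanks and a colon. */
int cpuinfo_field(const char *text, const char *key) {
  assert(text != NULL && key != NULL);
  size_t klen = strlen(key);
  const char *line = text;
  while (line != NULL && *line != '\0') {
    if (strncmp(line, key, klen) == 0) {
      const char *p = line + klen;
      while (*p == ' ' || *p == '\t') p++;
      int value;
      if (*p == ':' && sscanf(p + 1, "%d", &value) == 1) return value;
    }
    const char *nl = strchr(line, '\n');
    line = nl ? nl + 1 : NULL;
  }
  return -1;
}

/* Writes e.g. "06_2DH" for family 6, model 45 into out. */
void format_family_model(int family, int model, char *out, size_t cap) {
  snprintf(out, cap, "%02X_%02XH", (unsigned)family, (unsigned)model);
}
```

To use it on a real machine, read /proc/cpuinfo into a buffer and pass that buffer as text.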

Now open the Intel documentation called "Intel 64 and IA-32 Architectures Software Developer's Manual" and search for the string FAMILY_MODEL (in our example, 06_2D). Among other places, this leads you to a section of chapter 19, which is called "Performance Monitoring Events". In our case, we find that 06_2DH is described in section 19.6, "PERFORMANCE MONITORING EVENTS FOR 2ND GENERATION INTEL® CORE™ I7-2XXX, INTEL® CORE™ I5-2XXX, INTEL® CORE™ I3-2XXX PROCESSOR SERIES".

In the table provided in this section, find the lines corresponding to the required info. In particular, in this example, we fill in the values for sampling_read_event and sampling_write_event. We leave out those for counting_read_event and counting_write_event.

sampling_read_event

For the sampling of memory reads, you need something like:

| Event Num. | Umask Value | Event Mask Mnemonic | Description | Comment |
|---|---|---|---|---|
| CDH | 01H | MEM_TRANS_RETIRED.LOAD_LATENCY | Randomly sampled loads whose latency is above a user defined threshold. A small fraction of the overall loads are sampled due to randomization. PMC3 only. | Specify threshold in MSR 3F6H. |

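sampling_write_event

For the sampling of memory writes, you need something like:

| Event Num. | Umask Value | Event Mask Mnemonic | Description | Comment |
|---|---|---|---|---|
| CDH | 02H | MEM_TRANS_RETIRED.PRECISE_STORE | Sample stores and collect precise store operation via PEBS record. PMC3 only. | See Section 18.9.4.3. |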

Filling up numap's struct archi for your machine

On some architectures, the event names given in the general documentation are INCORRECT. To get the correct name of the sampling_read_event, use the examples/showevtinfo program provided by numap. This program prints the list of available events.

For our example architecture, we find that the load-latency event is actually called LATENCY_ABOVE_THRESHOLD instead of LOAD_LATENCY. So be it!

Thus, we modify get_archi to add these lines:

  case CPU_MODEL(6, 45):
    snprintf(arch->name, 256, "Sandy Bridge micro arch");
    snprintf(arch->sampling_read_event, 256, "MEM_TRANS_RETIRED:LATENCY_ABOVE_THRESHOLD:ldlat=3");
    snprintf(arch->sampling_write_event, 256, "MEM_TRANS_RETIRED:PRECISE_STORE");
    break;

Testing

When this is done, go to numap's root directory and type

$ cmake .
$ make

Then try the example binary in examples:

$ examples/example

This program should output something looking like:

root@taurus-8 ~/numap:-)examples/example

Starting memory read sampling
Memory read sampling results

head = 192200 compared to max = 266240
Thread 0: 4805     samples
Thread 0: 4805     local cache 1                  100.000%
Thread 0: 0        local cache 2                  0.000%
Thread 0: 0        local cache 3                  0.000%
Thread 0: 0        local cache LFB                0.000%
Thread 0: 0        local memory                   0.000%
Thread 0: 0        remote cache or local memory   0.000%
Thread 0: 0        remote memory                  0.000%
Thread 0: 0        unknown l3 miss                0.000%

head = 193240 compared to max = 266240
Thread 1: 4831     samples
Thread 1: 4831     local cache 1                  100.000%
Thread 1: 0        local cache 2                  0.000%
Thread 1: 0        local cache 3                  0.000%
Thread 1: 0        local cache LFB                0.000%
Thread 1: 0        local memory                   0.000%
Thread 1: 0        remote cache or local memory   0.000%
Thread 1: 0        remote memory                  0.000%
Thread 1: 0        unknown l3 miss                0.000%

Starting memory write sampling
Memory write sampling results

head = 262112 compared to max = 266240
Thread 0: 6452     samples
Thread 0: 6442     local cache 1                  99.845%
Thread 0: 0        local cache 2                  0.000%
Thread 0: 0        local cache 3                  0.000%
Thread 0: 0        local cache LFB                0.000%
Thread 0: 0        local memory                   0.000%
Thread 0: 0        remote cache or local memory   0.000%
Thread 0: 0        remote memory                  0.000%
Thread 0: 0        unknown l3 miss                0.000%

head = 262136 compared to max = 266240
Thread 1: 6451     samples
Thread 1: 6436     local cache 1                  99.767%
Thread 1: 0        local cache 2                  0.000%
Thread 1: 0        local cache 3                  0.000%
Thread 1: 0        local cache LFB                0.000%
Thread 1: 0        local memory                   0.000%
Thread 1: 0        remote cache or local memory   0.000%
Thread 1: 0        remote memory                  0.000%
Thread 1: 0        unknown l3 miss                0.000%

Congrats, numap is set up for your machine!

Don't forget to push your modifications to GitHub, of course :)

numap's People

Contributors

clementfoyer, jklinkenberg, kwakwaouaite, madewink, manuelselva, trahay

numap's Issues

cpu support

I'm not familiar with CPU PMUs, so could you please add support for this CPU architecture? Here is the CPU information:
Architecture: x86_64
Byte Order: Little Endian
Vendor ID: GenuineIntel
CPU family: 6
Model: 87
Model name: Intel(R) Xeon Phi(TM) CPU 7210 @ 1.30GHz

The Intel documentation mentions it in section 18.14 (https://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-system-programming-manual-325384.html).
Here is another document: https://software.intel.com/en-us/articles/intel-xeon-phi-x200-family-processor-performance-monitoring-reference-manual
Based on Table 2-4 in https://software.intel.com/sites/default/files/managed/6e/3d/Intel%C2%AE%20Xeon%20Phi%E2%84%A2%20Processor%20Performance%20Monitoring%20Reference%20Manual_Vol2_Mar2017.pdf, I think the events are MEM_UOPS_RETIRED:ALL_LOADS and MEM_UOPS_RETIRED:ALL_STORES.
Thank you very much!

perf_event_open fails on Linux kernel < 4.1

When setting the parameters for perf_event_open, numap sets the use_clockid and clockid fields which were introduced in Linux kernel 4.1.

On older kernels, this makes the call to perf_event_open fail with the error code "Invalid argument".

We should detect this problem at compile time and, in case of an unsupported kernel:

  • make the compilation fail with an explicit message (e.g. "kernel too old")
  • or, print a warning and don't use this feature, but this may break a few things

bug

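The check is

if (len <= 0) {

but the declaration is unsigned, so len can never be negative:

size_t len;

snprintf returns how many characters would have been written, which can exceed the remaining space, so this subtraction can wrap around:

len -= snprintf(buf, len, "%s::%s", pinfo.name, info->name);

Later iterations of this loop could then write out of bounds:

for (u = 1; u < total; u++) {

Probably no impact, but I thought I'd open an issue either way.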
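
One possible way to avoid the wrap-around, sketched under the assumption that the code only needs to append formatted text to a fixed buffer: track the write offset instead of the remaining length, and compare the return value of vsnprintf with the room left before advancing. The buf_appendf helper is hypothetical and not part of numap.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Appends formatted text at buf + *off without ever writing past cap.
 * Returns 0 on success, -1 if the output was truncated or formatting
 * failed. The buffer stays NUL-terminated in both cases. */
int buf_appendf(char *buf, size_t cap, size_t *off, const char *fmt, ...) {
  assert(buf != NULL && off != NULL && *off < cap);
  size_t room = cap - *off; /* at least 1, thanks to the assert */
  va_list ap;
  va_start(ap, fmt);
  int n = vsnprintf(buf + *off, room, fmt, ap);
  va_end(ap);
  if (n < 0) return -1;
  if ((size_t)n >= room) {
    /* Truncated: vsnprintf wrote room - 1 characters plus the NUL. */
    *off = cap - 1;
    return -1;
  }
  *off += (size_t)n;
  return 0;
}
```

Because the offset never exceeds cap - 1, the unsigned subtraction cap - *off cannot wrap.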

No support for memory-only NUMA domains

Currently the code crashes when dealing with systems that also contain memory-only NUMA domains.

This code is really problematic:

    nb_numa_nodes = numa_num_configured_nodes();
    int nb_cpus = numa_num_configured_cpus();
    for (node = 0; node < nb_numa_nodes; node++) {
      struct bitmask *mask = numa_allocate_cpumask();
      numa_node_to_cpus(node, mask);
      numa_node_to_cpu[node] = -1;
      for (cpu = 0; cpu < nb_cpus; cpu++) {
        if (*(mask->maskp) & (1 << cpu)) {
          numa_node_to_cpu[node] = cpu;
          break;
        }
      }
      numa_bitmask_free(mask);
      if (numa_node_to_cpu[node] == -1) {
        nb_numa_nodes = -1; // to be handled properly
      }
    }

Especially problematic is that nb_numa_nodes is changed inside the loop that is still using it. Further, nb_numa_nodes is unsigned, so setting it to -1 will result in unexpected behavior.

Even if I solve this issue, numap still does not work, e.g. on the example.
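
A sketch of one way the mapping loop could tolerate memory-only nodes, written against plain word arrays so that it does not depend on libnuma: a node without CPUs is recorded as -1 and skipped, and the loop bound is never modified. The bit test also covers masks wider than one word, which the expression *(mask->maskp) & (1 << cpu) does not. The names first_set_bit and map_nodes_to_cpus are hypothetical; with libnuma, numa_bitmask_isbitset performs the equivalent per-bit test.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Returns the index of the first set bit among the first nbits bits,
 * or -1 if none is set. */
int first_set_bit(const unsigned long *words, size_t nbits) {
  assert(words != NULL);
  for (size_t bit = 0; bit < nbits; bit++) {
    unsigned long word = words[bit / BITS_PER_WORD];
    if (word & (1UL << (bit % BITS_PER_WORD))) return (int)bit;
  }
  return -1;
}

/* Fills node_to_cpu[node] with one CPU of each node, or -1 for
 * memory-only nodes. Returns the number of nodes that have a CPU. */
size_t map_nodes_to_cpus(const unsigned long *const *node_masks,
                         size_t nb_nodes, size_t nb_cpus,
                         int *node_to_cpu) {
  size_t usable = 0;
  for (size_t node = 0; node < nb_nodes; node++) {
    node_to_cpu[node] = first_set_bit(node_masks[node], nb_cpus);
    if (node_to_cpu[node] >= 0) usable++;
  }
  return usable;
}
```

Callers would then need to check for -1 before using an entry of node_to_cpu.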

Numap does not support AMD processors

AMD processors provide Instruction Based Sampling, which allows sampling the instructions executed by the CPU. It could be used for collecting memory information in numap.

If anyone is willing to port numap to AMD, I can give advice on numap. My main problem is the lack of time :)

Using PERF_EVENT_IOC_REFRESH

It is possible to use PERF_EVENT_IOC_REFRESH so that when the sample buffer is full, a signal is delivered. It would be very useful if numap had an option to enable this.

For instance, a callback could be passed to numap_sampling_init_measure. If a callback is passed, then numap enables PERF_EVENT_IOC_REFRESH and calls the callback each time the buffer is full.

Another solution (that would not break the API) would be to add a new function (e.g. numap_sampling_add_callback) to enable this feature.

data_src implements

I read this source code in numap:

    int is_served_by_local_cache2(union perf_mem_data_src data_src) {
      if (data_src.mem_lvl & PERF_MEM_LVL_HIT) {
        if (data_src.mem_lvl & PERF_MEM_LVL_L2) {
          return 1;
        }
      }
      return 0;
    }

But the perf man page (http://man7.org/linux/man-pages/man2/perf_event_open.2.html) says:

    mem_lvl
        Memory hierarchy level hit or miss, a bitwise combination of the following, shifted left by PERF_MEM_LVL_SHIFT:
        PERF_MEM_LVL_NA   Not available
        PERF_MEM_LVL_HIT  Hit

So I think the code should be like this:

    int is_served_by_local_cache2(union perf_mem_data_src data_src) {
      if ((data_src.mem_lvl >> PERF_MEM_LVL_SHIFT) & PERF_MEM_LVL_HIT) {
        if ((data_src.mem_lvl >> PERF_MEM_LVL_SHIFT) & PERF_MEM_LVL_L2) {
          return 1;
        }
      }
      return 0;
    }

But numap seems to run correctly. So did I misread the documentation?
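
One possible reading, offered as a sketch and not as an authoritative answer: in the Linux UAPI header linux/perf_event.h, mem_lvl is a bitfield of union perf_mem_data_src, and PERF_MEM_LVL_SHIFT describes where that field sits inside the raw 64-bit val member. Reading the field through data_src.mem_lvl would therefore already yield unshifted flags, which would explain why the first version works. The make_data_src helper below is hypothetical and assumes a Linux system with the kernel headers installed.

```c
#include <assert.h>
#include <linux/perf_event.h>

/* Same test as the first version quoted above: no extra shift. */
int is_served_by_local_cache2(union perf_mem_data_src data_src) {
  if (data_src.mem_lvl & PERF_MEM_LVL_HIT) {
    if (data_src.mem_lvl & PERF_MEM_LVL_L2) {
      return 1;
    }
  }
  return 0;
}

/* Hypothetical helper: builds a raw sample value the way the man page
 * describes it, with the level flags shifted left by
 * PERF_MEM_LVL_SHIFT inside the 64-bit word. */
union perf_mem_data_src make_data_src(__u64 lvl_flags) {
  assert((lvl_flags >> 14) == 0); /* mem_lvl is a 14-bit field */
  union perf_mem_data_src d;
  d.val = lvl_flags << PERF_MEM_LVL_SHIFT;
  return d;
}
```

With this layout, a raw value built from PERF_MEM_LVL_HIT | PERF_MEM_LVL_L2 is reported as an L2 hit, while shifting mem_lvl a second time would discard the low flag bits.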

Can't run example

I tried to run the example on VMware with kernel version (Linux version 4.15.0-38-generic (buildd@lcy01-amd64-023) (gcc version 7.3.0 (Ubuntu 7.3.0-16ubuntu3)) #41-Ubuntu SMP Wed Oct 10 10:59:38 UTC 2018) and CPU (Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz).
I get this error:
Starting memory read sampling -> numap_sampling_start error : perf_event ==> Operation not supported

Then I tried to run the example on a server with kernel version (Linux version 3.10.0-327.36.3.el7.x86_64 ([email protected]) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) ) #1 SMP Mon Oct 24 16:09:20 UTC 2016) and CPU (Intel(R) Xeon Phi(TM) CPU 7210 @ 1.30GHz).
There I get this error:
Segmentation fault
