aws / aws-graviton-getting-started

Helping developers to use AWS Graviton2, Graviton3, and Graviton4 processors which power the 6th, 7th, and 8th generation of Amazon EC2 instances (C6g[d], M6g[d], R6g[d], T4g, X2gd, C6gn, I4g, Im4gn, Is4gen, G5g, C7g[d][n], M7g[d], R7g[d], R8g).

Home Page: https://aws.amazon.com/ec2/graviton/

License: Other

graviton2 arm graviton3 ec2 graviton3e graviton graviton4

aws-graviton-getting-started's Introduction

AWS Graviton Technical Guide

This repository provides technical guidance for users and developers using Amazon EC2 instances powered by AWS Graviton processors (including the latest generation Graviton4 processors). While it calls out specific features of the Graviton processors themselves, this repository is also generally useful for anyone running code on Arm-based systems.

Contents

Transitioning to Graviton

If you are new to Graviton and want to understand how to identify target workloads, how to plan a transition project, how to test your workloads on AWS Graviton, and finally how to deploy in production, please read the key considerations to take into account when transitioning workloads to AWS Graviton based Amazon EC2 instances.

Building for Graviton

| Processor | Graviton2 | Graviton3(E) | Graviton4 |
|---|---|---|---|
| Instances | M6g/M6gd, C6g/C6gd/C6gn, R6g/R6gd, T4g, X2gd, G5g, and I4g/Im4gn/Is4gen | C7g/C7gd/C7gn, M7g/M7gd, R7g/R7gd, and Hpc7g | R8g |
| Core | Neoverse-N1 | Neoverse-V1 | Neoverse-V2 |
| Frequency | 2500MHz | 2600MHz | 2800MHz (2700MHz for 48xlarge) |
| Turbo supported | No | No | No |
| Software Optimization Guide (Instruction Throughput and Latency) | SWOG | SWOG | SWOG |
| Interconnect | CMN-600 | CMN-650 | CMN-700 |
| Architecture revision | ARMv8.2-a | ARMv8.4-a | Armv9.0-a |
| Additional features | fp16, rcpc, dotprod, crypto | sve, rng, bf16, int8 | sve2, sve-int8, sve-bf16, sve-bitperm, sve-crypto |
| Recommended -mcpu flag (more information; see the example below) | neoverse-n1 | neoverse-512tvb | neoverse-512tvb |
| RNG Instructions | No | Yes | Yes |
| SIMD instructions | 2x Neon 128bit vectors | 4x Neon 128bit vectors / 2x SVE 256bit | 4x Neon/SVE 128bit vectors |
| LSE (atomic mem operations) | yes | yes | yes |
| Pointer Authentication | no | yes | yes |
| Branch Target Identification | no | no | yes |
| Cores | 64 | 64 | 96 per socket (192 for 2-socket 48xlarge) |
| L1 cache (per core) | 64kB inst / 64kB data | 64kB inst / 64kB data | 64kB inst / 64kB data |
| L2 cache (per core) | 1MB | 1MB | 2MB |
| LLC (shared) | 32MB | 32MB | 36MB |
| DRAM | 8x DDR4 | 8x DDR5 | 12x DDR5 |
| DDR Encryption | yes | yes | yes |
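
As a quick illustration of the recommended -mcpu flags in the table above, a minimal set of compile lines might look like the following (myapp.c, the output name, and the -O2 level are placeholders, not something prescribed by this guide):

# Graviton2 (Neoverse-N1)
gcc -O2 -mcpu=neoverse-n1 -o myapp myapp.c

# Graviton3 and Graviton4
gcc -O2 -mcpu=neoverse-512tvb -o myapp myapp.c

# Or, when building directly on the target instance, let GCC detect the host CPU
gcc -O2 -mcpu=native -o myapp myapp.c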

Optimizing for Graviton

Please refer to optimizing for general debugging and profiling information. For detailed checklists on optimizing and debugging performance on Graviton, see our performance runbook.

Different architectures and systems have differing capabilities, which means some tools you might be familiar with on one architecture don't have an equivalent on AWS Graviton. We have documented some of these utilities in Monitoring Tools.

Recent software updates relevant to Graviton

There is a huge amount of activity in the Arm software ecosystem and improvements are being made on a daily basis. As a general rule, later versions of compilers, language runtimes, and applications should be used whenever possible. The table below includes known recent changes to popular packages that improve performance (if you know of others, please let us know).

| Package | Version | Improvements |
|---|---|---|
| bazel | 3.4.1+ | Pre-built bazel binary for Graviton/Arm64. See below for installation. |
| Cassandra | 4.0+ | Supports running on Java/Corretto 11, improving overall performance. |
| FFmpeg | 6.0+ | Improved performance of libswscale by 50% with better NEON vectorization, which improves the performance and scalability of FFmpeg multi-threaded encoders. The changes are available in FFmpeg version 4.3, with further improvements to scaling and motion estimation available in 5.1, and additional improvements to both in 6. For encoding h.265, build with the master branch of x265 because the released version 3.5 does not include important optimizations for Graviton. For more information about FFmpeg on Graviton, read the post on the AWS Open Source Blog, Optimized Video Encoding with FFmpeg on AWS Graviton Processors. |
| HAProxy | 2.4+ | A serious bug was fixed. Additionally, building with CPU=armv81 improves HAProxy performance by 4x, so please rebuild your code with this flag. |
| MariaDB | 10.4.14+ | Default build now uses -moutline-atomics; general correctness bugs for Graviton fixed. |
| mongodb | 4.2.15+ / 4.4.7+ / 5.0.0+ | Improved performance on Graviton, especially for the internal JS engine. LSE support added in SERVER-56347. |
| MySQL | 8.0.23+ | Improved spinlock behavior; compiled with -moutline-atomics if the compiler supports it. |
| PostgreSQL | 15+ | General scalability improvements plus additional improvements to spin-locks specifically for Arm64. |
| .NET | 5+ | .NET 5 significantly improved performance for ARM64. Here's an associated AWS Blog with some performance results. |
| OpenH264 | 2.1.1+ | Pre-built Cisco OpenH264 binary for Graviton/Arm64. |
| PCRE2 | 10.34+ | Added NEON vectorization to PCRE's JIT to match first and pairs of characters. This may improve matching performance by up to 8x. This fixed version of the library is now shipping with Ubuntu 20.04 and PHP 8. |
| PHP | 7.4+ | PHP 7.4 includes a number of performance improvements that increase performance by up to 30%. |
| pip | 19.3+ | Enables installation of python wheel binaries on Graviton. |
| PyTorch | 2.0+ | Optimized inference latency and throughput on Graviton. AWS DLCs and python wheels are available. |
| ruby | 3.0+ | Enables arm64 optimizations that improve performance by as much as 40%. These changes have also been back-ported to the Ruby shipping with Amazon Linux 2, Fedora, and Ubuntu 20.04. |
| Spark | 3.0+ | Supports running on Java/Corretto 11, improving overall performance. |
| zlib | 1.2.8+ | For the best performance on Graviton please use zlib-cloudflare. |

Containers on Graviton

You can run Docker, Kubernetes, Amazon ECS, and Amazon EKS on Graviton. Amazon ECR supports multi-arch containers. Please refer to containers for information about running container-based workloads on Graviton.

AWS Lambda now allows you to configure new and existing functions to run on Arm-based AWS Graviton2 processors in addition to x86-based functions. Using this processor architecture option allows you to get up to 34% better price performance. Duration charges are 20 percent lower than the current pricing for x86 with millisecond granularity. This also applies to duration charges when using Provisioned Concurrency. Compute Savings Plans supports Lambda functions powered by Graviton2.

The Lambda page highlights some of the migration considerations and also provides some simple to deploy demos you can use to explore how to build and migrate to Lambda functions using Arm/Graviton2.

Operating Systems

Please check os.md for more information about which operating system to run on Graviton based instances.

Known issues and workarounds

Postgres

Postgres performance can be heavily impacted by not using LSE. Today, postgres binaries from distributions (e.g. Ubuntu) are not built with -moutline-atomics or -march=armv8.2-a which would enable LSE. Note: Amazon RDS for PostgreSQL isn't impacted by this.

In November 2021 PostgreSQL started to distribute Ubuntu 20.04 packages optimized with -moutline-atomics. For Ubuntu 20.04, we recommend using the PostgreSQL PPA instead of the packages distributed by Ubuntu Focal. Please follow the instructions to set up the PostgreSQL PPA.
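
As a rough sketch of what building with LSE-enabling flags can look like when compiling PostgreSQL from source (the install prefix and -O2 level are illustrative choices, not the flags used by the PPA packages):

# -moutline-atomics keeps a single binary portable while still using LSE at run time on Graviton
CFLAGS="-O2 -moutline-atomics" ./configure --prefix=$HOME/pgsql
make && make install

# alternatively, target the Graviton2 baseline directly (the binary then requires an Armv8.2 CPU)
CFLAGS="-O2 -march=armv8.2-a" ./configure --prefix=$HOME/pgsql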

Python installation on some Linux distros

The default installation of pip on some Linux distributions is too old (<19.3) to install binary wheel packages released for Graviton. To work around this, it is recommended to upgrade your pip installation using:

sudo python3 -m pip install --upgrade pip

Bazel on Linux

The Bazel build tool now releases a pre-built binary for arm64. As of October 2020, this is not available in their custom Debian repo, and Bazel does not officially provide an RPM. Instead, we recommend using the Bazelisk installer, which will replace your bazel command and keep bazel up to date.

Below is an example using the latest Arm binary release of Bazelisk as of October 2020:

wget https://github.com/bazelbuild/bazelisk/releases/download/v1.7.1/bazelisk-linux-arm64
chmod +x bazelisk-linux-arm64
sudo mv bazelisk-linux-arm64 /usr/local/bin/bazel
bazel

Bazelisk itself should not require further updates, as its only purpose is to keep Bazel updated.

zlib on Linux

Linux distributions, in general, use the original zlib without any optimizations. zlib-cloudflare has been updated to provide better and faster compression on Arm and x86. To use zlib-cloudflare:

git clone https://github.com/cloudflare/zlib.git
cd zlib
./configure --prefix=$HOME
make
make install

Make sure to have the full path to your lib at $HOME/lib in /etc/ld.so.conf and run ldconfig.

For users of OpenJDK, which is dynamically linked to the system zlib, you can set LD_LIBRARY_PATH to point to the directory where your newly built version of zlib-cloudflare is located or load that library with LD_PRELOAD.
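
For example, assuming zlib-cloudflare was installed under $HOME as shown above (app.jar is a placeholder for your application), either of the following should work:

# Option 1: put the rebuilt libz first on the dynamic linker search path
LD_LIBRARY_PATH=$HOME/lib java -jar app.jar

# Option 2: force-load the rebuilt libz ahead of the system one
LD_PRELOAD=$HOME/lib/libz.so.1 java -jar app.jar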

You can check the libz that JDK is dynamically linked against with:

$ ldd /Java/jdk-11.0.8/lib/libzip.so | grep libz
libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007ffff7783000)

Currently, users of Amazon Corretto cannot link against zlib-cloudflare.

Blog Posts

HPC

Machine Learning

Other

Case Studies

HPC

Other

Additional resources

Feedback? [email protected]

aws-graviton-getting-started's People

Contributors

adamtylerlynch, agsaidi, arthurpetitpierre, awsjswinney, awsnb, csoma, ddxplague, geoffreyblake, hahnarah, janaknat, jeffunderhill, julianmarcusschroeder1, juntangc, lizthegrey, lrbison, marioswa, mattsplats, mnaithy, nluckenbill, otterley, prekageo, ramamalladiaws, salvatoredipietro, scottmalkie, sethfox, snadampal, tohwsw, wash-amzn, yd2102, zlim


aws-graviton-getting-started's Issues

Chrome missing for arm64

Nafea Bshara said to post here if we had any issues. It looks like Google only releases Chrome for Linux on the amd64 architecture, not arm64; only Chromium is available on arm64. We will see if we can use Chromium as part of our application, but I thought I would highlight this gap. Others may have the same issue.

sysstat version installed by perfrunbook for AL2 is too old

AL2 uses sysstat v10.1.5, but the perfrunbook uses features from later versions of sysstat. Perfrunbook needs to automate installing the proper version of sysstat on AL2. Trying to use --stat all-irqs fails on AL2 for ./measure_and_plot_basic_sysstat_stats.py.

rocksdb performance benchmark on graviton3: -march is better than -mcpu

I took a look at some documents and slides from AWS and Arm, for example:
https://d1.awsstatic.com/events/Summits/reinvent2022/CMP325_Coding-for-multiple-CPU-architectures-using-lessons-learned-in-HPC.pdf


So it looks like, if I know the CPU of my system, performance should be better if I use -mcpu instead of -march.

But I am testing rocksdb, and its default build includes -march=armv8-a+crc+crypto. I changed it to -mcpu=neoverse-512tvb+crc+crypto, which gives me worse numbers.

Results

-march=armv8-a+crc+crypto

3896769 Op/s

-mcpu=neoverse-512tvb+crc+crypto

[screenshot of results]

OS distribution and System Info

Ubuntu 20.04.6 LTS

exact commands I used for testing rocksdb

install phoronix-test-suite

git clone https://github.com/phoronix-test-suite/phoronix-test-suite.git
sudo ./install-sh
You might need to install other dependencies, such as:

sudo apt install php7.4-cli
sudo apt-get install php-xml

execute benchmark

phoronix-test-suite benchmark rocksdb

I am using option 5, Read While Writing.

Then just wait for the results

gcc with -m64 flag -- unknown error

I am trying to build a forked leveldb that has the gcc flag -m64. It could not compile because the flag is unknown, but the same code compiles on an M1 MacBook Air. Will it be included in the next build? Otherwise it would be a major change across multiple repos.

GCCv13.1 not taking the recommended flag

Hello,
I'm trying to optimize for Gv3E, but GCC (v13.1) is not accepting the recommended flag:

f951: Error: unknown value ‘neoverse-512tvb’ for ‘-march’
f951: note: valid arguments are: armv8-a armv8.1-a armv8.2-a armv8.3-a armv8.4-a armv8.5-a armv8.6-a armv8.7-a armv8.8-a armv8-r armv9-a armv9.1-a armv9.2-a armv9.3-a native
f951: note: did you mean ‘-mcpu=neoverse-512tvb’?
make: *** [Makefile:274: m3utilio.o] Error 1

I'm not sure if simply using -march=armv8.4-a will optimize for SVE. Is there any other recommendation?
Thanks.
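
For what it's worth, the compiler note in the output above suggests the value was passed via -march rather than -mcpu; neoverse-512tvb is only a valid -mcpu value. A minimal sketch of the two invocations (the Fortran file name is a placeholder inferred from the Makefile target):

# accepted: neoverse-512tvb is a -mcpu value, not a -march value
gfortran -c -O2 -mcpu=neoverse-512tvb m3utilio.f90

# also accepted: enables SVE via the architecture, but without the core-specific tuning
gfortran -c -O2 -march=armv8.4-a+sve m3utilio.f90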

PHP80 not included in amzn2-extras.repo

PHP 8.0 is not included in amzn2-extras.repo, which makes it hard for users to install PHP 8.0 on Amazon Linux 2 and enjoy the performance improvements it brings.

OpenMPI is built against SGE

In HPC/scripts-setup/2a-install-openmpi-with-acfl.sh and HPC/README.md we are building OpenMPI with the SGE plugin. However, we are using AWS ParallelCluster, which has Slurm installed.

Does Graviton2 support RNG instructions?

This guide says that both Graviton2 and 3 support RNG instructions, but it seems that only Graviton3 supports RNG.
https://github.com/aws/aws-graviton-getting-started/blob/main/README.md?plain=1#L46

|RNG Instructions	|Yes	|Yes	|

Executing the lscpu command on a Graviton2 instance, the rng feature isn't flagged, and rng-tools doesn't use the RNG instruction.

Graviton2 (c6g.medium) results

$ sudo rngd -f
Initializing available sources
[hwrng ]: Initialization Failed
[rndr  ]: No HW SUPPORT
[rndr  ]: Initialization Failed
[jitter]: Initializing AES buffer
[jitter]: Enabling JITTER rng support
[jitter]: Initialized
[pkcs11]: PKCS11 Engine /usr/lib64/opensc-pkcs11.so Error: No such file or directory
[pkcs11]: Initialization Failed
[jitter]: Shutting down

$ lscpu
Architecture:           aarch64
  CPU op-mode(s):       32-bit, 64-bit
  Byte Order:           Little Endian
CPU(s):                 1
  On-line CPU(s) list:  0
Vendor ID:              ARM
  Model name:           Neoverse-N1
    Model:              1
    Thread(s) per core: 1
    Core(s) per socket: 1
    Socket(s):          1
    Stepping:           r3p1
    BogoMIPS:           243.75
    Flags:              fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
Caches (sum of all):
  L1d:                  64 KiB (1 instance)
  L1i:                  64 KiB (1 instance)
  L2:                   1 MiB (1 instance)
  L3:                   32 MiB (1 instance)
NUMA:
  NUMA node(s):         1
  NUMA node0 CPU(s):    0
Vulnerabilities:
  Itlb multihit:        Not affected
  L1tf:                 Not affected
  Mds:                  Not affected
  Meltdown:             Not affected
  Mmio stale data:      Not affected
  Retbleed:             Not affected
  Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:           Mitigation; __user pointer sanitization
  Spectre v2:           Mitigation; CSV2, BHB
  Srbds:                Not affected
  Tsx async abort:      Not affected

Graviton3(c7g.medium) results

$ sudo rngd -f
Initializing available sources
[hwrng ]: Initialization Failed
[rndr  ]: Enabling aarch64 RNDR rng support
[rndr  ]: Initialized
[jitter]: Initializing AES buffer
[jitter]: Enabling JITTER rng support
[jitter]: Initialized
[pkcs11]: PKCS11 Engine /usr/lib64/opensc-pkcs11.so Error: No such file or directory
[pkcs11]: Initialization Failed
[rndr  ]: Shutting down
[jitter]: Shutting down

$ lscpu
Architecture:          aarch64
  CPU op-mode(s):      32-bit, 64-bit
  Byte Order:          Little Endian
CPU(s):                1
  On-line CPU(s) list: 0
Vendor ID:             ARM
  Model:               1
  Thread(s) per core:  1
  Core(s) per socket:  1
  Socket(s):           1
  Stepping:            r1p1
  BogoMIPS:            2100.00
  Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc                        flagm ssbs paca pacg dcpodp svei8mm svebf16 i8mm bf16 dgh rng
Caches (sum of all):
  L1d:                 64 KiB (1 instance)
  L1i:                 64 KiB (1 instance)
  L2:                  1 MiB (1 instance)
  L3:                  32 MiB (1 instance)
NUMA:
  NUMA node(s):        1
  NUMA node0 CPU(s):   0
Vulnerabilities:
  Itlb multihit:       Not affected
  L1tf:                Not affected
  Mds:                 Not affected
  Meltdown:            Not affected
  Mmio stale data:     Not affected
  Retbleed:            Not affected
  Spec store bypass:   Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:          Mitigation; __user pointer sanitization
  Spectre v2:          Mitigation; CSV2, BHB
  Srbds:               Not affected
  Tsx async abort:     Not affected

graviton gcc flags break pip install

I'm running into trouble with C-dependent python packages, specifically psutil on Amazon Linux 2 x2gd.large

It seems like it was meant to run gcc -march=armv8.2-a+fp16+rcpc+dotprod+crypto _some_target_, but an extra space sneaked in and the whole .2-a+fp16+rcpc etc. part fell off. Does anyone know how to get around this?

  Command ['/home/ec2-user/.cache/pypoetry/virtualenvs/env-FCAoHQDC-py3.9/bin/pip', 'install', '--no-deps', 'file:///home/ec2-user/.cache/pypoetry/artifacts/b1/56/ea/8f76ac8e3267dcd0b251a107803fa457585315936bb70ef39c1775c7d1/psutil-5.9.0.tar.gz'] errored with the following return code 1, and output:
  Processing /home/ec2-user/.cache/pypoetry/artifacts/b1/56/ea/8f76ac8e3267dcd0b251a107803fa457585315936bb70ef39c1775c7d1/psutil-5.9.0.tar.gz
    Preparing metadata (setup.py): started
    Preparing metadata (setup.py): finished with status 'done'
  Building wheels for collected packages: psutil
    Building wheel for psutil (setup.py): started
    Building wheel for psutil (setup.py): finished with status 'error'
    error: subprocess-exited-with-error

    × python setup.py bdist_wheel did not run successfully.
    │ exit code: 1
    ╰─> [10 lines of output]
        running bdist_wheel
        running build
        running build_py
        running build_ext
        building 'psutil._psutil_linux' extension
        gcc: error: .2-a+fp16+rcpc+dotprod+crypto: No such file or directory
        gcc: error: .2-a+fp16+rcpc+dotprod+crypto: No such file or directory
        gcc: error: unrecognized command line option ‘-n1’; did you mean ‘-n’?
        gcc: error: unrecognized command line option ‘-n1’; did you mean ‘-n’?
        error: command '/usr/bin/gcc' failed with exit code 1
        [end of output]
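
One way to see where the mangled flag is coming from is to inspect the compile flags recorded in the distro's Python build configuration (a diagnostic sketch only, not a fix):

python3 -c "import sysconfig; print(sysconfig.get_config_var('CFLAGS'))"
# or dump all config variables and search for the broken -march fragment
python3 -m sysconfig | grep march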

[Python] Numbering of Python guide is off

The numbering of titles under Python on Graviton guide is off.

  1. Installing Python packages
  2. Installing Python packages
  3. Machine Learning Python packages
  4. Other Python packages

Since the guides for other languages don't use numbering, is it better to just remove the numbering for this guide as well to make it more consistent with the rest?

There is one minor point that I notice:

I am open to creating a PR to update the doc if that's possible.

Thanks,

c6g specifics within machinelearning/pytorch.md

While machinelearning/pytorch.md has specifics for Graviton3-based instance types, having details specific to Graviton2 would help customers using the c6g instance types available on the 1U form factor of Outposts servers. Graviton3 is not available for Outposts servers, so customers with ML inference use cases that need to deploy to "the edge" and choose Outposts servers need optimization details for Graviton2.

I did a little research and found that the following might be applicable, but I'm hoping experts can weigh in.
export DNNL_DEFAULT_FPMATH_MODE=ANY

ECS Exit code: 132 on Graviton3

I am running my app, written in Rust, on ECS backed by Graviton3 EC2 instances. It runs as expected.

After my attempt to optimize compilation with processor-specific settings, RUSTFLAGS="-C target-cpu=neoverse-v1 -C target-feature=+sve2,+lse", the app starts, even prints some welcome output, and then stops with Exit code: 132.

I started a separate EC2 instance with a Graviton3 processor, installed Docker there, and the same app with the same Dockerfile works like a charm.

Meanwhile, if I switch neoverse-v1 (which is supposed to be supported by Graviton3) to neoverse-n1, it works on ECS.

Is there some additional tuning required to use +sve2,+lse on ECS?
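
Exit code 132 is 128 + 4, i.e. the process was killed by SIGILL (illegal instruction), so one thing worth checking is whether the host the ECS task landed on actually exposes the features the binary was built for. A minimal check from inside the running task (a sketch; /proc/cpuinfo reflects the host CPU):

grep -m1 '^Features' /proc/cpuinfo | tr ' ' '\n' | grep -E '^(sve|sve2|atomics)$'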

perf rpm missing libperf-jvmti.so

I'm trying to analyze the performance of an application written in Scala running on Graviton2 using the Profiling Java applications guide.

However, there is no libperf-jvmti.so in the perf rpm.
Is there any way to install libperf-jvmti.so?

AMI: amzn2-ami-hvm-2.0.20210525.0-arm64-gp2
Instance Type: t4g.large
perf package: perf-4.14.232-176.381.amzn2.aarch64

Thank you.

Latency Improvement numbers are not matching for common models like BERT when benchmarking graviton vs intel

We have started benchmarking latency for inference on common huggingface models like BERT, DistilBERT. We are running two pods, one on graviton node (c7g.4xlarge) and another on intel node (c6i.8xlarge). Both have 1 CPU as request and limit, 6GB of memory as request and limit. For same RPS, we are getting around 20-30% less latency on graviton pod compared to Intel. Standard benchmarking practices are followed to keep both setups identical in all other aspects.

Some details how we created graviton pod.

  1. We used the base image "armswdev/pytorch-arm-neoverse:r22.10-torch-1.12.0-onednn-acl" as mentioned in https://github.com/aws/aws-graviton-getting-started/blob/main/machinelearning/pytorch.md. We install all the needed dependencies which are compatible with aarch64 architecture and are recommended version here https://github.com/aws/aws-graviton-getting-started/. More specifically the versions of common dependencies are:

    a) numpy==1.21.5
    b) pandas==1.1.3
    c) transformers==4.23.1
    d) scikit-learn==1.1.3
    e) Keras==2.6.0

  2. We followed the steps in this guide https://github.com/aws-samples/aws-graviton-ml-inference-armnn-example/tree/main/src/Dockerfile to install "pyarmnn" version 30.0.0

  3. We enabled the flag export DNNL_DEFAULT_FPMATH_MODE=BF16. Huge pages and OMP_NUM_THREADS are not updated, since these are not present in the Intel setup either, to keep the basic numbers comparison fair (since the first step is to see the performance gain we get from bf16 and the other hardware improvements introduced in Graviton).


As suggested in https://github.com/aws/aws-graviton-getting-started/blob/main/machinelearning/pytorch.md, we turned on debug mode logging using the DNNL_VERBOSE and OMP_DISPLAY_ENV variables. Upon investigating the logs, it looks like the ACL GEMM kernels are not being used; instead the C++ reference implementation is being used.
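
The debug logging mentioned above can be switched on with environment variables along these lines (a sketch; values taken from the oneDNN and OpenMP documentation):

export DNNL_VERBOSE=1            # log which oneDNN kernels (ACL gemm vs. reference) get dispatched
export OMP_DISPLAY_ENV=VERBOSE   # print the OpenMP runtime configuration at startup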


Required libraries are installed on the system, namely "pyarmnn", "armnn-latest-cpu", and "python3-pyarmnn".
Similar logs are printed when we run our in-house model pipeline and when we run the BERT benchmark using MLPerf as suggested in the getting started guide.
Attached is a screenshot of the logs in the production system, and the notebook graviton_basic_exp.txt shows the basic steps taken to run the BERT benchmark using MLPerf.


When trying to use the pyarmnn library to load and run inference with a ResNet model on top of the above PyTorch Docker image, in order to see whether it logs ACL GEMM kernels being used, the notebook kernel exits abruptly, indicating a runtime issue with the setup. Steps and code were taken from the aws-samples repo here: https://github.com/aws-samples/aws-graviton-ml-inference-armnn-example/tree/main/src. Code for the same is included in pyarmnn_exp.txt.

Please help us understand if some config/setup is missing. At the end of the day, we want the latency numbers on Graviton, after using the bf16 ACL GEMM kernels, to be better than Intel as claimed at AWS re:Invent.
Thanks.

MongoDB - building from source still required?

Is the note in https://github.com/aws/aws-graviton-getting-started/blob/main/README.md#recent-software-updates-relevant-to-graviton regarding mongodb and LSE locks still valid, given what looks like a backport in 4.9.0, 4.2.13, and 4.4.5?

https://jira.mongodb.org/browse/SERVER-51722

It looks like 4.4.7+ may have LSE enabled due to the use of an upgraded compiler.
https://jira.mongodb.org/browse/SERVER-56237

The 5.x builds look to target armv8.2-a by default anyway:
https://jira.mongodb.org/browse/SERVER-56347

Lambda Graviton is slower than X86 for FFT

Setting

  1. numpy==1.22.1
  2. function numpy.fft.fft()
  3. handler test running time of this
  4. lambda 10240MB RAM, 90 seconds timeout, python 3.8, same ecr based image
    """
    sig = [np.random.randint(0, 1000, (4098, 600)) for k in range(4)]
    for x in sig:
        np.fft.fft(x, axis=0)
    """
  5. result
  • on average Lambda X86 takes 700ms, and Lambda Graviton takes 832ms
  • when using multi-thread, Lambda X86 takes 176ms, Lambda Graviton takes 210 ms

"""
import json
import numpy as np
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime

def single_thread_fft(sig):
"""
normal fft
"""
start_time = datetime.now()
for x in sig:
np.fft.fft(x, axis=0)
end_time = datetime.now()
delta_time = end_time.timestamp() - start_time.timestamp()
print("single thread running time {0} ms".format(delta_time * 1000))
return delta_time

def multi_thread_fft(sig):
"""
thread fft
"""
start_time = datetime.now()
with ThreadPoolExecutor(max_workers=4) as executor:
for x in sig:
executor.submit(np.fft.fft, x, axis=0)
end_time = datetime.now()
delta_time = end_time.timestamp() - start_time.timestamp()
print("multi thread running time {0} ms".format(delta_time * 1000))
return delta_time

def lambda_handler(event, context):
"""
Lambda handler
"""
# signal for one channel
sig = [np.random.randint(0, 1000, (4098, 600)) for k in range(4)]
# single thread
single_thread_time = single_thread_fft(sig)
# multi thread
multi_thread_time = multi_thread_fft(sig)
# response
return {
'statusCode': 200,
'headers': {
"Access-Control-Allow-Origin": "",
"Access-Control-Allow-Headers": "Content-Type",
"Access-Control-Allow-Methods": "OPTIONS,GET"
},
'body': json.dumps({'single thread': "{0}, multi thread: {1}".format(single_thread_time * 1000, multi_thread_time
1000)},
indent=4,
sort_keys=True,
default=str)
}
"""

Pytorch benchmark for Resnet50 with batch size of 1 on Graviton3 worse than on Intel

I ran the Resnet50 benchmark using the instructions provided in this repo here. I ran tests on M7g.large and M7i.large instances to compare performance between Graviton3 and Intel. Results are available here.

The results show G3 performing ~47% worse than intel on Resnet50 when the batch size is 1 (G3 does better than intel with other models and batch sizes).

Also worth noting, the instructions to run the test in this repo (linked above) didn't work for me. I had to change:
python3 run.py Resnet50 -d cpu -m jit -t eval --use_cosine_similarity --bs 32
to
python3 run.py Resnet50 -d cpu -t eval --use_cosine_similarity --bs 32

(I had to remove the -m jit flag because it was unrecognized)

Palo Alto Networks/Prisma Cloud support

Prisma Cloud cannot scan arm64 images or run as a Defender sidecar for them. While there are alternatives for customers who are still evaluating tools in this space, it would be great to push Palo Alto to support arm64 for customers who are wedded to this vendor.

Cannot run measure_and_plot_basic_sysstat_stats.py

 python3 ./measure_and_plot_basic_sysstat_stats.py --stat cpu-iowait --time 60
Failed to measure statistics with sar.
Please check that sar is installed using install_perfrunbook_dependencies.sh and is in your PATH
Traceback (most recent call last):
  File "/home/ec2-user/aws-graviton-getting-started/perfrunbook/utilities/./measure_and_plot_basic_sysstat_stats.py", line 208, in <module>
    plot(text, stat)
  File "/home/ec2-user/aws-graviton-getting-started/perfrunbook/utilities/./measure_and_plot_basic_sysstat_stats.py", line 117, in plot_cpu
    df = parse_sar(ParseCpuTime, buf)
  File "/home/ec2-user/aws-graviton-getting-started/perfrunbook/utilities/./measure_and_plot_basic_sysstat_stats.py", line 94, in parse_sar
    line = buf.readline()
AttributeError: 'NoneType' object has no attribute 'readline'

I have installed the dependencies and I can execute sar.

I use AL2 (ami-084237e82d7842286 in us-east-1)

[golang] Graviton2 lambda has problem running CGO enabled code

Code as below:

package main

import (
	"context"
	"fmt"

	"github.com/aws/aws-lambda-go/lambda"
)

func main() {
	lambda.Start(LambdaHandler)
}

func LambdaHandler(ctx context.Context) error {
	fmt.Println("hello, world")
	return nil
}

Build with CGO_ENABLED=1 go build -race -o bootstrap . on a Graviton2 EC2 instance (you can cross-compile, but this makes things less complicated). Running bootstrap on the EC2 instance with the help of RIE (https://docs.aws.amazon.com/lambda/latest/dg/images-test.html) works without problem; however, if I run it as a Lambda with provided.al2 + arm64, it reports:

FATAL: ThreadSanitizer: unsupported VMA range
FATAL: Found 39 - Supported 48

I know this may be a Go-specific problem, but I need to prepare steps to reproduce this before I report it to the Go community, and we need a way to reproduce the problem on EC2 where people can debug it.

I've opened a support ticket with AWS (10781433961) asking what the difference is between an EC2 instance and a Lambda when both are Graviton2 based. While the ticket is being worked on, I think I'd better share the issue here as well to get more eyes on it, both for a potential solution and for awareness.

Floating Point Performance

I have been looking at the floating point performance of Graviton3.
Assuming a 2.6GHz clock and two 256-bit SVE pipelines per core, I'd expect 2.6*8*2*2=83.2 GFLOPS peak per core when running FP32 workloads.

The tested workload is just a loop over 30 SVE FMA instructions:

        .text
        .type peak_sve_fmla_sp, %function
        .global peak_sve_fmla_sp
        /*
         * Microbenchmark measuring achievable performance using fmla instructions.
         * Repeats 30 independent SP SVE FMAs.
         * @param x0 number of repetitions.
         * @return number of flops per iteration.
         */ 
peak_sve_fmla_sp:
        // set predicate register
        ptrue p0.b

        // PCS: save required data in SIMD registers to stack
        stp  d8,  d9, [sp, #-16]!
        stp d10, d11, [sp, #-16]!
        stp d12, d13, [sp, #-16]!
        stp d14, d15, [sp, #-16]!

        // set SIMD registers to zero
        eor z0.d, z0.d, z0.d
        eor z1.d, z1.d, z1.d
        eor z2.d, z2.d, z2.d
        eor z3.d, z3.d, z3.d

        eor z4.d, z4.d, z4.d
        eor z5.d, z5.d, z5.d
        eor z6.d, z6.d, z6.d
        eor z7.d, z7.d, z7.d

        eor z8.d, z8.d, z8.d
        eor z9.d, z9.d, z9.d
        eor z10.d, z10.d, z10.d
        eor z11.d, z11.d, z11.d

        eor z12.d, z12.d, z12.d
        eor z13.d, z13.d, z13.d
        eor z14.d, z14.d, z14.d
        eor z15.d, z15.d, z15.d

        eor z16.d, z16.d, z16.d
        eor z17.d, z17.d, z17.d
        eor z18.d, z18.d, z18.d
        eor z19.d, z19.d, z19.d

        eor z20.d, z20.d, z20.d
        eor z21.d, z21.d, z21.d
        eor z22.d, z22.d, z22.d
        eor z23.d, z23.d, z23.d

        eor z24.d, z24.d, z24.d
        eor z25.d, z25.d, z25.d
        eor z26.d, z26.d, z26.d
        eor z27.d, z27.d, z27.d

        eor z28.d, z28.d, z28.d
        eor z29.d, z29.d, z29.d
        eor z30.d, z30.d, z30.d
        eor z31.d, z31.d, z31.d

        // perform the operations
loop_repeat:
        sub x0, x0, #1
        fmla z0.s, p0/m, z30.s, z31.s
        fmla z1.s, p0/m, z30.s, z31.s
        fmla z2.s, p0/m, z30.s, z31.s
        fmla z3.s, p0/m, z30.s, z31.s

        fmla z4.s, p0/m, z30.s, z31.s
        fmla z5.s, p0/m, z30.s, z31.s
        fmla z6.s, p0/m, z30.s, z31.s
        fmla z7.s, p0/m, z30.s, z31.s

        fmla z8.s, p0/m, z30.s, z31.s
        fmla z9.s, p0/m, z30.s, z31.s
        fmla z10.s, p0/m, z30.s, z31.s
        fmla z11.s, p0/m, z30.s, z31.s

        fmla z12.s, p0/m, z30.s, z31.s
        fmla z13.s, p0/m, z30.s, z31.s
        fmla z14.s, p0/m, z30.s, z31.s
        fmla z15.s, p0/m, z30.s, z31.s

        fmla z16.s, p0/m, z30.s, z31.s
        fmla z17.s, p0/m, z30.s, z31.s
        fmla z18.s, p0/m, z30.s, z31.s
        fmla z19.s, p0/m, z30.s, z31.s

        fmla z20.s, p0/m, z30.s, z31.s
        fmla z21.s, p0/m, z30.s, z31.s
        fmla z22.s, p0/m, z30.s, z31.s
        fmla z23.s, p0/m, z30.s, z31.s
        
        fmla z24.s, p0/m, z30.s, z31.s
        fmla z25.s, p0/m, z30.s, z31.s
        fmla z26.s, p0/m, z30.s, z31.s
        fmla z27.s, p0/m, z30.s, z31.s

        fmla z28.s, p0/m, z30.s, z31.s
        fmla z29.s, p0/m, z30.s, z31.s
        cbnz x0, loop_repeat

        // PCS: restore SIMD registers
        ldp d14, d15, [sp], #16
        ldp d12, d13, [sp], #16
        ldp d10, d11, [sp], #16
        ldp  d8,  d9, [sp], #16


        // write number of flops to return register
        mov x0, 30*16

        ret
        .size peak_sve_fmla_sp, (. - peak_sve_fmla_sp)

Now, I am able to almost sustain the assumed theoretical peak when using a single core of a c7g.16xlarge instance:

[fedora@ip-172-31-24-119 aarch64_micro]$ OMP_PLACES={0}:64:1 OMP_NUM_THREADS=1 perf stat build/micro_sve
running SVE microbenchmarks
peak_sve_fmla_sp
  repetitions: 1730715545
  duration: 10.0032 seconds
  GFLOPS: 83.0478
peak_sve_fmla_dp
  repetitions: 1731000752
  duration: 10.0054 seconds
  GFLOPS: 41.5216
finished SVE microbenchmarks

 Performance counter stats for 'build/micro_sve':

         20,018.84 msec task-clock:u              #    1.000 CPUs utilized          
                 0      context-switches:u        #    0.000 /sec                   
                 0      cpu-migrations:u          #    0.000 /sec                   
               120      page-faults:u             #    5.994 /sec                   
    51,961,113,748      cycles:u                  #    2.596 GHz                    
   110,841,516,422      instructions:u            #    2.13  insn per cycle         
   <not supported>      branches:u                                                  
            20,884      branch-misses:u                                             

      20.022476242 seconds time elapsed

      20.015139000 seconds user
       0.000000000 seconds sys

The performance of a single c7g.4xlarge core is at least close:

[fedora@ip-172-31-28-219 aarch64_micro]$ OMP_PLACES={0}:16:1 OMP_NUM_THREADS=1 perf stat build/micro_sve
running SVE microbenchmarks
peak_sve_fmla_sp
  repetitions: 1692693235
  duration: 10.0608 seconds
  GFLOPS: 80.7581
peak_sve_fmla_dp
  repetitions: 1729887420
  duration: 10.0709 seconds
  GFLOPS: 41.225
finished SVE microbenchmarks

 Performance counter stats for 'build/micro_sve':

         20,142.81 msec task-clock:u              #    1.000 CPUs utilized          
                 0      context-switches:u        #    0.000 /sec                   
                 0      cpu-migrations:u          #    0.000 /sec                   
               120      page-faults:u             #    5.957 /sec                   
    52,312,307,459      cycles:u                  #    2.597 GHz                    
   109,589,127,048      instructions:u            #    2.09  insn per cycle         
   <not supported>      branches:u                                                  
            20,972      branch-misses:u                                             

      20.146160162 seconds time elapsed

      20.139928000 seconds user
       0.000000000 seconds sys

When using a single core of a c7g.2xlarge, there's a large drop to 76 GFLOPS:

[ec2-user@ip-172-31-19-148 ~]$ OMP_NUM_THREADS=1 ./build/micro_sve 
running SVE microbenchmarks
peak_sve_fmla_sp
  repetitions: 1598328787
  duration: 10.0323 seconds
  GFLOPS: 76.4727
peak_sve_fmla_dp
  repetitions: 1600978005
  duration: 10.0488 seconds
  GFLOPS: 38.2369
finished SVE microbenchmarks

Strangely, I am observing a similar behavior on the larger instances when trying to use more than one core.
Here are measurements for 2, 4, 8, 16, 32 and 64 cores of a c7g.16xlarge:

[fedora@ip-172-31-24-119 aarch64_micro]$ OMP_PLACES={0}:64:1 OMP_NUM_THREADS=2 perf stat build/micro_sve
running SVE microbenchmarks
peak_sve_fmla_sp
  repetitions: 1725007275
  duration: 9.97009 seconds
  GFLOPS: 166.098
peak_sve_fmla_dp
  repetitions: 1728473120
  duration: 9.98952 seconds
  GFLOPS: 83.0537
finished SVE microbenchmarks

 Performance counter stats for 'build/micro_sve':

         39,936.28 msec task-clock:u              #    1.999 CPUs utilized          
                 0      context-switches:u        #    0.000 /sec                   
                 0      cpu-migrations:u          #    0.000 /sec                   
               121      page-faults:u             #    3.030 /sec                   
   103,672,983,374      cycles:u                  #    2.596 GHz                    
   221,162,325,597      instructions:u            #    2.13  insn per cycle         
   <not supported>      branches:u                                                  
            21,613      branch-misses:u                                             

      19.973798458 seconds time elapsed

      39.927679000 seconds user
       0.000000000 seconds sys

[fedora@ip-172-31-24-119 aarch64_micro]$ OMP_PLACES={0}:64:1 OMP_NUM_THREADS=4 perf stat build/micro_sve
running SVE microbenchmarks
peak_sve_fmla_sp
  repetitions: 1725939454
  duration: 9.97473 seconds
  GFLOPS: 332.22
peak_sve_fmla_dp
  repetitions: 1726312489
  duration: 9.97691 seconds
  GFLOPS: 166.109
finished SVE microbenchmarks

 Performance counter stats for 'build/micro_sve':

         79,836.42 msec task-clock:u              #    3.999 CPUs utilized          
                 0      context-switches:u        #    0.000 /sec                   
                 0      cpu-migrations:u          #    0.000 /sec                   
               125      page-faults:u             #    1.566 /sec                   
   207,269,007,131      cycles:u                  #    2.596 GHz                    
   442,168,462,669      instructions:u            #    2.13  insn per cycle         
   <not supported>      branches:u                                                  
            20,704      branch-misses:u                                             

      19.965857449 seconds time elapsed

      79.819401000 seconds user
       0.000000000 seconds sys


[fedora@ip-172-31-24-119 aarch64_micro]$ OMP_PLACES={0}:64:1 OMP_NUM_THREADS=8 perf stat build/micro_sve
running SVE microbenchmarks
peak_sve_fmla_sp
  repetitions: 1595665661
  duration: 10.3982 seconds
  GFLOPS: 589.27
peak_sve_fmla_dp
  repetitions: 1498586458
  duration: 9.77112 seconds
  GFLOPS: 294.468
finished SVE microbenchmarks

 Performance counter stats for 'build/micro_sve':

        154,807.52 msec task-clock:u              #    7.669 CPUs utilized          
                 0      context-switches:u        #    0.000 /sec                   
                 0      cpu-migrations:u          #    0.000 /sec                   
               133      page-faults:u             #    0.859 /sec                   
   401,908,286,632      cycles:u                  #    2.596 GHz                    
   792,719,559,909      instructions:u            #    1.97  insn per cycle         
   <not supported>      branches:u                                                  
            22,628      branch-misses:u                                             

      20.185053347 seconds time elapsed

     154.774853000 seconds user
       0.000000000 seconds sys


[fedora@ip-172-31-24-119 aarch64_micro]$ OMP_PLACES={0}:64:1 OMP_NUM_THREADS=16 perf stat build/micro_sve
running SVE microbenchmarks
peak_sve_fmla_sp
  repetitions: 1307910346
  duration: 9.76124 seconds
  GFLOPS: 1029.04
peak_sve_fmla_dp
  repetitions: 1334252900
  duration: 9.95711 seconds
  GFLOPS: 514.56
finished SVE microbenchmarks

 Performance counter stats for 'build/micro_sve':

        315,248.61 msec task-clock:u              #   15.973 CPUs utilized          
                 0      context-switches:u        #    0.000 /sec                   
                 0      cpu-migrations:u          #    0.000 /sec                   
               149      page-faults:u             #    0.473 /sec                   
   818,493,379,816      cycles:u                  #    2.596 GHz                    
 1,353,991,835,416      instructions:u            #    1.65  insn per cycle         
   <not supported>      branches:u                                                  
            24,883      branch-misses:u                                             

      19.736475374 seconds time elapsed

     315.181031000 seconds user
       0.000000000 seconds sys


[fedora@ip-172-31-24-119 aarch64_micro]$ OMP_PLACES={0}:64:1 OMP_NUM_THREADS=32 perf stat build/micro_sve
running SVE microbenchmarks
peak_sve_fmla_sp
  repetitions: 1163100406
  duration: 9.71002 seconds
  GFLOPS: 1839.87
peak_sve_fmla_dp
  repetitions: 1196469076
  duration: 9.98596 seconds
  GFLOPS: 920.18
finished SVE microbenchmarks

 Performance counter stats for 'build/micro_sve':

        626,051.22 msec task-clock:u              #   31.752 CPUs utilized          
                 0      context-switches:u        #    0.000 /sec                   
                 0      cpu-migrations:u          #    0.000 /sec                   
               185      page-faults:u             #    0.296 /sec                   
 1,625,440,957,093      cycles:u                  #    2.596 GHz                    
 2,418,752,455,251      instructions:u            #    1.49  insn per cycle         
   <not supported>      branches:u                                                  
            29,207      branch-misses:u                                             

      19.716774631 seconds time elapsed

     625.582341000 seconds user
       0.319759000 seconds sys


[fedora@ip-172-31-24-119 aarch64_micro]$ OMP_PLACES={0}:64:1 OMP_NUM_THREADS=64 perf stat build/micro_sve
running SVE microbenchmarks
peak_sve_fmla_sp
  repetitions: 1062696093
  duration: 9.54021 seconds
  GFLOPS: 3421.94
peak_sve_fmla_dp
  repetitions: 1114298003
  duration: 10.0057 seconds
  GFLOPS: 1710.58
finished SVE microbenchmarks

 Performance counter stats for 'build/micro_sve':

      1,245,583.31 msec task-clock:u              #   63.642 CPUs utilized          
                 0      context-switches:u        #    0.000 /sec                   
                 0      cpu-migrations:u          #    0.000 /sec                   
               255      page-faults:u             #    0.205 /sec                   
 3,233,529,000,308      cycles:u                  #    2.596 GHz                    
 4,463,949,961,374      instructions:u            #    1.38  insn per cycle         
   <not supported>      branches:u                                                  
            36,783      branch-misses:u                                             

      19.571567046 seconds time elapsed

    1244.648765000 seconds user
       0.599638000 seconds sys

Is this behavior known or expected?
Is there any information available on this?

Compilation of Pytorch fails

I'm following the instructions in this Python doc, but the build fails with the following error:

/home/ubuntu/pytorch/aten/src/ATen/cpu/vec256/missing_vst1_neon.h:5:1: error: redefinition of ‘void vst1q_f32_x2(float32_t*, float32x4x2_t)’
    5 | vst1q_f32_x2 (float32_t * __a, float32x4x2_t val)
      | ^~~~~~~~~~~~
In file included from /home/ubuntu/pytorch/aten/src/ATen/cpu/vec256/intrinsics.h:22,
                 from /home/ubuntu/pytorch/aten/src/ATen/cpu/FlushDenormal.cpp:3:
/usr/lib/gcc/aarch64-linux-gnu/9/include/arm_neon.h:28197:1: note: ‘void vst1q_f32_x2(float32_t*, float32x4x2_t)’ previously defined here
28197 | vst1q_f32_x2 (float32_t * __a, float32x4x2_t val)
      | ^~~~~~~~~~~~
make[2]: *** [caffe2/CMakeFiles/torch_cpu.dir/build.make:636: caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/cpu/FlushDenormal.cpp.o] Error 1
make[2]: *** Waiting for unfinished jobs....
make[1]: *** [CMakeFiles/Makefile2:8902: caffe2/CMakeFiles/torch_cpu.dir/all] Error 2
make: *** [Makefile:141: all] Error 2

I'm compiling Pytorch on ubuntu 20.04 and the compiler is gcc 9.3.0.

It seems Pytorch changed recently: aten/src/ATen/cpu/vec256/missing_vst1_neon.h was added to Pytorch very recently. Could you provide clear instructions on compiling Pytorch?

gcc -march=armv8-a generates LSE instructions but not -march=armv8.2-a or -mcpu=neoverse-n1

Hi,

I compiled PostgreSQL without specific flags (on m6gd, as in this blog post https://dev.to/aws-heroes/aws-postgresql-on-graviton2-with-newer-gcc-3aha) and ./configure compiles with -march=armv8-a+crc.
I tried -march=armv8.2-a, -mcpu=neoverse-n1, and -mtune=neoverse-n1, as that is what Graviton2 is supposed to be, but didn't get LSE optimisations.

I've tested different combinations.
I have a c6gn instance:

[ec2-user@ip-172-31-4-46 ~]$ echo $(curl -s http://169.254.169.254/latest/meta-data/instance-type) $(uname -m)
c6gn.xlarge aarch64

This is Graviton 2:

[ec2-user@ip-172-31-4-46 ~]$ head /proc/cpuinfo
processor       : 0
BogoMIPS        : 243.75
Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
CPU implementer : 0x41
CPU architecture: 8
CPU variant     : 0x3
CPU part        : 0xd0c
CPU revision    : 1

This is ARM (0x41) Neoverse N1 (0xD0CU)

I have the latest PostgreSQL devel snapshot (https://ftp.postgresql.org/pub/snapshot/dev/postgresql-snapshot.tar.gz) and GCC 11:

PostgreSQL 14devel on aarch64-unknown-linux-gnu, compiled by gcc (GCC) 11.1.1 20210515, 64-bit

Without setting CFLAGS the ./configure runs with -march=armv8-a+crc and I have the following ARM optimisations:

[ec2-user@ip-172-31-4-46 postgresql-14devel]$ objdump -d /usr/local/pgsql/bin/postgres | awk '/(ld|st)a?xr/{print $3}/__aarch64_/{sub(/[>+].*/,">",$0);print $NF}' | sort | uniq -c
    112 <__aarch64_cas4_acq_rel>
     13 <__aarch64_cas8_acq_rel>
     13 <__aarch64_ldadd4_acq_rel>
      5 <__aarch64_ldadd8_acq_rel>
      4 <__aarch64_ldclr4_acq_rel>
      6 <__aarch64_ldset4_acq_rel>
     34 <__aarch64_swp4_acq>
      7 ldaxr
      1 stxr

However, if I add the mcpu for Neoverse N1, with CFLAGS="-march=armv8-a -mcpu=neoverse-n1", I see fewer cas4 instructions:

[ec2-user@ip-172-31-4-46 postgresql-14devel]$ objdump -d /usr/local/pgsql/bin/postgres | awk '/(ld|st)a?xr/{print $3}/__aarch64_/{sub(/[>+].*/,">",$0);print $NF}' | sort | uniq -c
     10 <__aarch64_cas4_acq_rel>
     13 <__aarch64_cas8_acq_rel>
     13 <__aarch64_ldadd4_acq_rel>
      5 <__aarch64_ldadd8_acq_rel>
      4 <__aarch64_ldclr4_acq_rel>
      6 <__aarch64_ldset4_acq_rel>
     34 <__aarch64_swp4_acq>
      7 ldaxr
      1 stxr

With CFLAGS="-march=armv8-a+crc -mtune=neoverse-n1" there are more cas8

[ec2-user@ip-172-31-4-46 postgresql-14devel]$ objdump -d /usr/local/pgsql/bin/postgres | awk '/(ld|st)a?xr/{print $3}/__aarch64_/{sub(/[>+].*/,">",$0);print $NF}' | sort | uniq -c
     10 <__aarch64_cas4_acq_rel>
    117 <__aarch64_cas8_acq_rel>
     13 <__aarch64_ldadd4_acq_rel>
      5 <__aarch64_ldadd8_acq_rel>
      4 <__aarch64_ldclr4_acq_rel>
      6 <__aarch64_ldset4_acq_rel>
     34 <__aarch64_swp4_acq>
      7 ldaxr
      1 stxr

And finally the recommendation from https://github.com/aws/aws-graviton-getting-started/blob/main/c-c%2B%2B.md#cc-on-graviton
flags "-march=armv8.2-a+fp16+rcpc+dotprod+crypto -mtune=neoverse-n1":

[ec2-user@ip-172-31-4-46 postgresql-14devel]$ objdump -d /usr/local/pgsql/bin/postgres | awk '/(ld|st)a?xr/{print $3}/__aarch64_/{sub(/[>+].*/,">",$0);print $NF}' | sort | uniq -c
[ec2-user@ip-172-31-4-46 postgresql-14devel]$ nm ./src/backend/postgres | grep -E "aarch64(_have_lse_atomics)?"

I see no LSE instructions.

Performance Benchmarking Result of Kafka with CLient Encryption in Graviton is Worse compared to Non-Graviton

Hi, we were trying to do a fixed-load performance test of Kafka on Graviton (r6g.large) vs non-Graviton (r5.large), and it seems that Kafka on Graviton is doing much worse than its non-Graviton counterpart (only around half the throughput). The setup:

Here are the flame graphs of the two runs, sampled at 1000 Hz for 1 minute during load, in the zip file: flame-G is for the Graviton node and flame-NG is for the non-Graviton node.
FlameGraphs.zip

It seems to me that on Graviton we spend much more time encoding the encrypted message. Is there a known issue and workaround for this? For additional information, I did the same benchmarking setup except I turned off client encryption. The result is that Graviton performs better than its non-Graviton counterpart. Find attached the benchmarking result.
BenchmarkingResult.tar.gz

Thanks for your help.

ffmpeg performance on graviton3 is not as good as x86

Hi, I used phoronix-test-suite to benchmark performance on x86 and Graviton3 instances. This test suite uses vbench to benchmark the performance of ffmpeg. I used c6a.8xlarge and c7g.8xlarge instances. However, the result is not what I expected. Maybe I did something wrong?

Scenario

Encoder: libx264 - Scenario: Video on demand

c6a.8xlarge

FPS: 44.6
seconds: 170

c7g.8xlarge

FPS: 30.64
seconds: 247

From this blog post, it seems Graviton has some optimizations:

https://aws.amazon.com/blogs/opensource/optimized-video-encoding-with-ffmpeg-on-aws-graviton-processors/

OpenSearch is slower on Graviton

Hi, I am benchmarking OpenSearch on Graviton vs x86 using the same node configuration as https://aws.amazon.com/blogs/big-data/improved-performance-with-aws-graviton2-instances-on-amazon-opensearch-service/ lists.


And I am using esrally with the pmc track and the append-no-conflicts challenge. Here is my benchmarking result:

Mean Throughput

| Operation | Unit | r5.2xlarge | r6g.2xlarge |
|---|---|---|---|
| index-append | docs/s | 1050.14 | 1615.12 |
| default | ops/s | 19.98 | 19.95 |
| term | ops/s | 19.99 | 19.98 |
| phrase | ops/s | 20 | 19.99 |
| articles_monthly_agg_uncached | ops/s | 19.97 | 19.93 |
| articles_monthly_agg_cached | ops/s | 20.01 | 19.99 |
| scroll | pages/s | 9.51 | 10.52 |

90th percentile latency

| Operation | Unit | r5.2xlarge | r6g.2xlarge |
|---|---|---|---|
| index-append | ms | 6,957.41 | 4,764.48 |
| default | ms | 27.37 | 29.29 |
| term | ms | 27.36 | 31.99 |
| phrase | ms | 25.37 | 29.97 |
| articles_monthly_agg_uncached | ms | 26.08 | 29.36 |
| articles_monthly_agg_cached | ms | 16.69 | 22.00 |
| scroll | ms | 89,178.84 | 17,291.76 |

Except for index-append and scroll, all other operations have higher latency on Graviton.
Did I miss something?
Thanks in advance.

Quant calculations is taking more time than expected

Hi, we were trying to do a benchmark on Graviton (c6g.2xlarge) vs non-Graviton (c5.2xlarge), and it seems that calculations on Graviton are slower than on its non-Graviton counterpart. The setup:

Compiled C++ code from https://www.quantstart.com/articles/Asian-option-pricing-with-C-via-Monte-Carlo-Methods/
GCC 11.2.1
Amazon Linux 2022

Here are the flame graph of the two runs.
perf-c6g
perf -c5

From the graphs it seems the function calc_path_spot_prices is taking more time on Graviton. So I had a look and realised the function is using exp in its calculations. Is the math library not optimized on Arm? How can we optimize the math routines?

Thanks for your help.

Missing GCC libquadmath

Missing libquadmath on Graviton2 causes the following error while compiling source code based on the Boost library:
quadmath.h: No such file or directory

This is because float128.hpp from the Boost library has a dependency on quadmath.h from GCC.

However, GCC won't compile libquadmath on the aarch64 architecture since it already supports the long double data type. As suggested in the bug report, it is recommended to use tgmath.h instead of quadmath.h. However, it seems the Boost library doesn't support tgmath.h yet.

Based on the patch attached to the bug report mentioned above, it is possible to compile libquadmath on Graviton2 to resolve the issue. This repo contains information on compiling libquadmath on Graviton2 for GCC 9.4.

I would like to contribute the content to this repo. Before doing so, could you give me some opinions?

Thanks in advance.

[c7g.8xlarge] -mcpu=native vs -mcpu=neoverse-512tvb

According to the guide
https://github.com/aws/aws-graviton-getting-started/blob/main/c-c%2B%2B.md

it seems I should use -mcpu=neoverse-512tvb if I am using a Graviton3 instance to get the best performance.
But when I run echo | gcc -E - -mcpu=native -v 2>&1 | grep cc1, it shows that -mcpu=native expands to:

cc1 -E -quiet -v -imultiarch aarch64-linux-gnu - -mlittle-endian -mabi=lp64 -mcpu=zeus+crypto+sha3+sm4+noprofile -fasynchronous-unwind-tables -fstack-protector-strong -Wformat -Wformat-security -fstack-clash-protection -dumpbase -

https://community.arm.com/arm-community-blogs/b/tools-software-ides-blog/posts/compiler-flags-across-architectures-march-mtune-and-mcpu?_ga=2.125309322.1012238192.1649067243-470348966.1648718387#_ftn3

According to an article from Arm, it seems -mcpu=native is the better choice, so should I replace -mcpu=neoverse-512tvb with -mcpu=native?

bug: EKS cluster running on M7g.large node broke overnight

I'm currently working on a graviton pattern in the cdk-eks-blueprints-patterns repository, and I left my cluster running overnight with 2 EKS addons, fluxcd and argocd.

When I went to look this morning, all of my pods were in an unknown state, with the one for the aws-node daemonset being in a CrashLoopBackoff. Yesterday while I was working I had no issues, and I was constantly adding and removing different addons.

I have tried scaling up to 3 nodes, but the pods are still in an unknown state. The AWS console shows no issues with the cluster or any of the nodes, and there are no issues with the instances in EC2.

ModuleNotFoundError after build and run

porting-advisor-for-graviton/dist [main] $ ./porting-advisor-macosx-arm64 ~/projects/pexa_app
Traceback (most recent call last):
  File "porting-advisor.py", line 23, in <module>
  File "advisor/__init__.py", line 26, in main
  File "PyInstaller/loader/pyimod03_importers.py", line 495, in exec_module
  File "advisor/main.py", line 28, in <module>
ModuleNotFoundError: No module named 'advisor.reports.report_factory'
[90036] Failed to execute script 'porting-advisor' due to unhandled exception!
araza porting-advisor-for-graviton/dist [main] $

jamspell not getting installed on arm c6

Here are the logs:

pip3 install jamspell
Collecting jamspell
  Using cached jamspell-0.0.11.tar.gz (60 kB)
Building wheels for collected packages: jamspell
  Building wheel for jamspell (setup.py) ... error
  ERROR: Command errored out with exit status 1:
   command: /usr/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-hdkdegey/jamspell/setup.py'"'"'; __file__='"'"'/tmp/pip-install-hdkdegey/jamspell/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-chqx4k0p
       cwd: /tmp/pip-install-hdkdegey/jamspell/
  Complete output (41 lines):
  running bdist_wheel
  running build
  running build_ext
  building '_jamspell' extension
  Traceback (most recent call last):
    File "<string>", line 1, in <module>
    File "/tmp/pip-install-hdkdegey/jamspell/setup.py", line 51, in <module>
      setup(
    File "/usr/lib/python3/dist-packages/setuptools/__init__.py", line 144, in setup
      return distutils.core.setup(**attrs)
    File "/usr/lib/python3.8/distutils/core.py", line 148, in setup
      dist.run_commands()
    File "/usr/lib/python3.8/distutils/dist.py", line 966, in run_commands
      self.run_command(cmd)
    File "/usr/lib/python3.8/distutils/dist.py", line 985, in run_command
      cmd_obj.run()
    File "/usr/lib/python3/dist-packages/wheel/bdist_wheel.py", line 223, in run
      self.run_command('build')
    File "/usr/lib/python3.8/distutils/cmd.py", line 313, in run_command
      self.distribution.run_command(command)
    File "/usr/lib/python3.8/distutils/dist.py", line 985, in run_command
      cmd_obj.run()
    File "/tmp/pip-install-hdkdegey/jamspell/setup.py", line 33, in run
      self.run_command('build_ext')
    File "/usr/lib/python3.8/distutils/cmd.py", line 313, in run_command
      self.distribution.run_command(command)
    File "/usr/lib/python3.8/distutils/dist.py", line 985, in run_command
      cmd_obj.run()
    File "/usr/lib/python3.8/distutils/command/build_ext.py", line 340, in run
      self.build_extensions()
    File "/usr/lib/python3.8/distutils/command/build_ext.py", line 449, in build_extensions
      self._build_extensions_serial()
    File "/usr/lib/python3.8/distutils/command/build_ext.py", line 474, in _build_extensions_serial
      self.build_extension(ext)
    File "/usr/lib/python3.8/distutils/command/build_ext.py", line 506, in build_extension
      sources = self.swig_sources(sources, ext)
    File "/usr/lib/python3.8/distutils/command/build_ext.py", line 597, in swig_sources
      swig = self.swig or self.find_swig()
    File "/tmp/pip-install-hdkdegey/jamspell/setup.py", line 46, in find_swig
      assert subprocess.check_output([swigBinary, "-version"]).find(b'SWIG Version 3') != -1
  AssertionError
  ----------------------------------------
  ERROR: Failed building wheel for jamspell
  Running setup.py clean for jamspell
Failed to build jamspell
Installing collected packages: jamspell
    Running setup.py install for jamspell ... error
    ERROR: Command errored out with exit status 1:
     command: /usr/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-hdkdegey/jamspell/setup.py'"'"'; __file__='"'"'/tmp/pip-install-hdkdegey/jamspell/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-o98yr49h/install-record.txt --single-version-externally-managed --compile --install-headers /usr/local/include/python3.8/jamspell
         cwd: /tmp/pip-install-hdkdegey/jamspell/
    Complete output (34 lines):
    running install
    running build_ext
    building '_jamspell' extension
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-install-hdkdegey/jamspell/setup.py", line 51, in <module>
        setup(
      File "/usr/lib/python3/dist-packages/setuptools/__init__.py", line 144, in setup
        return distutils.core.setup(**attrs)
      File "/usr/lib/python3.8/distutils/core.py", line 148, in setup
        dist.run_commands()
      File "/usr/lib/python3.8/distutils/dist.py", line 966, in run_commands
        self.run_command(cmd)
      File "/usr/lib/python3.8/distutils/dist.py", line 985, in run_command
        cmd_obj.run()
      File "/tmp/pip-install-hdkdegey/jamspell/setup.py", line 39, in run
        self.run_command('build_ext')
      File "/usr/lib/python3.8/distutils/cmd.py", line 313, in run_command
        self.distribution.run_command(command)
      File "/usr/lib/python3.8/distutils/dist.py", line 985, in run_command
        cmd_obj.run()
      File "/usr/lib/python3.8/distutils/command/build_ext.py", line 340, in run
        self.build_extensions()
      File "/usr/lib/python3.8/distutils/command/build_ext.py", line 449, in build_extensions
        self._build_extensions_serial()
      File "/usr/lib/python3.8/distutils/command/build_ext.py", line 474, in _build_extensions_serial
        self.build_extension(ext)
      File "/usr/lib/python3.8/distutils/command/build_ext.py", line 506, in build_extension
        sources = self.swig_sources(sources, ext)
      File "/usr/lib/python3.8/distutils/command/build_ext.py", line 597, in swig_sources
        swig = self.swig or self.find_swig()
      File "/tmp/pip-install-hdkdegey/jamspell/setup.py", line 46, in find_swig
        assert subprocess.check_output([swigBinary, "-version"]).find(b'SWIG Version 3') != -1
    AssertionError
    ----------------------------------------
ERROR: Command errored out with exit status 1: /usr/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-hdkdegey/jamspell/setup.py'"'"'; __file__='"'"'/tmp/pip-install-hdkdegey/jamspell/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-o98yr49h/install-record.txt --single-version-externally-managed --compile --install-headers /usr/local/include/python3.8/jamspell Check the logs for full command output.
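The AssertionError comes from jamspell's find_swig check in setup.py, which requires a SWIG 3.x binary; a SWIG 4.x installation (the default on newer distributions) fails that assertion. A possible workaround, sketched for Ubuntu/Debian (the swig3.0 package name and symlink location are assumptions):

    # Install SWIG 3.x alongside (or instead of) SWIG 4.x:
    sudo apt-get update
    sudo apt-get install -y swig3.0

    # setup.py may probe for a binary literally named "swig", so a symlink can help:
    sudo ln -sf /usr/bin/swig3.0 /usr/local/bin/swig

    pip3 install jamspell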

Facing issues with NLP packages - Morfeusz & sentencepiece

sentencepiece is throwing the below error:

Downloading sentencepiece-0.1.91.tar.gz (500 kB)
    ERROR: Command errored out with exit status 1:
     command: /usr/bin/python3.7 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-vc7ylqzh/sentencepiece/setup.py'"'"'; __file__='"'"'/tmp/pip-install-vc7ylqzh/sentencepiece/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-pip-egg-info-84jrsx6u
         cwd: /tmp/pip-install-vc7ylqzh/sentencepiece/
    Complete output (5 lines):
    Package sentencepiece was not found in the pkg-config search path.
    Perhaps you should add the directory containing `sentencepiece.pc'
    to the PKG_CONFIG_PATH environment variable
    No package 'sentencepiece' found
    Failed to find sentencepiece pkgconfig

The Morfeusz package is also not getting installed.

Is there any documentation available for this?
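The pkg-config error means the 0.1.91 sdist expects the sentencepiece C++ library to already be installed, and that version has no prebuilt aarch64 wheel. A possible workaround sketch (version numbers are assumptions; newer releases publish aarch64 manylinux wheels):

    # Upgrade pip and pull a release that ships aarch64 wheels, avoiding the source build:
    python3.7 -m pip install --upgrade pip
    python3.7 -m pip install "sentencepiece>=0.1.96"

    # If building from source is unavoidable, install the build prerequisites first:
    sudo apt-get install -y cmake build-essential pkg-config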

Update the name of this guide?

The AWS Graviton Getting Started Guide offers more than just "getting started" material today, going into deeper learning content including performance benchmarking and optimization. As it stands, this site is more of a "Graviton Technical Guide" helping both Graviton users and developers. Would it make sense to update at least the title of the landing README.md page to reflect this? 😃

P.S.: Beyond the updated name reflected in the README.md, I suspect there will be follow-on discussions/considerations around repository naming updates, redirects, etc. This issue is focused on arriving at an agreed-upon new name.

Processor extension for Graviton2

Hello,
This is sort of a blend between a quick question and an issue, as nothing is actually broken. The assumption is that Graviton2 is a Neoverse N1 processor with an Armv8.2-A implementation, as mentioned in the README and other documents. However, GNU is reporting the architecture as Armv8-A without specifying any extension. My question is why this discrepancy exists. The two most obvious explanations are: (i) Graviton2 does not fully implement Armv8.2-A and GNU thus uses a more generic implementation; (ii) Graviton2 is a full Armv8.2-A implementation and GNU, for some reason, is not recognizing it. Maybe there is a third explanation that I don't see. Thanks.
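One way to narrow this down (a sketch; these commands only report what the kernel and the installed GCC see, they don't settle which side is at fault):

    # The kernel's view of the implemented features (fp16, atomics/LSE, asimddp, crypto, ...):
    grep -m1 Features /proc/cpuinfo

    # What this particular GCC build resolves -mcpu=native to:
    echo | gcc -E - -mcpu=native -v 2>&1 | grep cc1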

read and write msr

This guide explains PMUs but says nothing about MSRs. I would like to test the impact of certain MSRs. Is there a recommended method similar to x86's rdmsr and wrmsr?
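For what it's worth, there is no direct equivalent of /dev/cpu/*/msr with rdmsr/wrmsr on Arm; the analogous state lives in system registers accessed with the MRS/MSR instructions, and most of them are only writable from the kernel (EL1). A small read-only sketch of what is reachable from userspace:

    # Some identification registers are exported read-only through sysfs on arm64:
    cat /sys/devices/system/cpu/cpu0/regs/identification/midr_el1

    # Userspace MRS reads of a subset of ID registers are emulated by the kernel; see the
    # kernel documentation (Documentation/arm64/cpu-feature-registers) for the supported set.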
