shi-labs / natten

Neighborhood Attention Extension. Bringing attention to a neighborhood near you!

Home Page: https://shi-labs.com/natten/

License: Other

Languages: Shell 0.05%, Python 4.90%, C++ 41.12%, Cuda 53.85%, Makefile 0.01%, CMake 0.07%, Batchfile 0.01%
Topics: cuda, pytorch, neighborhood-attention

natten's Introduction


Neighborhood Attention Extension

Bringing attention to a neighborhood near you!

(Figures: visualization of neighborhood attention in 2D; visualization of dilated neighborhood attention in 2D.)

NATTEN is an open-source project dedicated to providing fast implementations for Neighborhood Attention, a sliding window self-attention mechanism.

If you're not familiar with neighborhood attention, please refer to our papers, or watch our YouTube video from CVPR 2023.

To read more about our GEMM-based and fused neighborhood attention kernels, please refer to our new preprint, Faster Neighborhood Attention.

New: Fused Neighborhood Attention now supports backpropagation!

We've released the Fused Neighborhood Attention (FNA) backward kernel and interface, which means you can now train models based on neighborhood attention faster and more efficiently.

FNA can be seen as a generalization of methods such as Flash Attention and FMHA from back-to-back matrix multiplication to back-to-back tensor-tensor contraction, and comes with neighborhood attention masking built in. This accelerates neighborhood attention, a multi-dimensional sliding window attention pattern, by never storing the attention tensor to global memory, which, aside from shrinking the global memory footprint, also reduces the memory bandwidth bottleneck.

(Figure: op-level average speedup.)

We highly recommend reading the FNA quick start or the Fused vs unfused NA guide before starting to use FNA, since its interface, memory layout, and feature set differ from those of the unfused ops in NATTEN.
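To make the shape conventions concrete, here is a minimal sketch. It assumes the fused op is exposed as natten.functional.na2d and takes heads-last (batch, height, width, heads, head_dim) tensors, which is how recent releases document it; treat the quick start itself as authoritative.

import torch
from natten.functional import na2d  # assumed fused-op entry point (recent releases)

# Heads-last layout assumed: (batch, height, width, heads, head_dim).
q = torch.randn(1, 56, 56, 4, 32, device="cuda", requires_grad=True)
k = torch.randn(1, 56, 56, 4, 32, device="cuda", requires_grad=True)
v = torch.randn(1, 56, 56, 4, 32, device="cuda", requires_grad=True)

# One fused call replaces the unfused qk -> softmax -> av pipeline; the
# attention tensor is never written to global memory.
out = na2d(q, k, v, kernel_size=7, dilation=1)
out.sum().backward()  # FNA's backward kernel makes this trainable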

Getting started

NATTEN supports PyTorch version 2.0 and later, and Python versions 3.8 and above. Python 3.12 is only supported with torch >= 2.2.0.

Older NATTEN releases supported python >= 3.7 and torch >= 1.8.

Please refer to the install instructions to find out whether your operating system and hardware accelerator are compatible with NATTEN.
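As a quick smoke test, the sketch below mirrors the module usage that appears in the issues further down this page (channels-last inputs; constructor arguments dim, num_heads, kernel_size, dilation). On a CPU-only install it runs through the naive backend.

import torch
from natten import NeighborhoodAttention2D

# Channels-last input: (batch, height, width, channels).
x = torch.randn(2, 32, 32, 128)
na = NeighborhoodAttention2D(dim=128, num_heads=4, kernel_size=7, dilation=1)
y = na(x)
print(y.shape)  # torch.Size([2, 32, 32, 128])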

Feature availability

Problem space | CPU backend       | CUDA backend
1D            | naive             | naive, gemm, fna
2D            | naive             | naive, gemm, fna
3D            | naive             | naive, fna

CPU

Problem space | CPU backend | Causal masking | Varying parameters | Relative positional bias | Autograd support
1D            | naive       | yes            | yes                | yes                      | Forward and reverse mode
2D            | naive       | yes            | yes                | yes                      | Forward and reverse mode
3D            | naive       | yes            | yes                | yes                      | Forward and reverse mode

Notes:

  • Forward mode autograd does not support relative positional biases and causal masking yet.
  • Relative positional biases are not yet supported when any axis has causal masking enabled.

CUDA

Problem space | CUDA backend | Causal masking | Varying parameters | Relative positional bias | Autograd support         | Min. Arch
1D            | naive        | yes            | yes                | yes                      | Forward and reverse mode | SM35
2D            | naive        | yes            | yes                | yes                      | Forward and reverse mode | SM35
3D            | naive        | yes            | yes                | yes                      | Forward and reverse mode | SM35
1D            | gemm         | -              | -                  | yes                      | Forward and reverse mode | SM70
2D            | gemm         | -              | -                  | yes                      | Forward and reverse mode | SM70
1D            | fna          | yes            | yes                | yes                      | Reverse mode             | SM50
2D            | fna          | yes            | yes                | yes                      | Reverse mode             | SM50
3D            | fna          | yes            | yes                | yes                      | Reverse mode             | SM50

Notes:

  • FP16 kernels are only available on SM50 and above*, and BF16 requires SM80 and above.
    • Naive FP16 kernels are only available on SM60 and above.
    • FNA FP16 kernels are only available on SM50 and above.
  • GEMM backend on SM70 and SM75 can only do FP16.
  • Tiled kernels only implement 1/3 of the ops, are only available for 2D problems, and require head dim = 32.
  • Forward mode autograd does not support relative positional biases and causal masking yet.
  • Relative positional biases are not yet supported when any axis has causal masking enabled.
  • Relative positional biases are not supported in FNA during backward pass.

Features that will likely no longer be worked on or improved:

  • Relative positional biases
    • There are simply better alternatives that don't involve explicitly biasing the attention weight matrix; they are more performant while providing similar or better accuracy.
  • GEMM-based kernels
    • Since FNA covers more features than our unfused GEMM-based kernels, and we know it to be a better solution (please refer to Faster Neighborhood Attention for details), we do not plan to extend or improve these kernels.
    • This includes support for varying parameters, causal masking, and 3-D problems.

License

NATTEN is released under the MIT License.

Citation

@misc{hassani2024faster,
  title        = {Faster Neighborhood Attention: Reducing the O(n^2) Cost of Self Attention at the Threadblock Level},
  author       = {Ali Hassani and Wen-Mei Hwu and Humphrey Shi},
  year         = 2024,
  url          = {https://arxiv.org/abs/2403.04690},
  eprint       = {2403.04690},
  archiveprefix = {arXiv},
  primaryclass = {cs.CV}
}
@inproceedings{hassani2023neighborhood,
  title        = {Neighborhood Attention Transformer},
  author       = {Ali Hassani and Steven Walton and Jiachen Li and Shen Li and Humphrey Shi},
  year         = 2023,
  booktitle    = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}
}
@misc{hassani2022dilated,
  title        = {Dilated Neighborhood Attention Transformer},
  author       = {Ali Hassani and Humphrey Shi},
  year         = 2022,
  url          = {https://arxiv.org/abs/2209.15001},
  eprint       = {2209.15001},
  archiveprefix = {arXiv},
  primaryclass = {cs.CV}
}

Acknowledgements

We thank NVIDIA and the CUTLASS project and team for their efforts in creating and open-sourcing CUTLASS. We would also like to thank Haicheng Wu for his valuable feedback and comments, which led to the creation of GEMM-based NA. We also thank Meta and the xFormers team for their FMHA kernel, which our Fused Neighborhood Attention kernel is based on. We thank the PyTorch project and team.

natten's People

Contributors

alihassanijr, anzr299, fgodt, hakanardo, jxtps, lartpang, voletiv


natten's Issues

Pure pytorch version

Great work!

I wonder, is there a pure PyTorch version of natten? Some embedded devices do not support CUDA.
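Until an official pure-PyTorch path covers such devices, neighborhood attention can be emulated in plain PyTorch at the cost of materializing the attention tensor. A minimal sketch (not NATTEN's implementation) for the 2D case with dilation=1 and no relative positional bias, using the same shifted-window rule NATTEN applies at borders:

import torch

def na2d_pure_torch(q, k, v, kernel_size):
    # q, k, v: (batch, heads, H, W, head_dim), with H, W >= kernel_size.
    B, heads, H, W, d = q.shape
    r = kernel_size // 2

    def window_starts(length):
        # Clamp window starts so every query attends to exactly kernel_size keys.
        return (torch.arange(length) - r).clamp(0, length - kernel_size)

    hi = window_starts(H)[:, None] + torch.arange(kernel_size)  # (H, k)
    wi = window_starts(W)[:, None] + torch.arange(kernel_size)  # (W, k)

    def gather(t):
        # Gather each query's k*k neighborhood: (B, heads, H, W, k*k, d).
        t = t[:, :, hi]        # (B, heads, H, k, W, d)
        t = t[:, :, :, :, wi]  # (B, heads, H, k, W, k, d)
        return t.permute(0, 1, 2, 4, 3, 5, 6).reshape(B, heads, H, W, kernel_size**2, d)

    kn, vn = gather(k), gather(v)
    attn = torch.einsum("bhijd,bhijnd->bhijn", q * d**-0.5, kn).softmax(dim=-1)
    return torch.einsum("bhijn,bhijnd->bhijd", attn, vn)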

macOS build failed.

NATTEN/natten/src/cpp/natten1dqkrpb_cpu_kernel.cpp:59:26: error: expression is not assignable
                    updt[d1] += _qaddr[d1] * _kaddr[d1];
                    ~~~~~~~~ ^

with clang

Using attention as prediction

I have an image partitioned into a grid, where each cell is either an object or background. I want to discover how these objects are connected (i.e., build an adjacency matrix). In the case where I only have a list of objects, rather than a grid sparsely populated with objects, I solved this by using standard scaled dot-product attention to predict an adjacency matrix. Now I want to apply this to images. Using natten2dqkrpb I can get the attention over the neighborhood of every patch (the last dimension is kernel_size*kernel_size).

My problem is that I need to know the h,w-indices of the respective neighborhoods to relate this attention back to my labels. This must be computed internally in natten2dqkrpb anyway. Is there a way to get/calculate a (h, w, kernel_size*kernel_size, 2) tensor that holds all neighborhood indices? I guess that would be a subset of what natten already computes.
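The window rule is deterministic, so such an index tensor can be reconstructed outside NATTEN. A sketch (hypothetical helper, dilation=1; the last axis is ordered row-major over the window to match the attention tensor's kernel_size*kernel_size axis):

import torch

def na2d_neighborhood_indices(height, width, kernel_size):
    # Returns (height, width, kernel_size**2, 2): the (h, w) coordinates each
    # query attends to. Windows shift, rather than shrink, at the borders.
    r = kernel_size // 2

    def starts(length):
        return (torch.arange(length) - r).clamp(0, length - kernel_size)

    h_idx = starts(height)[:, None] + torch.arange(kernel_size)  # (H, k)
    w_idx = starts(width)[:, None] + torch.arange(kernel_size)   # (W, k)
    hh = h_idx[:, None, :, None].expand(height, width, kernel_size, kernel_size)
    ww = w_idx[None, :, None, :].expand(height, width, kernel_size, kernel_size)
    return torch.stack((hh, ww), dim=-1).reshape(height, width, kernel_size**2, 2)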

Error in PyTorch container for compiled and build version

Hi,

I am using the PyTorch container from the NGC Catalog (NVIDIA) and ran into problems both when I tried to use the precompiled version (cu117/torch1.13) and when I compiled it myself.

The container version is 22.08. According to the release page, this version includes CUDA 11.7.1 and PyTorch 1.13.0a0+d321be6.

I ran the container with the command docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --rm -it nvcr.io/nvidia/pytorch:22.08-py3
Then I checked the version with this:

root@bef9234ff2a2:/workspace# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Jun__8_16:49:14_PDT_2022
Cuda compilation tools, release 11.7, V11.7.99
Build cuda_11.7.r11.7/compiler.31442593_0

root@bef9234ff2a2:/workspace# python3 -c "import torch; print(torch.__version__)"
1.13.0a0+d321be6

Compiled version problem

I installed the compiled version without problems:

root@bef9234ff2a2:/workspace# pip3 install natten -f https://shi-labs.com/natten/wheels/cu117/torch1.13/index.html 
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Looking in links: https://shi-labs.com/natten/wheels/cu117/torch1.13/index.html
Collecting natten
  Downloading https://shi-labs.com/natten/wheels/cu117/torch1.13/natten-0.14.6%2Btorch1130cu117-cp38-cp38-linux_x86_64.whl (55.3 MB)
     |████████████████████████████████| 55.3 MB 861 kB/s 
Requirement already satisfied: packaging in /opt/conda/lib/python3.8/site-packages (from natten) (21.3)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /opt/conda/lib/python3.8/site-packages (from packaging->natten) (3.0.9)
Installing collected packages: natten
Successfully installed natten-0.14.6+torch1130cu117
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

Finally, I imported the lib and got an error:

root@bef9234ff2a2:/workspace# python3 -c "from natten import NeighborhoodAttention1D, NeighborhoodAttention2D"
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/natten/functional.py", line 28, in <module>
    from natten import _C
ImportError: libtorch_cuda_cu.so: cannot open shared object file: No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/opt/conda/lib/python3.8/site-packages/natten/__init__.py", line 24, in <module>
    from .natten1d import NeighborhoodAttention1D
  File "/opt/conda/lib/python3.8/site-packages/natten/natten1d.py", line 28, in <module>
    from .functional import natten1dav, natten1dqkrpb
  File "/opt/conda/lib/python3.8/site-packages/natten/functional.py", line 30, in <module>
    raise ImportError(
ImportError: Failed to import NATTEN's CPP backend. This could be due to an invalid/incomplete install. Please uninstall NATTEN (pip uninstall natten) and re-install with the correct torch build: shi-labs.com/natten

Build from source problem

In a new Docker container, I installed ninja, downloaded the official repo, and ran pip install -e ., but I got this long error:
https://gist.github.com/camilo-nunez/7e36a2f612bc8028fec180bffd1a091d

What can I do :c ?

no attribute 'natten3dqkrpb_forward'

Sorry for bothering you again.

Does the natten package support 3-D NA already?
It seems that there is no such function in the package.

AttributeError: module 'natten._C' has no attribute 'natten3dqkrpb_forward'

Best Regards,
Aaron

Causality

Great project and implementation!
Causality would be a good feature addition;
it seems it is already partially implemented (the rightmost edge pixel only looks at values before it), so if this operation were accessible through Python, it would enable a good set of use cases.

Relative encoding

Does natten support relative positional encoding, as proposed by Shaw et al. in "Self-Attention with Relative Position Representations", in its attention modules?

CUDA version mismatch

I am trying to install natten on Kaggle for OneFormer. Kaggle has CUDA version 12.1.
code:

!pip3 install natten -f https://shi-labs.com/natten/wheels/cu121/torch2.0.0/index.html

error:

Looking in links: https://shi-labs.com/natten/wheels/cu121/torch2.0.0/index.html
Collecting natten
  Downloading natten-0.14.6.tar.gz (505 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 505.7/505.7 kB 9.1 MB/s eta 0:00:00a 0:00:01
  Preparing metadata (setup.py) ... done
Requirement already satisfied: packaging in /opt/conda/lib/python3.10/site-packages (from natten) (23.1)
Building wheels for collected packages: natten
  Building wheel for natten (setup.py) ... error
  error: subprocess-exited-with-error
  
  × python setup.py bdist_wheel did not run successfully.
  │ exit code: 1
  ╰─> [70 lines of output]
      running bdist_wheel
      running build
      running build_py
      creating build
      creating build/lib.linux-x86_64-3.10
      creating build/lib.linux-x86_64-3.10/natten
      copying src/natten/__init__.py -> build/lib.linux-x86_64-3.10/natten
      copying src/natten/natten1d.py -> build/lib.linux-x86_64-3.10/natten
      copying src/natten/natten2d.py -> build/lib.linux-x86_64-3.10/natten
      copying src/natten/functional.py -> build/lib.linux-x86_64-3.10/natten
      copying src/natten/flops.py -> build/lib.linux-x86_64-3.10/natten
      creating build/lib.linux-x86_64-3.10/natten/csrc
      copying src/natten/csrc/natten2dav.h -> build/lib.linux-x86_64-3.10/natten/csrc
      copying src/natten/csrc/natten2dqkrpb.h -> build/lib.linux-x86_64-3.10/natten/csrc
      copying src/natten/csrc/context.h -> build/lib.linux-x86_64-3.10/natten/csrc
      copying src/natten/csrc/natten1dav.h -> build/lib.linux-x86_64-3.10/natten/csrc
      copying src/natten/csrc/natten.cpp -> build/lib.linux-x86_64-3.10/natten/csrc
      copying src/natten/csrc/natten1dqkrpb.h -> build/lib.linux-x86_64-3.10/natten/csrc
      creating build/lib.linux-x86_64-3.10/natten/csrc/cpu
      copying src/natten/csrc/cpu/natten_cpu_commons.h -> build/lib.linux-x86_64-3.10/natten/csrc/cpu
      copying src/natten/csrc/cpu/natten2dav_cpu_kernel.cpp -> build/lib.linux-x86_64-3.10/natten/csrc/cpu
      copying src/natten/csrc/cpu/natten1dav_cpu_kernel.cpp -> build/lib.linux-x86_64-3.10/natten/csrc/cpu
      copying src/natten/csrc/cpu/natten2dqkrpb_cpu_kernel.cpp -> build/lib.linux-x86_64-3.10/natten/csrc/cpu
      copying src/natten/csrc/cpu/natten1dqkrpb_cpu_kernel.cpp -> build/lib.linux-x86_64-3.10/natten/csrc/cpu
      creating build/lib.linux-x86_64-3.10/natten/csrc/cuda
      copying src/natten/csrc/cuda/natten1dqkrpb_cuda_kernel.cu -> build/lib.linux-x86_64-3.10/natten/csrc/cuda
      copying src/natten/csrc/cuda/natten1dav_cuda_kernel.cu -> build/lib.linux-x86_64-3.10/natten/csrc/cuda
      copying src/natten/csrc/cuda/natten_commons.cuh -> build/lib.linux-x86_64-3.10/natten/csrc/cuda
      copying src/natten/csrc/cuda/natten2dqkrpb_cuda_kernel.cu -> build/lib.linux-x86_64-3.10/natten/csrc/cuda
      copying src/natten/csrc/cuda/natten2dav_cuda_kernel.cu -> build/lib.linux-x86_64-3.10/natten/csrc/cuda
      running build_ext
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/tmp/pip-install-rowagv61/natten_17f593ad740d4ad38ccb35df0695cbe1/setup.py", line 118, in <module>
          setup(
        File "/opt/conda/lib/python3.10/site-packages/setuptools/__init__.py", line 153, in setup
          return distutils.core.setup(**attrs)
        File "/opt/conda/lib/python3.10/distutils/core.py", line 148, in setup
          dist.run_commands()
        File "/opt/conda/lib/python3.10/distutils/dist.py", line 966, in run_commands
          self.run_command(cmd)
        File "/opt/conda/lib/python3.10/distutils/dist.py", line 985, in run_command
          cmd_obj.run()
        File "/opt/conda/lib/python3.10/site-packages/wheel/bdist_wheel.py", line 343, in run
          self.run_command("build")
        File "/opt/conda/lib/python3.10/distutils/cmd.py", line 313, in run_command
          self.distribution.run_command(command)
        File "/opt/conda/lib/python3.10/distutils/dist.py", line 985, in run_command
          cmd_obj.run()
        File "/opt/conda/lib/python3.10/distutils/command/build.py", line 135, in run
          self.run_command(cmd_name)
        File "/opt/conda/lib/python3.10/distutils/cmd.py", line 313, in run_command
          self.distribution.run_command(command)
        File "/opt/conda/lib/python3.10/distutils/dist.py", line 985, in run_command
          cmd_obj.run()
        File "/opt/conda/lib/python3.10/site-packages/setuptools/command/build_ext.py", line 79, in run
          _build_ext.run(self)
        File "/opt/conda/lib/python3.10/site-packages/Cython/Distutils/old_build_ext.py", line 186, in run
          _build_ext.build_ext.run(self)
        File "/opt/conda/lib/python3.10/distutils/command/build_ext.py", line 340, in run
          self.build_extensions()
        File "/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 499, in build_extensions
          _check_cuda_version(compiler_name, compiler_version)
        File "/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 387, in _check_cuda_version
          raise RuntimeError(CUDA_MISMATCH_MESSAGE.format(cuda_str_version, torch.version.cuda))
      RuntimeError:
      The detected CUDA version (12.1) mismatches the version that was used to compile
      PyTorch (11.3). Please make sure to use the same CUDA versions.

Can't use kernel sizes bigger than 13

I use very small patches (16 by 16 pixels). To still get a reasonable receptive field I need bigger kernels, but I noticed that natten throws an error when the kernel size exceeds 13. Is there a reason for this cap?

RuntimeError: natten2dqkrpb_cuda_forward does not support kernel size 29

I noticed the readme says arbitrary kernel sizes are a new feature. Is this update available via pip?

Unexpected output shape if query has a different shape

Hi,

I'm trying to use it for 1D cross-attention, but the output shape doesn't align with the expected shape. Any ideas?

import torch
from natten.functional import natten1dav, natten1dqkrpb


q = torch.randn(2, 10, 64, 32)
k = torch.randn(2, 10, 100, 32)
v = torch.randn(2, 10, 100, 32)

# with Natten
attn = natten1dqkrpb(q, k, None, 15, 1)
attn = attn.softmax(dim=-1)
print('attention shape', attn.shape, 'output shape: ', natten1dav(attn, v, 15, 1).shape)

# expected shape
attn = torch.randn(2, 10, 64, 100)
print('attention shape', attn.shape, 'output shape: ', (attn@v).shape)

output shapes:

attention shape torch.Size([2, 10, 64, 15]) output shape:  torch.Size([2, 10, 100, 32])
attention shape torch.Size([2, 10, 64, 100]) output shape:  torch.Size([2, 10, 64, 32])

version problem

My PyTorch version is 1.8.0 (py3.8_cuda11.1_cudnn8.0.5_0), and I installed NATTEN with pip install natten -f https://shi-labs.com/natten/wheels/cu111/torch1.8.0/index.html.

But the following error is raised (the installed source code lacks the kernel_size argument):

attn = natten2dqkrpb(q, k, self.rpb, self.kernel_size, self.dilation)
TypeError: natten2dqkrpb() takes 4 positional arguments but 5 were given

source code:

def natten2dqkrpb(query, key, rpb, dilation):
    return NATTEN2DQKRPBFunction.apply(query, key, rpb, dilation)

About the dot product between q and k

Hello authors, thank you for releasing this great code base for computing Neighborhood Attention.
I am trying to use the lib to compute the dot product between q and k, to get the unnormalized attention scores.

Below is my code for reproducing the result.

import torch
from natten.functional import natten2dqk

H = 3
W = 3
q = torch.tensor(
    [[0., 1., 2.],
     [3., 4., 5.],
     [6., 7., 8.]])

k = torch.tensor(
    [[8., 5., 6.],
     [2., 3., 4.],
     [1., 6., 2.]])

q = q.view(1, 1, H, W, 1)
k = k.view(1, 1, H, W, 1)

attention = natten2dqk(
    query=q,
    key=k,
    kernel_size=2,
    dilation=0,
)
print(attention.long().squeeze())

"""attention = 
 tensor([[[ 0,  0,  0,  0],
          [ 8,  8,  8,  8],
          [16, 16, 16, 16]],
 
         [[24, 24, 24, 24],
          [32, 32, 32, 32],
          [40, 40, 40, 40]],
 
         [[48, 48, 48, 48],
          [56, 56, 56, 56],
          [64, 64, 64, 64]]]))
"""

I think the intuition is that we compute the dot product between every entry of q and the neighboring entries of k within a kernel_size x kernel_size window.

For example, the first inner row of the attention output would be computed as:

# in the first row of out
# [0 * 8, 0 * 5, 0 * 2, 0 * 3]
# -> [ 0,  0,  0,  0]

And for the second inner row of attention:

# in the second row
# [1 * 5, 1 * 6, 1 * 3, 1 * 4]
# -> [ 5,  6,  3,  4]
# But we got [8, 8, 8, 8]?

# ... similar for other attention rows.

I know my intuition must be wrong somewhere. Could you correct me using this simple code?
Thank you so much in advance!

Replace Vanilla Attention with Natten

Hello!!

Thank you for sharing the details on NATTEN. Your research is very promising. I am trying to use natten in place of vanilla attention in my existing ViT-based encoder-decoder architecture; can you please share some details on this? It would be very helpful. My existing architecture breaks the input image into patches, similar to the vanilla ViT. I am thinking it would be great to apply natten to these patches.

Thank you in advance! Looking forward to hearing from you :)
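For what it's worth, a common way to try this is to reshape the patch tokens back onto their 2D grid and drop a NATTEN module where the global attention used to be. A minimal sketch (not an official recipe; the block structure and 4x MLP ratio are illustrative assumptions):

import torch
from torch import nn
from natten import NeighborhoodAttention2D

class NABlock(nn.Module):
    # A ViT-style block with global self-attention swapped for neighborhood attention.
    def __init__(self, dim, num_heads, kernel_size=7):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = NeighborhoodAttention2D(dim=dim, num_heads=num_heads, kernel_size=kernel_size)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x, grid_hw):
        # x: (B, N, C) patch tokens; grid_hw: (H, W) patch grid with H * W == N.
        B, N, C = x.shape
        H, W = grid_hw
        y = self.norm1(x).reshape(B, H, W, C)  # NATTEN expects (B, H, W, C)
        x = x + self.attn(y).reshape(B, N, C)
        return x + self.mlp(self.norm2(x))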

Code details about the neighborhood attention

Hi Ali,

I was reading the details of neighborhood attention. However, I could only find this code snippet in NATTEN/csrc/src/pytorch/na2d.cpp (line 73):

auto attn = torch::empty(
    {batch_size, heads, height, width, kernel_size * kernel_size},
    query.options());
DISPATCH_DEVICE(
    query.device(),
    na2d_qk_forward,
    query, key, bias, attn,
    batch_size, heads, height, width, dim,
    kernel_size, dilation);
return attn;

It looks like this just defines an empty attention tensor and returns it?

May I know where the attention score between the query and key is computed and where the rpb is added?

(0.14.6) CUDA error in `attn = attn.softmax(dim=-1)`

Dear Natten team,
I came across a runtime issue in your code.
I successfully trained a neural network using your model without any problems; at inference time, however, an error occurs, especially with datasets containing large images.

The error is:
RuntimeError: CUDA error: an illegal memory access was encountered CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.

What usually happens is the following:

  • the code runs smoothly for some images
  • then the error [1] occurs on just the next image
  • then the error [2] occurs on every subsequent image

I checked the GPU's RAM usage, but it does not increase over time or run out of memory.

Error [1]:

File "my_test.py", line 93, in <module>
    pred = model(img)
  File "/nas/homes/user/envs/trufor_natten/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/nas/homes/user/algoritmi/DiNAT_seg/nat.py", line 333, in forward
    return self.forward_tokens(x)
  File "/nas/homes/user/algoritmi/DiNAT_seg/nat.py", line 320, in forward_tokens
    x, xo = level(x)
  File "/nas/homes/user/envs/trufor_natten/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/nas/homes/user/algoritmi/DiNAT_seg/nat.py", line 207, in forward
    x = blk(x)
  File "/nas/homes/user/envs/trufor_natten/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/nas/homes/user/algoritmi/DiNAT_seg/nat.py", line 145, in forward
    x = self.attn(x)
  File "/nas/homes/user/envs/trufor_natten/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/nas/homes/user/envs/trufor_natten/lib/python3.7/site-packages/natten/natten2d.py", line 92, in forward
    attn = attn.softmax(dim=-1)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "my_test.py", line 118, in <module>
    pred = model(img)
  File "/nas/homes/user/envs/trufor_natten/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/nas/homes/user/envs/trufor_natten/lib/python3.7/site-packages/torch/nn/modules/container.py", line 141, in forward
    input = module(input)
  File "/nas/homes/user/envs/trufor_natten/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/nas/homes/user/envs/trufor_natten/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 447, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/nas/homes/user/envs/trufor_natten/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 444, in _conv_forward
    self.padding, self.dilation, self.groups)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

Error [2]

File "my_test.py", line 82, in <module>
    img = img.to(device)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.

The piece of inference code where the error occurs is simply:

for index, (img, path) in enumerate(tqdm(testloader)):
    img = img.to(device)
    model.eval()
    pred = model(img)

natten version: 0.14.6+torch1110cu113
pytorch version: 1.11.0
cudatoolkit version: 1.13.1

AttributeError: module 'natten._C' has no attribute 'na3d_qk_forward'

Sorry to bother you on the weekend.
Hello, this is really nice work; I would like to ask a question.
I followed the instructions to install the natten package: pip3 install natten -f https://shi-labs.com/natten/wheels/cu116/torch1.12.1/index.html
When I run from natten import NeighborhoodAttention3D, it shows ImportError: cannot import name 'NeighborhoodAttention3D' from 'natten', but 2D works.
I also downloaded the zip file from your GitHub and ran it again, and it shows AttributeError: module 'natten._C' has no attribute 'na3d_qk_forward'. I don't know whether the 3D version hasn't been added to the natten package yet, or whether I did something wrong.
Thanks a lot. Looking forward to your reply.

Support for bf16 fp16

Hi, I cannot find any information about support for bf16 and fp16. Does the current library support either of them?

I found the information in the catalog. When is bf16 going to be supported?
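For context, the tracebacks elsewhere on this page show that NATTEN's functional ops are wrapped in torch amp autocast decorators, so mixed precision is typically driven from the PyTorch side. A hedged sketch, subject to the architecture requirements in the feature-availability notes above (FP16 needs SM50/SM60+ depending on backend; BF16 needs SM80+):

import torch
from natten import NeighborhoodAttention2D

na = NeighborhoodAttention2D(dim=128, num_heads=4, kernel_size=7).cuda()
x = torch.randn(1, 56, 56, 128, device="cuda")

# Use torch.bfloat16 instead on SM80+ if your install supports it.
with torch.autocast("cuda", dtype=torch.float16):
    y = na(x)
print(y.dtype)  # torch.float16 under autocast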

Building from sources

I am trying to build natten from source in WSL Ubuntu. I followed your instructions and made a fresh environment:

$ conda create -n natten python=3.11
$ conda activate natten
$ git clone https://github.com/SHI-Labs/NATTEN
$ cd NATTEN
$ pip install -r requirements.txt
$ pip install -e .

This gives the error:

In file included from /home/daniel/NATTEN/csrc/autogen/src/cuda/gemm/1d/sm80/source_0.cu:3:
/home/daniel/NATTEN/csrc/./include/natten/cuda/gemm/na1d.cuh:32:10: fatal error: cutlass/cutlass.h: No such file or directory
32 | #include <cutlass/cutlass.h>

Did you forget to include the cutlass headers in the project's build system?

pytorch version for latest natten

Thank you for your great work! May I ask whether PyTorch 1.10.1 can install the latest version of NATTEN with GEMM? Does it necessarily require a Python version greater than 3.8, or is 3.7 okay? If there are any other strict version requirements that differ from previous NATTEN versions, please tell me as well. Really appreciate it!

`natten` 0.14.5 (CPU) issue regarding `natten2dqkrpb`

Dear NATTEN author(s)/team:

Since 0.14.5, our CI (transformers) encounters a problem with natten2dqkrpb, which reports missing 1 required positional argument: 'dilation'. With 0.14.4 there is no such problem.

I can try to provide a (short) code snippet (without using the testing code in transformers), but I am wondering whether a breaking change was introduced in natten==0.14.5. Thank you :-)

Here is the error message we got

        # Compute NA between "query" and "key" to get the raw attention scores, and add relative positional biases.
>       attention_scores = natten2dqkrpb(query_layer, key_layer, self.rpb, 1)
E       TypeError: natten2dqkrpb() missing 1 required positional argument: 'dilation'

AttributeError

Hello, I would like to ask whether this error is due to a problem with the natten installation:
AttributeError: module 'natten._C' has no attribute 'na2d_qk_forward'

Memory bottleneck of NATTEN

I am testing the memory usage of NATTEN. Here is one memory snapshot, captured with torch.cuda.memory:

(screenshot: memory snapshot)

My code is here:

import torch
from natten import NeighborhoodAttention2D

# N, H, W, D and device are defined elsewhere in my script.
model = NeighborhoodAttention2D(dim=D, kernel_size=25, dilation=1, num_heads=4).to(device)
optimizer = torch.optim.Adam(model.parameters())

def train(model, optimizer):
    fake_input = torch.randn(N, H, W, D, dtype=torch.float32).to(device)
    out = model.forward(fake_input)
    loss = out**2
    loss.sum().backward()
    optimizer.step()
    optimizer.zero_grad()

Separate arguments for query and key

I am using NeighborhoodAttention2D, but it only takes one argument during the forward pass. I want to use it in a decoder, so I need to pass separate key and query; these have the same dimensions in my case. I modified the module like this:

import torch
from torch import nn
from torch.nn.functional import pad
from torch.nn.init import trunc_normal_
from natten.functional import natten2dav, natten2dqkrpb


class NeighborhoodAttention2D(nn.Module):
    """
    Neighborhood Attention 2D Module (modified to take separate query and key/value inputs)
    """

    def __init__(
        self,
        dim,
        num_heads,
        kernel_size,
        dilation=1,
        bias=True,
        qkv_bias=True,
        qk_scale=None,
        attn_drop=0.0,
        proj_drop=0.0,
    ):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // self.num_heads
        self.scale = qk_scale or self.head_dim**-0.5
        assert (
            kernel_size > 1 and kernel_size % 2 == 1
        ), f"Kernel size must be an odd number greater than 1, got {kernel_size}."
        self.kernel_size = kernel_size
        assert (
            dilation is None or dilation >= 1
        ), f"Dilation must be greater than or equal to 1, got {dilation}."
        self.dilation = dilation or 1
        self.window_size = self.kernel_size * self.dilation

        # Separate projections so query can come from a different source
        # than key/value.
        self.W_q = nn.Linear(dim, dim, bias=qkv_bias)
        self.W_k = nn.Linear(dim, dim, bias=qkv_bias)
        self.W_v = nn.Linear(dim, dim, bias=qkv_bias)

        if bias:
            self.rpb = nn.Parameter(
                torch.zeros(num_heads, (2 * kernel_size - 1), (2 * kernel_size - 1))
            )
            trunc_normal_(self.rpb, std=0.02, mean=0.0, a=-2.0, b=2.0)
        else:
            self.register_parameter("rpb", None)
        self.attn_drop = nn.Dropout(attn_drop)
        self.proj = nn.Linear(dim, dim)
        self.proj_drop = nn.Dropout(proj_drop)

    def forward(self, query, key_value):
        B, Hp, Wp, C = query.shape  # (B, H, W, d_m)
        H, W = int(Hp), int(Wp)
        pad_l = pad_t = pad_r = pad_b = 0
        if H < self.window_size or W < self.window_size:
            pad_l = pad_t = 0
            pad_r = max(0, self.window_size - W)
            pad_b = max(0, self.window_size - H)
            # Pad both inputs (the original module padded a single input x).
            query = pad(query, (0, 0, pad_l, pad_r, pad_t, pad_b))
            key_value = pad(key_value, (0, 0, pad_l, pad_r, pad_t, pad_b))
            _, H, W, _ = query.shape

        # Create queries, keys and values: (B, heads, H, W, head_dim).
        q = self.W_q(query).view(B, H, W, self.num_heads, self.head_dim).permute(0, 3, 1, 2, 4)
        k = self.W_k(key_value).view(B, H, W, self.num_heads, self.head_dim).permute(0, 3, 1, 2, 4)
        v = self.W_v(key_value).view(B, H, W, self.num_heads, self.head_dim).permute(0, 3, 1, 2, 4)

        q = q * self.scale
        attn = natten2dqkrpb(q, k, self.rpb, self.kernel_size, self.dilation)
        attn = attn.softmax(dim=-1)
        attn = self.attn_drop(attn)
        x = natten2dav(attn, v, self.kernel_size, self.dilation)
        x = x.permute(0, 2, 3, 1, 4).reshape(B, H, W, C)
        if pad_r or pad_b:
            x = x[:, :Hp, :Wp, :]

        return self.proj_drop(self.proj(x))

    def extra_repr(self) -> str:
        return (
            f"head_dim={self.head_dim}, num_heads={self.num_heads}, "
            + f"kernel_size={self.kernel_size}, dilation={self.dilation}, "
            + f"rel_pos_bias={self.rpb is not None}"
        )

About the parameter settings of the NeighborhoodAttention2D module

Could you please explain the meaning of dilation, window_size, and kernel_size in the NeighborhoodAttention2D module? I noticed that when computing neighborhood attention, only the dilation parameter needs to be passed. Does this parameter represent the neighborhood size?
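For reference, the relationship between the three is visible in the module code quoted in another issue on this page: kernel_size is the neighborhood size per axis (how many keys each query attends to along H and W), dilation spaces those neighbors apart, and window_size is derived from the two:

kernel_size, dilation = 7, 2
window_size = kernel_size * dilation  # 14: each input axis should be at least this large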

Does not support kernel sizes greater than 13

natten2dqkrpb_cuda_forward does not support kernel sizes greater than 13; the maximum kernel size it works with is 13. Is this expected?

>>> import torch
>>> from natten import NeighborhoodAttention2D
>>> 
>>> na2d = NeighborhoodAttention2D(dim=128, kernel_size=15, dilation=2, num_heads=2)
>>> na2d = na2d.to('cuda:0')
>>> 
>>> x = torch.randn(4, 256, 256, 128)
>>> x = x.to('cuda:0')
>>> 
>>> y = na2d(x)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ec2-user/.local/lib/python3.9/site-packages/natten/natten2d.py", line 63, in forward
    attn = natten2dqkrpb(q, k, self.rpb, dilation)
  File "/home/ec2-user/.local/lib/python3.9/site-packages/natten/functional.py", line 145, in natten2dqkrpb
    return NATTEN2DQKRPBFunction.apply(query, key, rpb, dilation)
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/cuda/amp/autocast_mode.py", line 118, in decorate_fwd
    return fwd(*args, **kwargs)
  File "/home/ec2-user/.local/lib/python3.9/site-packages/natten/functional.py", line 90, in forward
    attn = _C.natten2dqkrpb_forward(
RuntimeError: natten2dqkrpb_cuda_forward does not support kernel size 15

test_autograd_cuda_gemm (test_na1d.NA1DTests) ... NATTEN failure: cutlass error: Error Internal at: 100

Hi, your work is great! May I ask you a question?
I tried to build NATTEN on a V100 but failed. My environment is python=3.8, torch=1.10.1, cuda=11.3.
I built NATTEN from source and ran "make WORKERS=2", but when I run "make test", it shows:
Running unit tests
python -m unittest discover -v -s ./tests
test_autograd_cpu (test_na1d.NA1DTests) ... ok
test_autograd_cuda_gemm (test_na1d.NA1DTests) ... NATTEN failure: cutlass error: Error Internal at: 100

Can you help me find out what's causing this?

How can see the details of natten's attention score calculation (q · k)

Sorry to bother you during the weekend.
This is really nice work, but I would like to ask some questions.
First, I want to know how this statement works:
from natten import _C
Does this import natten's implementation from a dynamic-link library?
I also want to know how I can see the specific calculation details of the attention score, which PyCharm cannot display for me, like this:
attn = _C.natten2dqkrpb_forward(query, key, rpb, kernel_size, dilation)

def natten2dqkrpb_forward(arg0, arg1, arg2, torch_Tensor=None, *args, **kwargs):  # real signature unknown; NOTE: unreliably restored from doc
    """
    natten2dqkrpb_forward(arg0: torch.Tensor, arg1: torch.Tensor, arg2: Optional[torch.Tensor], arg3: int, arg4: int) -> torch.Tensor

    NATTEN2DQK+RPB forward
    """
    pass

The header annotation of the _C.py stub file is:

# encoding: utf-8
# module natten._C
# from /mnt/c/Users/zzz/AppData/Local/JetBrains/PyCharm2021.3/remote_sources/380714796/-89893896/natten/_C.cpython-38-x86_64-linux-gnu.so
# by generator 1.147
# no doc
# no imports
What should I do if I want to see the specific details of the attention score calculation?
Looking forward to your reply!

CUDA version (12.0)

Why is CUDA version 12.0 not allowed?

PyTorch built for CUDA 11.8 runs on CUDA 12.0 drivers and currently has no compatibility issues.

Wheel file for torch 2.0 (cpu/cu117/cu118)

Dear NATTEN author/team,

torch 2.0 is going to be released soon (I believe this week). I am wondering whether there are already wheel files for natten with torch 2.0, or whether such wheels will appear soon after the torch 2.0 release.

Thank you 🤗

NATTEN is not compiled with CUDA!

Thanks for your outstanding work. I'm currently trying to implement natten in my project, but I've come across an issue:

File "/data/Projects/NATTEN/env_nat/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/data/Projects/NATTEN/env_nat/lib/python3.8/site-packages/natten/natten2d.py", line 91, in forward
attn = natten2dqkrpb(q, k, self.rpb, self.kernel_size, self.dilation)
File "/data/Projects/NATTEN/env_nat/lib/python3.8/site-packages/natten/functional.py", line 186, in natten2dqkrpb
return NATTEN2DQKRPBFunction.apply(query, key, rpb, kernel_size, dilation)
File "/data/Projects/NATTEN/env_nat/lib/python3.8/site-packages/torch/cuda/amp/autocast_mode.py", line 118, in decorate_fwd
return fwd(*args, **kwargs)
File "/data/Projects/NATTEN/env_nat/lib/python3.8/site-packages/natten/functional.py", line 119, in forward
attn = _C.natten2dqkrpb_forward(query, key, rpb, kernel_size, dilation)
RuntimeError: NATTEN is not compiled with CUDA! Please make sure you installed correctly by referring to shi-labs.com/natten.

Here are the versions of the corresponding packages:

natten 0.14.6
torch 1.11.0+cu102

Python 3.8.10
torch.version '1.11.0+cu102'
torch._C._cuda_getCompiledVersion() 10020

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Wed_Oct_23_19:24:38_PDT_2019
Cuda compilation tools, release 10.2, V10.2.89

natten is installed with:

pip3 install natten -f https://shi-labs.com/natten/wheels/cu102/torch1.11.0/index.html

Could you please help me with this?

CUDA error: out of memory

When I run the following code:

import torch
from natten import NeighborhoodAttention2D

device = torch.device("cuda:5")
na2d = NeighborhoodAttention2D(dim=128, kernel_size=7, dilation=2, num_heads=4).to(device)
x = torch.rand((1, 64, 64, 128)).to(device)
y = na2d(x)

I get an OOM error:

attn = attn.softmax(dim=-1)
RuntimeError: CUDA error: out of memory

Here is the output of nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.141.03   Driver Version: 470.141.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:1A:00.0 Off |                  N/A |
| 78%   68C    P2   321W / 350W |  23294MiB / 24268MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  On   | 00000000:1B:00.0 Off |                  N/A |
|100%   81C    P2   326W / 350W |   8768MiB / 24268MiB |     65%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce ...  On   | 00000000:1C:00.0 Off |                  N/A |
| 90%   74C    P2   311W / 350W |  21104MiB / 24268MiB |     73%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce ...  On   | 00000000:1D:00.0 Off |                  N/A |
|100%   80C    P2   324W / 350W |  21104MiB / 24268MiB |     73%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA GeForce ...  On   | 00000000:1E:00.0 Off |                  N/A |
| 94%   76C    P2   318W / 350W |  21104MiB / 24268MiB |     98%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA GeForce ...  On   | 00000000:3D:00.0 Off |                  N/A |
| 31%   39C    P8    22W / 350W |      3MiB / 24268MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA GeForce ...  On   | 00000000:3F:00.0 Off |                  N/A |
| 90%   73C    P2   318W / 350W |  21104MiB / 24268MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA GeForce ...  On   | 00000000:40:00.0 Off |                  N/A |
| 87%   74C    P2   316W / 350W |  21104MiB / 24268MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   8  NVIDIA GeForce ...  On   | 00000000:41:00.0 Off |                  N/A |
| 82%   71C    P2   314W / 350W |  11930MiB / 24268MiB |     60%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

Window size choices

I am confused about why the window (kernel) size in NATTEN must be an odd number.

        assert (
            kernel_size > 1 and kernel_size % 2 == 1
        ), f"Kernel size must be an odd number greater than 1, got {kernel_size}."

Trouble Installing

Hi, I was wondering if someone could help with installing this. I am getting the issues below when trying to install.

pip3 install natten -f https://shi-labs.com/natten/wheels/cu101/torch1.12/index.html
Looking in links: https://shi-labs.com/natten/wheels/cu101/torch1.12/index.html
Collecting natten
  Using cached natten-0.14.2-4.tar.gz (11 kB)
  Preparing metadata (setup.py) ... done
Discarding https://files.pythonhosted.org/packages/61/5c/41704bc27a54c0a3cf5bc2a7a98b488382524a1b68cf395e25d2591dea55/natten-0.14.2-4.tar.gz (from https://pypi.org/simple/natten/) (requires-python:>=3.7): Requested natten from https://files.pythonhosted.org/packages/61/5c/41704bc27a54c0a3cf5bc2a7a98b488382524a1b68cf395e25d2591dea55/natten-0.14.2-4.tar.gz has inconsistent version: expected '0.14.2.post4', but metadata has '0.14.2+torch112cu102'
ERROR: Could not find a version that satisfies the requirement natten (from versions: 0.14.1, 0.14.2.post4)
ERROR: No matching distribution found for natten

I have also tried using pip's legacy resolver, which leads to the issues below. Building from source leads to the same issue.

pip3 install natten --use-deprecated=legacy-resolver -f https://shi-labs.com/natten/wheels/cu101/torch1.12/index.html
Looking in links: https://shi-labs.com/natten/wheels/cu101/torch1.12/index.html
Collecting natten
  Using cached natten-0.14.2-4.tar.gz (11 kB)
  Preparing metadata (setup.py) ... done
Collecting packaging
  Using cached packaging-21.3-py3-none-any.whl (40 kB)
Collecting pyparsing!=3.0.5,>=2.0.2
  Using cached pyparsing-3.0.9-py3-none-any.whl (98 kB)
Building wheels for collected packages: natten
  Building wheel for natten (setup.py) ... error
  error: subprocess-exited-with-error

  × python setup.py bdist_wheel did not run successfully.
  │ exit code: 1
  ╰─> [88 lines of output]
      running bdist_wheel
      running build
      running build_py
      creating build
      creating build/lib.linux-x86_64-cpython-310
      creating build/lib.linux-x86_64-cpython-310/natten
      copying natten/__init__.py -> build/lib.linux-x86_64-cpython-310/natten
      copying natten/flops.py -> build/lib.linux-x86_64-cpython-310/natten
      copying natten/functional.py -> build/lib.linux-x86_64-cpython-310/natten
      copying natten/natten1d.py -> build/lib.linux-x86_64-cpython-310/natten
      copying natten/natten2d.py -> build/lib.linux-x86_64-cpython-310/natten
      running build_ext
      building 'natten._C' extension
      creating /tmp/pip-install-3gvwtk3x/natten/build/temp.linux-x86_64-cpython-310
      creating /tmp/pip-install-3gvwtk3x/natten/build/temp.linux-x86_64-cpython-310/tmp
      creating /tmp/pip-install-3gvwtk3x/natten/build/temp.linux-x86_64-cpython-310/tmp/pip-install-3gvwtk3x
      creating /tmp/pip-install-3gvwtk3x/natten/build/temp.linux-x86_64-cpython-310/tmp/pip-install-3gvwtk3x/natten
      creating /tmp/pip-install-3gvwtk3x/natten/build/temp.linux-x86_64-cpython-310/tmp/pip-install-3gvwtk3x/natten/natten
      creating /tmp/pip-install-3gvwtk3x/natten/build/temp.linux-x86_64-cpython-310/tmp/pip-install-3gvwtk3x/natten/natten/src
      Emitting ninja build file /tmp/pip-install-3gvwtk3x/natten/build/temp.linux-x86_64-cpython-310/build.ninja...
      Compiling objects...
      Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
      ninja: error: '/tmp/pip-install-3gvwtk3x/natten/natten/src/natten.cpp', needed by '/tmp/pip-install-3gvwtk3x/natten/build/temp.linux-x86_64-cpython-310/tmp/pip-install-3gvwtk3x/natten/natten/src/natten.o', missing and no known rule to make it
      Traceback (most recent call last):
        File "/home/ciaran.byrne/anaconda3/envs/natten/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1808, in _run_ninja_build
          subprocess.run(
        File "/home/ciaran.byrne/anaconda3/envs/natten/lib/python3.10/subprocess.py", line 524, in run
          raise CalledProcessError(retcode, process.args,
      subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

      The above exception was the direct cause of the following exception:

      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/tmp/pip-install-3gvwtk3x/natten/setup.py", line 105, in <module>
          setup(
        File "/home/ciaran.byrne/anaconda3/envs/natten/lib/python3.10/site-packages/setuptools/__init__.py", line 87, in setup
          return distutils.core.setup(**attrs)
        File "/home/ciaran.byrne/anaconda3/envs/natten/lib/python3.10/site-packages/setuptools/_distutils/core.py", line 185, in setup
          return run_commands(dist)
        File "/home/ciaran.byrne/anaconda3/envs/natten/lib/python3.10/site-packages/setuptools/_distutils/core.py", line 201, in run_commands
          dist.run_commands()
        File "/home/ciaran.byrne/anaconda3/envs/natten/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 973, in run_commands
          self.run_command(cmd)
        File "/home/ciaran.byrne/anaconda3/envs/natten/lib/python3.10/site-packages/setuptools/dist.py", line 1217, in run_command
          super().run_command(command)
        File "/home/ciaran.byrne/anaconda3/envs/natten/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 992, in run_command
          cmd_obj.run()
        File "/home/ciaran.byrne/anaconda3/envs/natten/lib/python3.10/site-packages/wheel/bdist_wheel.py", line 299, in run
          self.run_command('build')
        File "/home/ciaran.byrne/anaconda3/envs/natten/lib/python3.10/site-packages/setuptools/_distutils/cmd.py", line 319, in run_command
          self.distribution.run_command(command)
        File "/home/ciaran.byrne/anaconda3/envs/natten/lib/python3.10/site-packages/setuptools/dist.py", line 1217, in run_command
          super().run_command(command)
        File "/home/ciaran.byrne/anaconda3/envs/natten/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 992, in run_command
          cmd_obj.run()
        File "/home/ciaran.byrne/anaconda3/envs/natten/lib/python3.10/site-packages/setuptools/command/build.py", line 24, in run
          super().run()
        File "/home/ciaran.byrne/anaconda3/envs/natten/lib/python3.10/site-packages/setuptools/_distutils/command/build.py", line 132, in run
          self.run_command(cmd_name)
        File "/home/ciaran.byrne/anaconda3/envs/natten/lib/python3.10/site-packages/setuptools/_distutils/cmd.py", line 319, in run_command
          self.distribution.run_command(command)
        File "/home/ciaran.byrne/anaconda3/envs/natten/lib/python3.10/site-packages/setuptools/dist.py", line 1217, in run_command
          super().run_command(command)
        File "/home/ciaran.byrne/anaconda3/envs/natten/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 992, in run_command
          cmd_obj.run()
        File "/home/ciaran.byrne/anaconda3/envs/natten/lib/python3.10/site-packages/setuptools/command/build_ext.py", line 79, in run
          _build_ext.run(self)
        File "/home/ciaran.byrne/anaconda3/envs/natten/lib/python3.10/site-packages/setuptools/_distutils/command/build_ext.py", line 346, in run
          self.build_extensions()
        File "/home/ciaran.byrne/anaconda3/envs/natten/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 765, in build_extensions
          build_ext.build_extensions(self)
        File "/home/ciaran.byrne/anaconda3/envs/natten/lib/python3.10/site-packages/setuptools/_distutils/command/build_ext.py", line 466, in build_extensions
          self._build_extensions_serial()
        File "/home/ciaran.byrne/anaconda3/envs/natten/lib/python3.10/site-packages/setuptools/_distutils/command/build_ext.py", line 492, in _build_extensions_serial
          self.build_extension(ext)
        File "/home/ciaran.byrne/anaconda3/envs/natten/lib/python3.10/site-packages/setuptools/command/build_ext.py", line 202, in build_extension
          _build_ext.build_extension(self, ext)
        File "/home/ciaran.byrne/anaconda3/envs/natten/lib/python3.10/site-packages/setuptools/_distutils/command/build_ext.py", line 547, in build_extension
          objects = self.compiler.compile(
        File "/home/ciaran.byrne/anaconda3/envs/natten/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 586, in unix_wrap_ninja_compile
          _write_ninja_file_and_compile_objects(
        File "/home/ciaran.byrne/anaconda3/envs/natten/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1487, in _write_ninja_file_and_compile_objects
          _run_ninja_build(
        File "/home/ciaran.byrne/anaconda3/envs/natten/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1824, in _run_ninja_build
          raise RuntimeError(message) from e
      RuntimeError: Error compiling objects for extension
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for natten
  Running setup.py clean for natten
Failed to build natten
Installing collected packages: pyparsing, packaging, natten
  Running setup.py install for natten ... error
  error: subprocess-exited-with-error

  × Running setup.py install for natten did not run successfully.
  │ exit code: 1
  ╰─> [92 lines of output]
      running install
      /home/ciaran.byrne/anaconda3/envs/natten/lib/python3.10/site-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
        warnings.warn(
      running build
      running build_py
      creating build
      creating build/lib.linux-x86_64-cpython-310
      creating build/lib.linux-x86_64-cpython-310/natten
      copying natten/__init__.py -> build/lib.linux-x86_64-cpython-310/natten
      copying natten/flops.py -> build/lib.linux-x86_64-cpython-310/natten
      copying natten/functional.py -> build/lib.linux-x86_64-cpython-310/natten
      copying natten/natten1d.py -> build/lib.linux-x86_64-cpython-310/natten
      copying natten/natten2d.py -> build/lib.linux-x86_64-cpython-310/natten
      running build_ext
      building 'natten._C' extension
      creating /tmp/pip-install-3gvwtk3x/natten/build/temp.linux-x86_64-cpython-310
      creating /tmp/pip-install-3gvwtk3x/natten/build/temp.linux-x86_64-cpython-310/tmp
      creating /tmp/pip-install-3gvwtk3x/natten/build/temp.linux-x86_64-cpython-310/tmp/pip-install-3gvwtk3x
      creating /tmp/pip-install-3gvwtk3x/natten/build/temp.linux-x86_64-cpython-310/tmp/pip-install-3gvwtk3x/natten
      creating /tmp/pip-install-3gvwtk3x/natten/build/temp.linux-x86_64-cpython-310/tmp/pip-install-3gvwtk3x/natten/natten
      creating /tmp/pip-install-3gvwtk3x/natten/build/temp.linux-x86_64-cpython-310/tmp/pip-install-3gvwtk3x/natten/natten/src
      Emitting ninja build file /tmp/pip-install-3gvwtk3x/natten/build/temp.linux-x86_64-cpython-310/build.ninja...
      Compiling objects...
      Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
      ninja: error: '/tmp/pip-install-3gvwtk3x/natten/natten/src/natten.cpp', needed by '/tmp/pip-install-3gvwtk3x/natten/build/temp.linux-x86_64-cpython-310/tmp/pip-install-3gvwtk3x/natten/natten/src/natten.o', missing and no known rule to make it
      Traceback (most recent call last):
        File "/home/ciaran.byrne/anaconda3/envs/natten/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1808, in _run_ninja_build
          subprocess.run(
        File "/home/ciaran.byrne/anaconda3/envs/natten/lib/python3.10/subprocess.py", line 524, in run
          raise CalledProcessError(retcode, process.args,
      subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

      The above exception was the direct cause of the following exception:

      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/tmp/pip-install-3gvwtk3x/natten/setup.py", line 105, in <module>
          setup(
        File "/home/ciaran.byrne/anaconda3/envs/natten/lib/python3.10/site-packages/setuptools/__init__.py", line 87, in setup
          return distutils.core.setup(**attrs)
        File "/home/ciaran.byrne/anaconda3/envs/natten/lib/python3.10/site-packages/setuptools/_distutils/core.py", line 185, in setup
          return run_commands(dist)
        File "/home/ciaran.byrne/anaconda3/envs/natten/lib/python3.10/site-packages/setuptools/_distutils/core.py", line 201, in run_commands
          dist.run_commands()
        File "/home/ciaran.byrne/anaconda3/envs/natten/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 973, in run_commands
          self.run_command(cmd)
        File "/home/ciaran.byrne/anaconda3/envs/natten/lib/python3.10/site-packages/setuptools/dist.py", line 1217, in run_command
          super().run_command(command)
        File "/home/ciaran.byrne/anaconda3/envs/natten/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 992, in run_command
          cmd_obj.run()
        File "/home/ciaran.byrne/anaconda3/envs/natten/lib/python3.10/site-packages/setuptools/command/install.py", line 68, in run
          return orig.install.run(self)
        File "/home/ciaran.byrne/anaconda3/envs/natten/lib/python3.10/site-packages/setuptools/_distutils/command/install.py", line 692, in run
          self.run_command('build')
        File "/home/ciaran.byrne/anaconda3/envs/natten/lib/python3.10/site-packages/setuptools/_distutils/cmd.py", line 319, in run_command
          self.distribution.run_command(command)
        File "/home/ciaran.byrne/anaconda3/envs/natten/lib/python3.10/site-packages/setuptools/dist.py", line 1217, in run_command
          super().run_command(command)
        File "/home/ciaran.byrne/anaconda3/envs/natten/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 992, in run_command
          cmd_obj.run()
        File "/home/ciaran.byrne/anaconda3/envs/natten/lib/python3.10/site-packages/setuptools/command/build.py", line 24, in run
          super().run()
        File "/home/ciaran.byrne/anaconda3/envs/natten/lib/python3.10/site-packages/setuptools/_distutils/command/build.py", line 132, in run
          self.run_command(cmd_name)
        File "/home/ciaran.byrne/anaconda3/envs/natten/lib/python3.10/site-packages/setuptools/_distutils/cmd.py", line 319, in run_command
          self.distribution.run_command(command)
        File "/home/ciaran.byrne/anaconda3/envs/natten/lib/python3.10/site-packages/setuptools/dist.py", line 1217, in run_command
          super().run_command(command)
        File "/home/ciaran.byrne/anaconda3/envs/natten/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 992, in run_command
          cmd_obj.run()
        File "/home/ciaran.byrne/anaconda3/envs/natten/lib/python3.10/site-packages/setuptools/command/build_ext.py", line 79, in run
          _build_ext.run(self)
        File "/home/ciaran.byrne/anaconda3/envs/natten/lib/python3.10/site-packages/setuptools/_distutils/command/build_ext.py", line 346, in run
          self.build_extensions()
        File "/home/ciaran.byrne/anaconda3/envs/natten/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 765, in build_extensions
          build_ext.build_extensions(self)
        File "/home/ciaran.byrne/anaconda3/envs/natten/lib/python3.10/site-packages/setuptools/_distutils/command/build_ext.py", line 466, in build_extensions
          self._build_extensions_serial()
        File "/home/ciaran.byrne/anaconda3/envs/natten/lib/python3.10/site-packages/setuptools/_distutils/command/build_ext.py", line 492, in _build_extensions_serial
          self.build_extension(ext)
        File "/home/ciaran.byrne/anaconda3/envs/natten/lib/python3.10/site-packages/setuptools/command/build_ext.py", line 202, in build_extension
          _build_ext.build_extension(self, ext)
        File "/home/ciaran.byrne/anaconda3/envs/natten/lib/python3.10/site-packages/setuptools/_distutils/command/build_ext.py", line 547, in build_extension
          objects = self.compiler.compile(
        File "/home/ciaran.byrne/anaconda3/envs/natten/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 586, in unix_wrap_ninja_compile
          _write_ninja_file_and_compile_objects(
        File "/home/ciaran.byrne/anaconda3/envs/natten/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1487, in _write_ninja_file_and_compile_objects
          _run_ninja_build(
        File "/home/ciaran.byrne/anaconda3/envs/natten/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1824, in _run_ninja_build
          raise RuntimeError(message) from e
      RuntimeError: Error compiling objects for extension
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: legacy-install-failure

× Encountered error while trying to install package.
╰─> natten

note: This is an issue with the package mentioned above, not pip.
hint: See above for output from the failure.

Comparison to FlashAttention

Hi, and thanks for this great project.

I was curious whether NATTEN incorporates any of the hardware optimizations from FlashAttention. How does its performance compare to FA for a given set of (comparable) hyperparameters? The paper only mentions FA in a side note, but I believe the general approach should be applicable to neighborhood attention as well. However, I understand that introducing such low-level optimizations can be challenging and time-intensive.

NeighborhoodAttention3D

When I use the NeighborhoodAttention3D class, the following error is reported: "AttributeError: module 'natten._C' has no attribute 'na3d_qk_forward'". I have installed natten, but there is no "na3d_qk_forward" in the compiled _C extension.

NATTEN TO ONNX

Hi,
I'm trying to convert a model to ONNX, but the natten module cannot be converted. Is there any solution for this kind of problem?

The issue arises when the export reaches:

_C.natten2dav_forward(
    attn,
    value,
    dilation)

and other functions as well.
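
For reference, a minimal repro sketch in Python (the constructor arguments below are illustrative, and the module name is taken from NATTEN's Python API; the exact signature may differ across versions). Export fails because NATTEN's ops are custom C++ extension functions with no ONNX symbolic registered for them:

import torch
from natten import NeighborhoodAttention2D

# Illustrative hyperparameters; NATTEN's 2D module expects [B, H, W, C] inputs.
model = NeighborhoodAttention2D(dim=64, kernel_size=7, dilation=1, num_heads=4).eval()
x = torch.randn(1, 32, 32, 64)

# Raises during tracing once a custom op such as _C.natten2dav_forward is
# reached, since torch.onnx has no symbolic function registered for it.
torch.onnx.export(model, x, "na2d.onnx", opset_version=17)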

Implementation details

I'm writing my own CUDA extension using your awesome code as the template. Here are some questions about the implementation details:

(1) Why do you use torch::DefaultPtrTraits instead of torch::RestrictPtrTraits, as recommended by the official PyTorch extension guide? Is it for performance reasons?

using Tensor5D = typename torch::PackedTensorAccessor32<scalar_t,5,torch::DefaultPtrTraits>;

(2) Why do you calculate offsets manually instead of using the indexing operator [] as done in the official guide?

int weightsOffset = b * weights.stride(0) + h * weights.stride(1) + i * weights.stride(2) + j * weights.stride(3);
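
On (2), the manual offset is equivalent to accessor indexing on a strided tensor; here is a minimal PyTorch sketch (illustrative, not NATTEN code) demonstrating the equivalence:

import torch

# PyTorch strides are measured in elements, so a manually computed
# offset indexes the flattened storage exactly like w[b, h, i, j].
w = torch.arange(2 * 3 * 4 * 5).reshape(2, 3, 4, 5)
b, h, i, j = 1, 2, 3, 4
offset = b * w.stride(0) + h * w.stride(1) + i * w.stride(2) + j * w.stride(3)
assert w.flatten()[offset] == w[b, h, i, j]

In a CUDA kernel, computing one base offset and reusing it across a loop avoids repeated index arithmetic, which is a common reason to prefer manual offsets over per-element accessor indexing.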

Is there a requirement on tensor size for Neighborhood Attention?

I would like to ask about an error that occurs when I replace the transformer's self-attention with NATTEN:

File "/root/autodl-tmp/pycharm_project_983/basicsr/archs/my_arch.py", line 131, in forward
x = self.attn(x)
File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/root/.virtualenvs/pythonProject/lib/python3.8/site-packages/natten/natten2d.py", line 95, in forward
attn = natten2dqkrpb(q, k, self.rpb, self.kernel_size, self.dilation)
File "/root/.virtualenvs/pythonProject/lib/python3.8/site-packages/natten/functional.py", line 186, in natten2dqkrpb
return NATTEN2DQKRPBFunction.apply(query, key, rpb, kernel_size, dilation)
File "/root/miniconda3/lib/python3.8/site-packages/torch/autograd/function.py", line 506, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/root/miniconda3/lib/python3.8/site-packages/torch/cuda/amp/autocast_mode.py", line 106, in decorate_fwd
return fwd(*args, **kwargs)
File "/root/.virtualenvs/pythonProject/lib/python3.8/site-packages/natten/functional.py", line 119, in forward
attn = _C.natten2dqkrpb_forward(query, key, rpb, kernel_size, dilation)
RuntimeError: numel needs to be smaller than int32_t max; otherwise, please use packed_accessor64
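
The error message states the constraint directly: the kernel indexes tensors with 32-bit accessors (packed_accessor32), so each tensor's element count must stay below the int32 maximum. A minimal pre-flight check (illustrative, not part of NATTEN):

import torch

INT32_MAX = 2**31 - 1  # limit implied by packed_accessor32

def fits_in_int32(*tensors: torch.Tensor) -> bool:
    # Every tensor passed to the op must have numel() below the limit.
    return all(t.numel() < INT32_MAX for t in tensors)

# Example: check q and k before calling the attention module; if the
# check fails, split the batch (or spatial extent) into smaller chunks.
q = torch.randn(4, 8, 128, 128, 32)
k = torch.randn(4, 8, 128, 128, 32)
print(fits_in_int32(q, k))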

module 'natten' has no attribute 'has_cuda'

Hi!

Thanks for sharing the code.
I have installed the NATTEN extension successfully, and modules like NeighborhoodAttention2D can be imported.

However, whenever I run

import natten
print(natten.has_cuda())

it always fails with "module 'natten' has no attribute 'has_cuda'".
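
A quick diagnostic sketch (plain Python introspection; nothing NATTEN-specific is assumed) to see which status helpers the installed version actually exposes:

import natten

# Print the installed version, if the module exposes one, and every
# "has_*" helper it defines; helper names vary across NATTEN releases.
print(getattr(natten, "__version__", "unknown version"))
print([name for name in dir(natten) if name.startswith("has_")])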

Does that mean I am not using the extension to accelerate the operation?
Thanks a lot.

Can't build from source on Windows

Hello,

I have been trying to install natten from source on Windows, unfortunately without success, and I have tried a lot of different approaches. It would be nice if you could help me with this. Here is the error I'm getting.

(screenshot of the build error attached)

Best regards,
Roberts

NA with a mask for the tensor?

Thanks for your great work! I am experiencing some difficulties and would very much like your advice and help.

I am considering ways to reduce the number of attention computations. In my setup, I have a 2D tensor of shape [N, H, W, C], along with a boolean mask of shape [N, H, W, 1] in which a value of 1 marks a valid position in the tensor.

To reduce the attention computation, one idea I have is to exclude pixels at positions where mask=0 from serving as q/k/v, thereby lowering computational complexity. I am unsure whether this can be implemented in NA and would love to hear your thoughts. Any advice would be incredibly helpful to me. Thanks for your time!

Forward mode autodiff support

I have a use case for forward-mode autodiff (JVP) support in NATTEN, including support for backprop through the JVP. It is for distilling diffusion models that use NATTEN (paper in progress) into consistency models.

I think the JVP of the NATTEN ops can be expressed in terms of the existing CUDA kernels, so there should be no need to write new kernels for this. The tangent of natten2dqk(q_primal, k_primal), for instance, is natten2dqk(q_tangent, k_primal) + natten2dqk(q_primal, k_tangent); the same pattern holds for natten2dav and the kernels for other numbers of dimensions. (I do not know what happens with the rpb, but I don't use it; I use RoPE instead.) Since you provide backward passes for these ops, this also provides the ability to backprop through the JVP output, which I need.
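
To make the product rule concrete, here is a minimal sketch (the functional names natten2dqk and natten2dav follow the issue text; the actual import path and argument lists may differ across NATTEN versions):

# Hedged sketch: composing JVPs for the 2D NA ops out of the existing
# forward kernels. Both ops are bilinear in their two tensor arguments,
# so the product rule yields the exact tangent.
from natten.functional import natten2dqk, natten2dav  # import path assumed

def natten2dqk_jvp(q_primal, q_tangent, k_primal, k_tangent, kernel_size, dilation):
    primal = natten2dqk(q_primal, k_primal, kernel_size, dilation)
    tangent = (natten2dqk(q_tangent, k_primal, kernel_size, dilation)
               + natten2dqk(q_primal, k_tangent, kernel_size, dilation))
    return primal, tangent

def natten2dav_jvp(a_primal, a_tangent, v_primal, v_tangent, kernel_size, dilation):
    primal = natten2dav(a_primal, v_primal, kernel_size, dilation)
    tangent = (natten2dav(a_tangent, v_primal, kernel_size, dilation)
               + natten2dav(a_primal, v_tangent, kernel_size, dilation))
    return primal, tangent

Because each term goes through ops that already have backward passes, backprop through the tangent output comes for free.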

Thank you,
Katherine Crowson

Various checks for NATTEN status do not work

import natten

# Whether NATTEN was built with CUDA
print(natten.has_cuda())

# Whether NATTEN with CUDA was built with support for float16
print(natten.has_half())

# Whether NATTEN with CUDA was built with support for bfloat16
print(natten.has_bfloat())

# Whether NATTEN with CUDA was built with the new GEMM kernels
print(natten.has_gemm())

The natten module does not have any of these attributes; did you change their names?
