I've been trying to build a docker image by following the steps from INSTALL.md, but I

Can't set up the environment (oneformer issue, closed)

nikolaydyankov commented on May 31, 2024

Can't set up the environment

from oneformer.

Comments (6)

nikolaydyankov commented on May 31, 2024

On a side note, adding a Dockerfile to the /demo folder would be amazing.


nikolaydyankov commented on May 31, 2024

Another Dockerfile, with a different error:

#0 16.18 /usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py:381: UserWarning: Attempted to use ninja as the BuildExtension backend but we could not find ninja.. Falling back to using the slow distutils backend.
#0 16.18   warnings.warn(msg.format('we could not find ninja.'))
#0 16.18 Traceback (most recent call last):
#0 16.18   File "setup.py", line 69, in <module>
#0 16.18     setup(
#0 16.18   File "/usr/local/lib/python3.8/dist-packages/setuptools/__init__.py", line 153, in setup
#0 16.18     return distutils.core.setup(**attrs)
#0 16.18   File "/usr/lib/python3.8/distutils/core.py", line 148, in setup
#0 16.18     dist.run_commands()
#0 16.18   File "/usr/lib/python3.8/distutils/dist.py", line 966, in run_commands
#0 16.18     self.run_command(cmd)
#0 16.18   File "/usr/lib/python3.8/distutils/dist.py", line 985, in run_command
#0 16.18     cmd_obj.run()
#0 16.18   File "/usr/lib/python3.8/distutils/command/build.py", line 135, in run
#0 16.18     self.run_command(cmd_name)
#0 16.18   File "/usr/lib/python3.8/distutils/cmd.py", line 313, in run_command
#0 16.18     self.distribution.run_command(command)
#0 16.18   File "/usr/lib/python3.8/distutils/dist.py", line 985, in run_command
#0 16.18     cmd_obj.run()
#0 16.18   File "/usr/local/lib/python3.8/dist-packages/setuptools/command/build_ext.py", line 79, in run
#0 16.18     _build_ext.run(self)
#0 16.18   File "/usr/local/lib/python3.8/dist-packages/Cython/Distutils/old_build_ext.py", line 186, in run
#0 16.18     _build_ext.build_ext.run(self)
#0 16.18   File "/usr/lib/python3.8/distutils/command/build_ext.py", line 340, in run
#0 16.18     self.build_extensions()
#0 16.18   File "/usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py", line 735, in build_extensions
#0 16.18     build_ext.build_extensions(self)
#0 16.18   File "/usr/local/lib/python3.8/dist-packages/Cython/Distutils/old_build_ext.py", line 195, in build_extensions
#0 16.18     _build_ext.build_ext.build_extensions(self)
#0 16.18   File "/usr/lib/python3.8/distutils/command/build_ext.py", line 449, in build_extensions
#0 16.18     self._build_extensions_serial()
#0 16.18   File "/usr/lib/python3.8/distutils/command/build_ext.py", line 474, in _build_extensions_serial
#0 16.18     self.build_extension(ext)
#0 16.18   File "/usr/local/lib/python3.8/dist-packages/setuptools/command/build_ext.py", line 202, in build_extension
#0 16.18     _build_ext.build_extension(self, ext)
#0 16.18   File "/usr/lib/python3.8/distutils/command/build_ext.py", line 528, in build_extension
#0 16.18     objects = self.compiler.compile(sources,
#0 16.18   File "/usr/lib/python3.8/distutils/ccompiler.py", line 574, in compile
#0 16.18     self._compile(obj, src, ext, cc_args, extra_postargs, pp_opts)
#0 16.18   File "/usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py", line 483, in unix_wrap_single_compile
#0 16.18     cflags = unix_cuda_flags(cflags)
#0 16.18   File "/usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py", line 450, in unix_cuda_flags
#0 16.18     cflags + _get_cuda_arch_flags(cflags))
#0 16.18   File "/usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py", line 1606, in _get_cuda_arch_flags
#0 16.18     arch_list[-1] += '+PTX'
#0 16.18 IndexError: list index out of range
------
failed to solve: executor failed running [/bin/sh -c cd oneformer/modeling/pixel_decoder/ops &&     sh ./make.sh]: exit code: 1

And here is the Dockerfile:

FROM nvidia/cuda:11.3.1-devel-ubuntu20.04

RUN apt-get update && apt-get upgrade -y
RUN apt-get install -y --no-install-recommends \
    python3 python3-pip python3-dev build-essential \
    libgomp1 \
    git
RUN update-alternatives --install /usr/bin/python python /usr/bin/python3 1
RUN python -m pip install --upgrade pip wheel

# Install PyTorch 1.10.1 and torchvision 0.11.2 with CUDA 11.3 support
RUN python -m pip install torch==1.10.1+cu113 torchvision==0.11.2+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html

# Clone the OneFormer repository
RUN git clone https://github.com/SHI-Labs/OneFormer.git /OneFormer
RUN cd /OneFormer
WORKDIR /OneFormer

# Install detectron2 and other dependencies
RUN python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'
RUN pip install git+https://github.com/cocodataset/panopticapi.git
RUN pip install git+https://github.com/mcordts/cityscapesScripts.git
RUN pip install -r requirements.txt

# Set up wandb
RUN pip install wandb
#ENV WANDB_API_KEY=...
#RUN wandb login

# Setup MSDeformAttn
ENV CUDA_HOME=/usr/local/cuda-11.3
ENV FORCE_CUDA=1
RUN cd oneformer/modeling/pixel_decoder/ops && \
    sh ./make.sh

# Set the default command to run when starting the container
CMD ["/bin/bash"]
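A note on the IndexError at the end of the traceback: it looks consistent with no GPU being visible during `docker build` while TORCH_CUDA_ARCH_LIST is unset, so the list of detected architectures ends up empty before `+PTX` is appended to its last entry. The sketch below is a simplified paraphrase of that logic, not the actual PyTorch source; `detect_arch_list` and its arguments are made-up names for illustration.

```python
def detect_arch_list(env_value, visible_gpu_capabilities):
    """Simplified stand-in for how an arch list might be chosen.

    env_value: value of TORCH_CUDA_ARCH_LIST, or None if unset.
    visible_gpu_capabilities: (major, minor) tuples of GPUs visible to the build.
    """
    if env_value:
        # An explicit list wins; entries may be separated by ';' or spaces.
        return env_value.replace(" ", ";").split(";")
    arch_list = [f"{major}.{minor}" for major, minor in sorted(visible_gpu_capabilities)]
    arch_list[-1] += "+PTX"  # raises IndexError when no GPU is visible
    return arch_list


try:
    detect_arch_list(None, [])  # docker build: no GPU, variable unset
except IndexError as exc:
    print("build would fail:", exc)

print(detect_arch_list("7.5+PTX", []))  # explicit list, no GPU needed
```

If that reading is right, setting TORCH_CUDA_ARCH_LIST before running make.sh avoids the GPU lookup at build time.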


nikolaydyankov commented on May 31, 2024

This is as far as I got:

FROM nvidia/cuda:11.3.1-devel-ubuntu20.04

# Set environment variables
ENV DEBIAN_FRONTEND=noninteractive
ENV LANG=C.UTF-8
ENV LC_ALL=C.UTF-8

# Update package list and install dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    wget \
    ca-certificates \
    git \
    build-essential \
    libglib2.0-0 \
    libsm6 \
    libxext6 \
    libxrender1 \
    libyaml-cpp-dev \
    libopencv-dev \
    && rm -rf /var/lib/apt/lists/*

# Install GCC, G++ 9
RUN apt-get update && apt-get install -y --no-install-recommends \
    gcc-9 \
    g++-9 \
    && rm -rf /var/lib/apt/lists/* \
    && update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-9 100 \
    && update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-9 100

# Install conda 4.12.0
RUN wget https://repo.anaconda.com/miniconda/Miniconda3-py38_4.12.0-Linux-x86_64.sh -O miniconda.sh \
    && chmod +x miniconda.sh \
    && ./miniconda.sh -b -p /opt/conda \
    && rm miniconda.sh \
    && /opt/conda/bin/conda clean -tipsy \
    && ln -s /opt/conda/etc/profile.d/conda.sh /etc/profile.d/conda.sh \
    && echo ". /opt/conda/etc/profile.d/conda.sh" >> ~/.bashrc \
    && echo "conda activate base" >> ~/.bashrc

# Set some environment variables
ENV PATH /opt/conda/bin:$PATH
ENV WANDB_API_KEY=...
ENV CUDA_HOME=/usr/local/cuda
ENV FORCE_CUDA=1

# Clone OneFormer repository and set working directory
RUN git clone https://github.com/SHI-Labs/OneFormer.git /OneFormer
WORKDIR /OneFormer

# Install dependencies
RUN conda install pytorch==1.10.1 torchvision==0.11.2 cudatoolkit=11.3 -c pytorch -c conda-forge
RUN pip3 install -U opencv-python
RUN python -m pip install detectron2 -f https://dl.fbaipublicfiles.com/detectron2/wheels/cu113/torch1.10/index.html
RUN pip3 install git+https://github.com/cocodataset/panopticapi.git
RUN pip3 install git+https://github.com/mcordts/cityscapesScripts.git
RUN pip3 install -r requirements.txt
#RUN pip3 install wandb
#RUN wandb login
RUN pip3 install colormap
RUN pip3 install easydev

# Setup MSDeformAttn
ENV TORCH_CUDA_ARCH_LIST="6.0;6.1;6.2;7.0;7.5;8.0;8.6+PTX"
RUN cd oneformer/modeling/pixel_decoder/ops && \
    chmod +x make.sh && \
    ./make.sh

# Downgrade numpy
RUN pip3 uninstall numpy -y
RUN pip3 install numpy==1.23.1

# Set the default command to run when starting the container
CMD ["/bin/bash"]

This image works, but the model can't be trained on an RTX 4090 because of a bug in PyTorch:

Traceback (most recent call last):
  File "/OneFormer/workspace/oneformer-scripts/train.py", line 448, in <module>
    trainer.train()
  File "/opt/conda/lib/python3.8/site-packages/detectron2/engine/defaults.py", line 484, in train
    super().train(self.start_iter, self.max_iter)
  File "/opt/conda/lib/python3.8/site-packages/detectron2/engine/train_loop.py", line 149, in train
    self.run_step()
  File "/opt/conda/lib/python3.8/site-packages/detectron2/engine/defaults.py", line 494, in run_step
    self._trainer.run_step()
  File "/opt/conda/lib/python3.8/site-packages/detectron2/engine/train_loop.py", line 395, in run_step
    loss_dict = self.model(data)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/OneFormer/oneformer/oneformer_model.py", line 296, in forward
    losses = self.criterion(outputs, targets)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/OneFormer/oneformer/modeling/criterion.py", line 306, in forward
    indices = self.matcher(aux_outputs, targets)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/OneFormer/oneformer/modeling/matcher.py", line 202, in forward
    return self.memory_efficient_forward(outputs, targets)
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/OneFormer/oneformer/modeling/matcher.py", line 161, in memory_efficient_forward
    cost_mask = batch_sigmoid_ce_loss_jit(out_mask, tgt_mask)
RuntimeError: nvrtc: error: invalid value for --gpu-architecture (-arch)

nvrtc compilation failed:

#define NAN __int_as_float(0x7fffffff)
#define POS_INFINITY __int_as_float(0x7f800000)
#define NEG_INFINITY __int_as_float(0xff800000)


template<typename T>
__device__ T maximum(T a, T b) {
  return isnan(a) ? a : (a > b ? a : b);
}

template<typename T>
__device__ T minimum(T a, T b) {
  return isnan(a) ? a : (a < b ? a : b);
}

extern "C" __global__
void fused_neg_add(float* ttargets_1, float* aten_add) {
{
  float v = __ldg(ttargets_1 + (long long)(threadIdx.x) + 512ll * (long long)(blockIdx.x));
  aten_add[(long long)(threadIdx.x) + 512ll * (long long)(blockIdx.x)] = (0.f - v) + 1.f;
}
}

I can't update the CUDA version; otherwise MSDeformAttn doesn't build. This is the issue in the PyTorch repo: pytorch/pytorch#87595 (comment)


praeclarumjj3 commented on May 31, 2024

Hi @nikolaydyankov, thanks for your interest in our work. Did you take a look at the Dockerfile used for hosting our HuggingFace Space demo? If not, it might be worth a look.


praveenVnktsh commented on May 31, 2024

I'm running into the same problem with the architecture mismatch and am unable to run on an RTX 4090. I've temporarily replaced all the JIT functions with regular functions and it runs, but it's very slow.
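For anyone wanting to try the same workaround without editing every call site, here is a minimal sketch of a switchable fallback. It assumes the `*_jit` functions in the traceback are produced by wrapping plain functions with `torch.jit.script`; `maybe_script` and the `ONEFORMER_DISABLE_JIT` variable are hypothetical names, not part of OneFormer or PyTorch.

```python
import os


def maybe_script(fn, use_jit=None):
    """Return fn compiled with torch.jit.script, or fn unchanged when JIT is off."""
    if use_jit is None:
        # Hypothetical switch: set ONEFORMER_DISABLE_JIT=1 to skip scripting.
        use_jit = os.environ.get("ONEFORMER_DISABLE_JIT", "0") != "1"
    if not use_jit:
        return fn  # plain Python/eager function, no runtime kernel compilation
    import torch  # imported lazily so the fallback path does not need it here
    return torch.jit.script(fn)


def scaled_sum(values, scale):
    """Stand-in for a loss function; any plain function works the same way."""
    return sum(values) * scale


scaled_sum_fn = maybe_script(scaled_sum, use_jit=False)
print(scaled_sum_fn([1.0, 2.0], 0.5))  # 1.5
```

With scripting disabled the function runs eagerly, which matches the slowdown described above.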


linzy5 commented on May 31, 2024

@nikolaydyankov Hi, I ran into exactly the same problem. Thanks to @praeclarumjj3, I found the key point in the Dockerfile used for OneFormer's Hugging Face Space.
The two key commands in that Dockerfile are:

ARG TORCH_CUDA_ARCH_LIST=7.5+PTX
RUN cd /path/to/ops && FORCE_CUDA=1 python setup.py build install

TORCH_CUDA_ARCH_LIST apparently needs to be changed to match your GPU and CUDA version.
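To make that concrete: TORCH_CUDA_ARCH_LIST is a semicolon-separated list of compute capabilities, optionally with `+PTX` on an entry, as in the `7.5+PTX` value above. The sketch below formats such a value and clamps capabilities the toolkit cannot target. `build_arch_list` is a hypothetical helper, and the default cap of 8.6 for CUDA 11.3 is an assumption to verify against NVIDIA's documentation (an RTX 4090 reports 8.9).

```python
def build_arch_list(capabilities, max_supported=(8, 6)):
    """Format compute capabilities as a TORCH_CUDA_ARCH_LIST value.

    capabilities: iterable of (major, minor) tuples, e.g. [(7, 5), (8, 9)].
    max_supported: newest capability the installed CUDA toolkit can target
    (assumed here to be 8.6 for CUDA 11.3).
    """
    # Clamp anything newer than the toolkit supports down to the newest target.
    clamped = {min(cap, max_supported) for cap in capabilities}
    if not clamped:
        raise ValueError("need at least one compute capability")
    archs = [f"{major}.{minor}" for major, minor in sorted(clamped)]
    archs[-1] += "+PTX"  # keep PTX for the newest entry so newer GPUs can still load it
    return ";".join(archs)


print(build_arch_list([(7, 5)]))          # 7.5+PTX
print(build_arch_list([(7, 5), (8, 9)]))  # 7.5;8.6+PTX
```

This only concerns how the MSDeformAttn extension is compiled. It would not by itself address the nvrtc error raised at training time, which comes from kernels compiled at runtime.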

