yformer / efficientsam Goto Github PK

View Code? Open in Web Editor NEW

2.0K 2.0K 143.0 334.61 MB

EfficientSAM: Leveraged Masked Image Pretraining for Efficient Segment Anything

License: Apache License 2.0

Jupyter Notebook 97.97% Python 2.01% Shell 0.02%

efficientsam's People

Contributors

Stargazers

Watchers

Forkers

tuskaw m8e codeaudit techthiyanes almighty79251 mikey240 mathpopo yes7rose whuhxb ibrandiay darth7sidious ltyanghuang petercao weykon panthole-s-lab prog-ape davidchoi76 istorywar prorhubheadlinte tonyonst-t snowstache cdissensli energyou5 gossipak76 gutsybotstamaha narcommaf-freddarth essents-g tnumber1cali stamaha72 shigherit janssma75 halliele17 eatsyouyugi-girlsimon latinati41 scoopenligotiz captail-dirtypanet l-stripper ladywib leonloves-glenesto beeag80 ledgertwo-stunnaani embrack16 paybaxanekowolfie iminetd37 rrafraf gotiz-50 y-gotiz letsuchem-l ellynnon72lunetes mandralg60 supermario-ai tingtingch adambear hhy5277 rochemedia rkp64 rockystevejobs hadryan healthonrails eltociear ai-machine-vision-lab faisalshahbaz yfh-yufeihu m3bdelwahab asdlei99 robertknight seaotocinclus f901107 torgeirgrini edmuthiah jayyshah tonywhite11 maralzar a7mad-magdy77 ooropuloo labelmeai andrewhazelden hongjunsong1 lianjinke vvhj colm-brandon-ul lpylpy0514 zhuzi6 gu0zj yi-coder18 skwarson96 linhong00316 kp-forks nielsrogge warmtan cunminzhao rashitech ashu-agrawal1 marcbs thesantatitan jpeemaurya mirasbrisas dmmaze navezjt ymzlygw

efficientsam's Issues

The output mask is 3wh. I need to take the masked channel as the result. Or should I merge these channels?

Title: Query on EfficientSAM's Adaptability for Low-Resource Environments

Dear EfficientSAM Contributors,

I hope this message finds you well. I have been closely following the development of EfficientSAM and am thoroughly impressed with the strides made in efficient image segmentation. The recent release of the torchscript version and the accompanying Colab notebook have been particularly helpful.

However, I am curious about the model's performance and adaptability in low-resource environments, which is often a challenge in the field of computer vision. Specifically, I am interested in understanding how EfficientSAM fares on devices with limited computational power and memory constraints.

Could you provide insights or benchmarks on the following aspects?

The model's performance on CPUs with lower clock speeds and fewer cores, as compared to the recommended specifications.
The memory footprint of EfficientSAM when deployed in a constrained environment, and any potential trade-offs that might need to be considered.
Any recommended practices or modifications for optimising EfficientSAM's deployment on such devices without significantly compromising on accuracy or speed.

I believe addressing these queries would greatly benefit researchers and practitioners working in regions with limited access to high-end hardware, thereby broadening the scope of EfficientSAM's applicability.

Thank you for your time and the remarkable work on this project. I look forward to your response and any guidance you can provide.

Best regards,
yihong1120

models for other devices

Hi Amazing work,
can you upload jit models for Mac ("mps") and cpu?
It will be awesome thank you (:

Efficient SAM has no response to negative point prompt

I tested the ‘S’ model on my own dataset.
I intended to use negative points to eliminate some confusing regions.
However, it appears that the model shows almost no response to the negative points.
I quickly glanced through the paper but found no relevant information.
Is this behavior of the model by design, or have I overlooked any important detail?

Using SAM original prompt encoder with EfficientSAM

HEllo, I would like to use masks as input prompts to EfficientSAm, but with what I have seen, it seems that it is not possible with the current implementation.

Could I use the outputs of the original SAM prompt encoder and combine them with the result of the VIT small to get the same results as with all the "official" EfficientSAM stuff?

Regards,
Carlos

What are bounding box labels for?

In this code:

def run_ours_box(img_path, pts_sampled, model):
    bbox = torch.reshape(torch.tensor(pts_sampled), [1, 1, 2, 2])
    bbox_labels = torch.reshape(torch.tensor([2, 3]), [1, 1, 2])
    image = cv2.imread(image_path)
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

    predicted_logits, predicted_iou = model(
        img_tensor[None, ...].cuda(),
        bbox.cuda(),
        bbox_labels.cuda(),
    )

... what are the bbox_labels doing? Why are they [2,3], and why does the model need them? I couldn't find that described in the paper either. Maybe I missed it...

thx!

the efficientSAM model do not support input bounding box-prompt?

the efficientSAM model do not support input multi-bounding box?

Finetuning for downstream segmentation task

Hi there - I noticed the new notebooks added for segment anything example. I am wondering if you can give any advice for how to finetune it on a custom dataset for semantic segmentation? Will there be another notebook released which will cover this?

Thanks

Multi-bbox inference

Thanks for sharing this repo.

On the demo collab file, how we can pass multi bounding boxes to the model as prompt?

I have a widget which gets the bboxs by the users, and I want to pass it to the model like FastSAM.

What does input label mean in code?

`input_points = torch.tensor([[[[580, 350], [650, 350]]]])
input_labels = torch.tensor([[[1, 1]]])

redicted_logits, predicted_iou = model(
sample_image_tensor[None, ...],
input_points,
input_labels,
)
`

How to do Saliency segmentation?

Hello, thank you for your nice job and I want to use effieicnet Sam to Identify the salientest object in the picture and mask it.
But I didn't find this keyword when I searched in the github repository
Best~!

How to inspect the intermediate output of the model?

Hi, firstly very appreciate your work. However, I noticed that the jit encapsulation made it impossible for me to inspect the intermediate output (e.g. encoder feature, decoder feature). Is there any possibility for you to release the model code?

what is the minimum memory required for the GPU

When I use the nvidia 3090 GPU (Memory 24576M) to run the “EfficientSAM_segment_everything_example” code, the model used is vitt, GRID_SIZE = 32， which indicates that the graphics card memory is not enough and cannot run。

when I adjust parameter 32 to 16, it can be run but the segment effect is not good

Regression from original

I've tested your torchscript models on linux with A6000 gpu.

Regression 1:
Running the original sam huge model, I get 3 output masks from a point prompt, but efficient-sam only outputs 1 mask.

Regression 2:
The original sam offers 2 stage inference using SamPredictor.setimage and SamPredictor.predict but efficient-sam has only 1 stage of inference.

Speed:
Original sam (huge) inference stage 1 - about 450ms, stage 2 - 25ms. This means interactive segmentation adjusting points/boxes can be done at 40fps.
Efficient sam (small) inference one stage - about 105ms. Overall 4-5 times faster but now rather blocky for interactive picking.
Efficient sam tiny - about 75ms. Mask quality not so great.

I haven't looked into the mask quality much and whether I should be comparing one of the smaller sam models. I'd be interested to know your thoughts/plans. Thanks for the work.

Expected data type and value range of the input images

Hello I was wondering what is the expected value range for the input tensors. It seems that it is float32, [0,1], but asking just in case.

Thanks a lot for your help.
Regards,
CArlos

will release the pretraining code?

OutOfMemoryError

Hello author. I just tried your latest uploaded code（sample_inference.py） and it runs perfectly. However, when I set the device to cuda, it will prompt that there is not enough graphics memory. May I ask why this is?

how to train our dataset？thanks for your answer

RuntimeError: expected scalar type float but found double

When I run EfficientSAM_example.py file, a RuntimeError appears: expected scale type float but found double

Regarding the issue of EfficientSAM of checkpoints

Hello, I'm grateful for your research.
I have a question about checkpoints. Is the change in weights of efficentsam compared to sam only the image encoder part.
Or can I get the correct result by processing the weights of the image encoder of efficentsam with the weights of the prompt encoder and mask decoder of the original sam?

The inference time significantly increases when using multi-point prompts.

Hello! When I use multiple prompt inputs, the inference speed becomes slow. When the number of prompt inputs is 1, the inference time is approximately 0.11 seconds. However, when the number of prompt inputs is 30, the inference time exceeds 3 seconds. What could be the reason for this?

OutOfMemoryError

Hello author,
when running the sample_inference.py you provided the day before yesterday, it can run on the CPU but shows insufficient graphics memory when placed on the GPU. May I ask why this is happening? Additionally, I would like to ask if it is possible to directly call the
build_efficient_sam if I want to make fine-tuning on my own dataset?

Pre-trained models

Hi, it's a nice work and we're interested in it!
Could you release the pre-trained models (ViT-S or ViT-Ti)? We believe the pre-trained models also matter for the community! 😊

Segment Anything CPP Wrapper for macOS

Thanks for your great work.

Here is the Segment Anything CPP Wrapper for macOS and it corresponds to MobileSAM, EdgeSAM, EfficientSAM, HQ-SAM, and Segment Anything models.

Please let us know your opinion.

How to perform "segment everything" and "salient instance segmentation"??

Dear author, thank for releasing your model and Colab notebook on example usage. But in your Colab notebook, you demonstrated how to perform inference with box prompt and point prompt inputs. What about "segment everything" and "salient instance segmentation"? How do we do that? Can you please provide some examples.

eval on COCO dataset

Hi,

I found that in the paper, there is evaluation results on COCO dataset,

would you mind providing the corresponding code?

Thanks

SqueezeSAM

Great work! I was wondering what's happening with SqueezeSAM as it's been added to this repo and then subsequently removed. I'd like to try it for its salient segmentation features. Thanks!

different in jit to pt model

Why do I find that the JIT model provided initially performs significantly better than the current PT model or the converted ONNX model during testing? Additionally, I noticed three nodes in the JIT model that are missing in the PT models.

grounded-efficient-sam demo support

Hello Yunyang @yformer ! Thanks for your nice work! We've alrealy support grounded-efficient-sam demo under Grounded-Segment-Anything! Hope for the training code release!

Code about Cross-Attention Decoder

Thanks for your great work! I want to know how the Cross-Attention Decoder is implemented. Do we need to determine the positional relationship between mask tokens and unmask tokens?

NameError: name 'img_tensor' is not defined

Getting this error while passing the image to Box Segmentation.
Note: I ran every cell above it.

CoreML

Would it be possible to export this model to CoreML? In my attempt I got:

ValueError: Torch var fusion_type.1 not found in context

Datasets used in pretraining

In Section 4.1, it seems that only IN1K is used for pretraining. But in Table 1, both SA-1b and IN1K are used in pretraining. Which is the correct one?

How to use a text prompt.

Thank you for your amazing research. I would like to use a text prompt. However, when I examined the code, there was no code related to the text prompt. I'm curious if you have any plans to add it in the future or if there's a specific reason why text prompts are not available. Thank you.

Consider hosting model files in repo itself

Congrats on the release of the EfficientSAM models! I noticed the files were hosted in Dropbox and created a quick fork to host in GitHub itself (using the XetData extension):

https://github.com/xetdata/EfficientSAM

This lets you work with the models locally, keep them in the Git repo, and just push up to GitHub without having to upload in separate place. I'd love to work with y'all to do the same with this repo too if you're interested!

torch.onnx.errors.UnsupportedOperatorError: Exporting the operator 'aten::tile' to ONNX opset version 17 is not supported.

Thank you for sharing the great work ! I want to export the onnx model and when I run export_to_onnx.py, the following error encounters:
torch.onnx.errors.UnsupportedOperatorError: Exporting the operator 'aten::tile' to ONNX opset version 18 is not supported. Please feel free to request support or submit a pull request on PyTorch GitHub: https://github.com/pytorch/pytorch/issues.

Based on the documents of ONNX supported TorchScript operators, the aten:tile is supported since opset version 13. Could you give me some suggestions ?

Regarding the performance of SqueezeSAM

Dear author, I see you have released a new model SqueezeSAM. However, I couldn't find much information regarding its performance relative to the other EfficientSAM (Tiny, Small) in your paper or this repo's README. Can you shed some information on this?

Confusion about the pre-training cross-attention's K and V

'keys and values derive from both unmasked features from encoder and masked features.' But I have no idea where do the masked features come from? what do the 'masked features' refer to? And how did you merge these with the unmasked features from the encoder? I can't have a clear idea from the paper.
Thank you a lot!

Why is the decoder_max_num_input_points set to 6

Thanks for the author's effort, this is a very meaningful work, and it has been a great help to me.
I found that there is a parameter decoder_max_num_input_points available in the EfficientSAM, which specifies the maximum number of points accepted by the model. I would like to know why 6 is set here, and if we want to change this parameter, is there a maximum value?
Moreover, the current model does not seem to support negative sample points and mask input. Can you give more details about EfficientSAM's training?

Saliency Segment CODE WANTED

Thank you for you have released the project code , it's a pretty good work ! Could you upload the notebook code of the saliency segment code? Looking forward to you reply!

after load jit file, torchscript runtimeError

File "Grounded-Segment-Anything/EfficientSAM/app.py", line 60, in efficient_sam_box_prompt_segment
predicted_logits, predicted_iou = model(
^^^^^^
File "/root/anaconda3/envs/ml_zb/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: shape '[0, 3, 1, 2]' is invalid for input of size 1572864

after add env like this:
export PYTORCH_NVFUSER_DISABLE=fallback && python EfficientSAM/app.py

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/root/anaconda3/envs/ml_zb/lib/python3.11/site-packages/gradio/queueing.py", line 501, in process_events
response = await self.call_prediction(awake_events, batch)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/ml_zb/lib/python3.11/site-packages/gradio/queueing.py", line 465, in call_prediction
raise Exception(str(error) if show_error else None) from error
Exception: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: thread_predicates_.find(tv_inp) != thread_predicates_.end() INTERNAL ASSERT FAILED at "../third_party/nvfuser/csrc/lower_thread_predicate.cpp":221, please report a bug to PyTorch. Thread predicate map was not initialized, couldn't find T6_l[ 0 ]

can anyone help me ?

Integration to segment_anything package?

Hey,

Thanks for your work and model!
Is there a plan to integrate this model to work with the segment_anything package, specifically with SamAutomaticMaskGenerator?

For example, this is how we use it today:

        self.sam = sam_model_registry["vit_h"](checkpoint=model_checkpoint_path).to(self.device)
        self.mask_generator = SamAutomaticMaskGenerator(self.sam, pred_iou_thresh=0.88, stability_score_thresh=0.8, min_mask_region_area=200)

will we be able to use the same with the EfficiantSAM model?
Thanks!

Expose `multimask_output` in `EfficientSam.forward`

This would involve adding the optional parameter to EfficientSam.forward, and then passing it to EfficientSam.predict_masks.

    def forward(
        self,
        batched_images: torch.Tensor,
        batched_points: torch.Tensor,
        batched_point_labels: torch.Tensor,
        scale_to_original_image_size: bool = True,
        multimask_output: bool = True
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        """
        Predicts masks end-to-end from provided images and prompts.
        If prompts are not known in advance, using SamPredictor is
        recommended over calling the model directly.

        Arguments:
          batched_images: A tensor of shape [B, 3, H, W]
          batched_points: A tensor of shape [B, num_queries, max_num_pts, 2]
          batched_point_labels: A tensor of shape [B, num_queries, max_num_pts]
          multimask_output: If True, generate multiple masks for each query. Otherwise, generate one mask per query.

        Returns:
          A list tuples of two tensors where the ith element is by considering the first i+1 points.
            low_res_mask: A tensor of shape [B, 256, 256] of predicted masks
            iou_predictions: A tensor of shape [B, max_num_queries] of estimated IOU scores
        """
        batch_size, _, input_h, input_w = batched_images.shape
        image_embeddings = self.get_image_embeddings(batched_images)
        return self.predict_masks(
            image_embeddings,
            batched_points,
            batched_point_labels,
            multimask_output=multimask_output,
            input_h=input_h,
            input_w=input_w,
            output_h=input_h if scale_to_original_image_size else -1,
            output_w=input_w if scale_to_original_image_size else -1,
        )

I'm happy to make a PR for this, but I figure it may be easier to just throw this in as part of the ongoing updates you all are making.

Thanks for releasing this code and updating it so frequently!

Edit: I tried the code above, and using multimask_output=False seems to be giving me broken masks, so I'm probably missing something and this may be more involved than I'd thought. The bug could also be in my postprocessing code.

For the dog example image and points, this is what I get with, and without multimask:

Failed to export model to ONNX

I have tried to export the encoder of the model to ONNX, but it informs me that the export has failed. Can anyone who has done relevant work give some advice?

"segment-anything" is slow

Hello, I'm grateful for your research. I tried segment-anything with the code you shared. I though it would be fast, but it shows a very slow speed with an average of 13000ms. Can you tell me the reason why?

from efficient_sam.build_efficient_sam import build_efficient_sam_vits
import zipfile

with zipfile.ZipFile("weights/efficient_sam_vits.pt.zip", 'r') as zip_ref:
    zip_ref.extractall("weights")
efficient_sam_vits_model = build_efficient_sam_vits()
efficient_sam_vits_model.eval()

import os
image_path = "dataset/data1"
image_list = os.listdir(image_path)

import time
for image_name in image_list:
    image_path1 = os.path.join(image_path, image_name)
    st = time.time()
    mask_efficient_sam_vits = run_everything_ours(image_path1, efficient_sam_vits_model)
    et = time.time()
    print('inference time:', (et - st)*1000)

Output:

inference time: 17762.072563171387
inference time: 13303.216695785522
inference time: 13251.076221466064
inference time: 12843.31202507019

image_size : 640x640
GPU : RTX A6000