gkordo / s2vs

Authors' official PyTorch implementation of "Self-Supervised Video Similarity Learning" [CVPRW 2023]

License: MIT License

Python 99.51% Shell 0.49%

duplicate-videos fivr ndvr self-supervised-learning self-supervision video-detection video-retrieval video-search video-similarity video-similarity-learning

s2vs's Issues

Toy example with two videos

Hi,
I tested your pretrained model using the two videos inside data/examples.

Starting from the suggestions you provided, I wrote the following code

import torch
from utils import load_video
# SimilarityNetwork is used below; assumed to live in the same module as ViSiL
from model.similarity_network import SimilarityNetwork
import evaluation # This is your evaluation.py module (not aliased to `eval`, which shadows the built-in)


feat_extractor = torch.hub.load('gkordo/s2vs:main', 'resnet50_LiMAC').to('cuda')
s2vs_dns = torch.hub.load('gkordo/s2vs:main', 's2vs_dns')
s2vs_vcdb = torch.hub.load('gkordo/s2vs:main', 's2vs_vcdb')

# Load the two videos from the video files
query_video = torch.from_numpy(load_video('./data/examples/video1/'))
target_video = torch.from_numpy(load_video('./data/examples/video2/'))

# Initialize pretrained ViSiL model
#model = ViSiL(pretrained='s2vs_dns').to('cuda')
model = SimilarityNetwork['ViSiL'].get_model(pretrained='s2vs_dns').to('cuda')
model.eval()

# Extract features of the two videos (the extractor is already on the GPU)
query_features = evaluation.extract_features(feat_extractor, query_video.to('cuda'))
target_features = evaluation.extract_features(feat_extractor, target_video.to('cuda'))

# Calculate similarity between the two videos
similarity = model.calculate_video_similarity(query_features, target_features)
print(similarity)

The results I got are:

similarity = 0.9781 when comparing video1 to video1 itself
similarity = 0.7917 when comparing video1 to video2

Since video1 and video2 are completely different, I would have expected a lower similarity score.
I'm mainly interested in the copy detection task, and I wonder whether 0.79 can actually be considered a "low value", so that I can argue that the two videos are not potential copies.

Maybe I'm missing something or my code is wrong.

Any help would be really appreciated.

Thank you again for this work
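
One way to judge whether a score such as 0.79 is "low" is to calibrate a decision threshold on pairs whose labels are already known, instead of reading a single raw score in isolation. The sketch below is plain Python and is not part of the s2vs code: the function name `pick_threshold` is hypothetical, and the scores in the usage lines are made-up placeholders that should be replaced with scores computed on your own labelled pairs.

```python
def pick_threshold(copy_scores, non_copy_scores):
    """Return the threshold that best separates known copies from non-copies.

    Candidates are the midpoints between neighbouring observed scores; the
    candidate with the highest accuracy on the labelled pairs is returned.
    """
    labelled = [(s, True) for s in copy_scores] + [(s, False) for s in non_copy_scores]
    values = sorted({score for score, _ in labelled})
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    if not candidates:
        raise ValueError("need at least two distinct scores")

    def accuracy(threshold):
        # A pair is predicted to be a copy when its score reaches the threshold
        correct = sum((score >= threshold) == is_copy for score, is_copy in labelled)
        return correct / len(labelled)

    return max(candidates, key=accuracy)


# Made-up scores for illustration only
copy_scores = [0.98, 0.95, 0.91]
non_copy_scores = [0.79, 0.74, 0.70]
threshold = pick_threshold(copy_scores, non_copy_scores)
print(threshold)
```

With a threshold chosen this way, a new pair would be flagged as a potential copy only when its similarity reaches the threshold. The more labelled pairs are available, the more trustworthy the cut-off.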

How to deal with large videos

Hi Giorgos,
thanks again for this work and for your support in my previous issues.

After some experiments on custom videos, I'm struggling to calculate the similarity between large videos (>7 min), due to the limited amount of memory on my GPU.

In fact, when I try to process such videos, I get a "CUDA out of memory" error.

I managed to overcome this issue in the feature extraction part by setting fps=1 and splitting the query and target videos into N chunks, computing the features for each chunk, and then stacking the N feature tensors into a single feature tensor (does that make sense?).

But when it comes to the similarity part, specifically with the calculate_video_similarity function, I get the above error.

Do you have any suggestions on how to optimize the similarity part for such videos?

I guess that splitting the query and target videos into several chunks and computing the similarity between chunks would not result in a meaningful similarity check, but maybe I'm wrong.

Thanks a lot.

EDIT: After further investigation, it seems that what is causing the error is the torch.einsum operation inside the frame_to_frame_similarity function.
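
If the einsum is indeed the bottleneck, one possible workaround is to compute the frame-to-frame similarity matrix over chunks of query frames, so that the large region-to-region intermediate tensor is never held in memory all at once. The sketch below is not the repository's `frame_to_frame_similarity`; it assumes features shaped [frames, regions, channels] and a Chamfer-style pooling over regions (max over target regions, mean over query regions), and the function name `chunked_frame_similarity` and the `chunk_size` parameter are hypothetical.

```python
import torch


def chunked_frame_similarity(query, target, chunk_size=64):
    """Frame-to-frame similarity computed over chunks of query frames.

    query:  [Tq, R, D] region features per frame (assumed layout)
    target: [Tt, R, D]
    Returns a [Tq, Tt] matrix. Only a [chunk, R, Tt, R] intermediate is
    alive at any time instead of the full [Tq, R, Tt, R] tensor.
    """
    rows = []
    for start in range(0, query.shape[0], chunk_size):
        q = query[start:start + chunk_size]
        # Region-to-region similarities for this chunk: [chunk, R, Tt, R]
        regions = torch.einsum('iok,jpk->iojp', q, target)
        # Best matching target region, then average over query regions
        rows.append(regions.max(dim=3).values.mean(dim=1))
    return torch.cat(rows, dim=0)


# Small random tensors to compare the chunked result with the one-shot version
torch.manual_seed(0)
query = torch.randn(10, 9, 16)
target = torch.randn(7, 9, 16)
full = torch.einsum('iok,jpk->iojp', query, target).max(dim=3).values.mean(dim=1)
chunked = chunked_frame_similarity(query, target, chunk_size=4)
```

Because each row of the matrix depends on a single query frame, chunking over query frames gives the same matrix as the one-shot computation, unlike scoring video chunks independently. Running inference inside `torch.no_grad()` also avoids storing activations for backpropagation, which reduces memory use further.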

Test on a custom dataset

Hi, and thank you for this amazing work.

I was trying to test your code on a custom toy dataset with a couple of "original" videos and a couple of "altered" videos derived from the originals.

Any suggestions on how to structure the dataset? I downloaded the VCDB dataset and, looking at your vcdb.py, I suppose that you serialized the dataset into a pickle file. Can you provide some additional info about this process?

I was thinking of organizing my custom dataset similarly to VCDB, and from your code I see that you enforced the following structure:

self.queries = dataset['queries']
self.positives = dataset['positives']
self.dataset = dataset['dataset']

I suppose that 'queries' are the altered videos and 'dataset' the original videos, but:

What does 'positives' stand for? If it is the ground truth, maybe I can skip it, since in this real-world scenario I don't have GT.
In which format should I encode the videos under these keys? Are they simple lists containing the paths to the videos? For instance:
dataset['queries'] = ['.../mydataset/video1.mp4', '.../mydataset/video2.mp4',...] ?

Thank you.
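
For illustration, a toy pickle with the three keys read by the snippet above could be built as follows. The thread does not state the format that vcdb.py expects, so everything here is an assumption: the video ids are hypothetical strings, 'queries' and 'dataset' are assumed to be lists of ids, and 'positives' is assumed to map each query id to the set of dataset ids that are its true matches.

```python
import os
import pickle
import tempfile

# Hypothetical layout; the real vcdb.py may expect different types or ids
toy_dataset = {
    'queries': ['altered_1', 'altered_2'],
    'dataset': ['original_1', 'original_2'],
    'positives': {
        'altered_1': {'original_1'},
        'altered_2': {'original_2'},
    },
}

# Serialize to a temporary location, then read it back the way a loader would
path = os.path.join(tempfile.mkdtemp(), 'toy_dataset.pickle')
with open(path, 'wb') as f:
    pickle.dump(toy_dataset, f)

with open(path, 'rb') as f:
    loaded = pickle.load(f)

queries = loaded['queries']
positives = loaded['positives']
database = loaded['dataset']
```

Without ground truth, a 'positives' entry would only matter for computing evaluation metrics; ranking the database videos by similarity for each query does not need it.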

Did you see that paper?

https://www.semanticscholar.org/reader/93d3b14ac1c5b86dc7ea2eef54d4541e4e2df57f

This work seems to be very similar.
Did you take inspiration from it?

How do you think your method performs compared with this work?

Releasing the code as open source already makes a huge difference.

Anyway, your paper is very insightful!
