I have noticed that the published dataset contains many low-quality pairs, such as pairs with very large differences between the initial and final frames, or pairs whose initial frame is the very beginning of a video (with no substantive content). These instances are likely due to training noise. Are there any further filtering methods available?
Additionally, each video in ChangeIt contains multiple shots. How can we prevent the model from combining content from different shots into a single pair? Are more granular annotations available in ChangeIt (such as per-shot boundaries), or do the models provided by ChangeIt have this capability? Alternatively, can the dataset be pre-clipped using other methods?
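To illustrate the pre-clipping idea, here is a minimal sketch of histogram-based shot boundary detection, followed by a check that both frames of a pair fall inside the same shot. This is not part of ChangeIt. The function names, the 16-bin histogram, and the 0.5 threshold are all hypothetical choices, and frames are assumed to be flat sequences of grayscale values in 0-255.

```python
from typing import List, Sequence, Tuple


def gray_histogram(frame: Sequence[int], bins: int = 16) -> List[float]:
    """Normalized intensity histogram of a flat grayscale frame (values 0-255)."""
    counts = [0] * bins
    for value in frame:
        counts[min(value * bins // 256, bins - 1)] += 1
    total = len(frame) or 1  # avoid division by zero on an empty frame
    return [c / total for c in counts]


def detect_shot_boundaries(
    frames: Sequence[Sequence[int]], threshold: float = 0.5, bins: int = 16
) -> List[int]:
    """Return every index i at which a new shot is assumed to start.

    A boundary is declared when the histogram distance between frame i-1
    and frame i (half the L1 distance, so it lies in [0, 1]) exceeds
    `threshold`. The threshold is an assumption and needs tuning.
    """
    boundaries = []
    prev = None
    for i, frame in enumerate(frames):
        hist = gray_histogram(frame, bins)
        if prev is not None:
            distance = 0.5 * sum(abs(a - b) for a, b in zip(prev, hist))
            if distance > threshold:
                boundaries.append(i)
        prev = hist
    return boundaries


def split_into_shots(num_frames: int, boundaries: Sequence[int]) -> List[Tuple[int, int]]:
    """Turn boundary indices into half-open (start, end) frame ranges."""
    starts = [0] + list(boundaries)
    ends = list(boundaries) + [num_frames]
    return [(s, e) for s, e in zip(starts, ends) if e > s]


def same_shot(shots: Sequence[Tuple[int, int]], initial_idx: int, final_idx: int) -> bool:
    """True if the initial and final frame of a pair lie in the same shot."""
    return any(s <= initial_idx < e and s <= final_idx < e for s, e in shots)


# Toy example: three dark frames followed by three bright frames.
frames = [[10] * 64] * 3 + [[200] * 64] * 3
shots = split_into_shots(len(frames), detect_shot_boundaries(frames))
keep_pair = same_shot(shots, initial_idx=1, final_idx=4)  # pair spans two shots
```

Pairs for which `same_shot` returns `False` would be dropped before training. On real videos, a dedicated shot detection tool would probably be more robust than this simple histogram test.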
Looking forward to your response, thank you very much.