
deep learning sex position classifier

License: Apache License 2.0

Python 98.96% Dockerfile 1.04%
action-recognition deep-learning porn-filter pornhub pytorch video-classification video-understanding human-action-recognition sex sex-classifier

phar's Introduction

P-HAR: Porn Human Action Recognition

Update

โญ In the meantime, I've trained models that surpass 94% accuracy on 20 action categories. They are readily available via an easy-to-use API. Get in touch for more details!

How this AI can benefit you:

  1. 🏷️ Automated Tagging: Can easily be extended to more categories as required.
  2. ⏱️ Automated Timestamp Generation: Allows users to swiftly navigate to any section of the video, offering a user-friendly experience akin to YouTube's.
  3. 🔍 Improved Recommendation System: Enhances content suggestions by analyzing the occurrences and timings within the video, providing more relevant and tailored recommendations.
  4. 🚫 Content Filtering: Facilitates the filtering of specific content, such as non-sexual content or certain actions and positions, allowing for a more personalized user experience.
  5. 🎞️ Shorts: Enables the extraction of specific actions from videos to create concise and engaging clips, a feature particularly popular among Gen Z users.

If you're interested in some of the technical details of the first version, read on!

Introduction

This is just a fun side project to see how state-of-the-art (SOTA) Human Action Recognition (HAR) models fare in the pornographic domain. HAR is a relatively new, active field of research in deep learning, its goal being the identification of human actions from various input streams (e.g. video or sensor data).

The pornography domain is interesting from a technical perspective because of its inherent difficulties. Light variations, occlusions, and a tremendous variety of camera angles and filming techniques (POV, dedicated camera person) make position (action) recognition hard. Two identical positions (actions) can be captured from such different camera perspectives that the model gets entirely confused in its predictions.

This repository uses three different input streams in order to get the best possible results: RGB frames, human skeleton, and audio. Correspondingly, three different models are trained on these input streams and their results are merged through late fusion.

The best accuracy currently reached by this multimodal model is 75.64%, which is promising considering the small training set. This result will be improved in the future.

The models work on spatio-temporal data, meaning that they process video clips rather than single images (miles-deep, for example, uses single images). This is an inherently superior way of performing action recognition.

Currently, 17 actions are supported. You can find the complete list here. More data would be needed to further improve the models (help is welcomed). Read on for more information!

Supported Features

First download the human detector here, pose model here, and HAR models here. Then move them inside the checkpoints/har folder.

Or just use a Docker container built from the provided image.

Video Demo

Input a video and get a demo with the top predictions every 7 seconds by default.

python src/demo/multimodial_demo.py video.mp4 demo.mp4

Alternatively, the results can be dumped to a JSON file by specifying a .json output file instead.

If you only want to use the RGB & Skeleton model, then you can disable the audio model like so:

python src/demo/multimodial_demo.py video.mp4 demo.json --audio-checkpoint '' --coefficients 0.5 1.0 --verbose

Check out the detailed usage.

Timestamp Generator

Use the --timestamps flag:

python src/demo/multimodial_demo.py video.mp4 demo.json --timestamps
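Conceptually, the timestamps come from merging consecutive clips that share the same top prediction. A minimal illustration of that idea (not the script's actual implementation; clip_labels and subclip_len are hypothetical inputs):

def merge_timestamps(clip_labels, subclip_len=7):
    """Turn per-clip top-1 labels into (label, start_s, end_s) segments."""
    segments = []
    for i, label in enumerate(clip_labels):
        start, end = i * subclip_len, (i + 1) * subclip_len
        if segments and segments[-1][0] == label:
            # the same action continues: extend the previous segment
            segments[-1] = (label, segments[-1][1], end)
        else:
            segments.append((label, start, end))
    return segments

# e.g. merge_timestamps(['kissing', 'kissing', 'missionary'])
# -> [('kissing', 0, 14), ('missionary', 14, 21)]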

Tag Generator

Given the predictions generated by the multimodal demo (in JSON), we can grab the top 3 tags (by default) like so:

python src/top_tags.py demo.json
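Under the hood, the idea is simply to aggregate the per-clip scores stored in the JSON and keep the k strongest classes. A rough sketch of that aggregation (the JSON layout used here is an assumption, not the script's exact schema):

import json
from collections import Counter

def top_tags(json_path, topk=3):
    """Aggregate per-clip predictions and return the k strongest labels."""
    with open(json_path) as f:
        predictions = json.load(f)  # assumed layout: {clip_id: [[label, score], ...]}
    totals = Counter()
    for clip_preds in predictions.values():
        for label, score in clip_preds:
            totals[label] += score  # weight each occurrence by its confidence
    return [label for label, _ in totals.most_common(topk)]

print(top_tags('demo.json'))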

Check out the detailed usage.

Content Filtering

TODO: depends on whether people need it.

Deployment

Depends on whether people find this project useful. Currently, one has to install the relevant libraries to use these models. See the installation section below.

Motivation & Usages

The idea behind this project is to try and apply the latest deep learning techniques (i.e. human action recognition) in the pornographic domain.

Once we have detailed information about the kind of actions/positions happening in a video, a number of use cases open up:

  1. Improving the recommender system
  2. Automatic tag generator
  3. Automatic timestamp generator (when does an action start and finish)
  4. Cutting content out (for example non-sexual content)

Installation

Docker

Build the Docker image: docker build -f docker/Dockerfile . -t rlleshi/phar
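Once the image is built (or pulled), the demo can be run inside a container, for example (the video file must of course be reachable inside the container, e.g. via a volume mount):

docker run --gpus all rlleshi/phar python src/demo/multimodial_demo.py video.mp4 demo.json --timestamps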

Manual Installation

This project is based on MMAction2.

The following installation instructions are for Ubuntu (and hence should also work on Windows WSL). Check the links for details if you are interested in other operating systems.

  1. Clone this repo and its submodules: git clone --recurse-submodules git@github.com:rlleshi/phar.git and then create an environment with Python 3.8+.
  2. Install PyTorch (it is of course recommended that you have CUDA & cuDNN installed).
  3. Install the correct version of mmcv based on your CUDA & Torch, e.g. pip install mmcv-full==1.3.18 -f https://download.openmmlab.com/mmcv/dist/cu113/torch1.10.0/index.html
  4. Install MMAction2: cd mmaction2/ && pip install cython --no-cache-dir && pip install --no-cache-dir -e .
  5. Install MMPose, link.
  6. Install MMDetection, link.
  7. Install extra dependencies: pip install -r requirements/extra.txt.
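After step 7, a quick sanity check is to import the core packages and compare their versions against the combination referenced by this repo and its issues (mmcv 1.3.18, mmdet 2.12.0, mmpose 0.22.0, mmaction2 0.23.0). A minimal check, assuming the packages above are installed:

# environment sanity check
import torch
import mmcv
import mmdet
import mmpose
import mmaction

print('torch:', torch.__version__, '| CUDA available:', torch.cuda.is_available())
print('mmcv:', mmcv.__version__)            # 1.3.18 referenced in the install steps
print('mmdet:', mmdet.__version__)          # 2.12.0 reported working
print('mmpose:', mmpose.__version__)        # 0.22.0 reported working
print('mmaction2:', mmaction.__version__)   # 0.23.0 reported working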

Models

The SOTA results are achieved by late-fusing three models based on three input streams. This results in significant improvements compared to only using an RGB-based model. Since more than one action might happen at the same time (and, moreover, some of the actions/positions are currently conceptually overlapping), it is best to consider the top-2 accuracy as the performance measurement. Hence, the multimodal model currently has an accuracy of ~75%. However, since the dataset is quite small and only ~50 experiments have been performed in total, there is a lot of room for improvement.

Multi-Modal (RGB + Skeleton + Audio)

The best performing models (performance- and runtime-wise) are TimeSformer for the RGB stream, PoseC3D for the skeleton stream, and ResNet101 for the audio stream. The results of these models are merged through late fusion. The models do not have the same importance in the late fusion scoring scheme: the currently fine-tuned weights are 0.5, 0.6, and 1.0 for the RGB, skeleton & audio models respectively.
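In practice, late fusion here just means taking a weighted sum of the per-class scores of the three models before ranking the classes. A minimal sketch of that scheme (this mirrors the idea, not the demo's exact code; in the actual setup the skeleton and audio models only score a subset of the 17 classes, so their scores would first have to be mapped onto the full label set):

import numpy as np

def late_fusion(rgb_scores, skeleton_scores, audio_scores,
                coefficients=(0.5, 0.6, 1.0)):
    """Weighted sum of per-class scores; a larger coefficient means a more trusted stream."""
    fused = (coefficients[0] * np.asarray(rgb_scores) +
             coefficients[1] * np.asarray(skeleton_scores) +
             coefficients[2] * np.asarray(audio_scores))
    return np.argsort(fused)[::-1]  # class indices ranked best first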

Another approach would be to train a model on two of the input streams at a time (i.e. RGB+skeleton & RGB+audio) and then perhaps combine their results. But this wouldn't work due to the nature of the data. The audio input stream can only be exploited for certain actions (e.g. deepthroat due to the gag reflex, or anal due to a higher pitch), while for others it's not possible to derive any insight from the audio (e.g. missionary, doggy and cowgirl do not have any special characteristics that set them apart from an audio perspective).

Likewise, the skeleton-based model can only be used in those instances where the pose estimation is accurate above a certain confidence threshold (for these experiments the threshold used was 0.4). For example, for actions such as scoop-up or the-snake it's hard to get an accurate pose estimation in most camera angles due to the proximity of the human bodies in the frame (the poses get fuzzy and mixed up). This then influences the accuracy of the HAR model negatively. However, for actions such as doggy, cowgirl or missionary, the pose estimation is generally good enough to train a HAR model.

However, with a bigger dataset we would probably have enough clean samples of the difficult actions to train all 17 of them with a skeleton-based model. According to the current SOTA literature, skeleton-based models are superior to RGB-based ones. Ideally, of course, the pose estimation models should also be fine-tuned on the sex domain in order to get better overall pose estimation.

Metrics

Accuracy:
    Top 1 Accuracy: 0.6362
    Top 2 Accuracy: 0.7524
    Top 3 Accuracy: 0.8155
    Top 4 Accuracy: 0.8521
    Top 5 Accuracy: 0.8771

Fusion weights:
    RGB: 0.5
    Skeleton: 0.6
    Audio: 1.0

RGB model - TimeSformer

The best results for a 3D RGB model are achieved by the attention-based TimeSformer architecture. This model is also very fast at inference (~0.53 s per 7 s clip).
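For standalone experimentation, the RGB model can also be queried directly through the MMAction2 0.x inference API. A rough sketch (the config path below is hypothetical, the checkpoint name is the one loaded by the demo, and exact signatures can differ between MMAction2 versions):

from mmaction.apis import init_recognizer, inference_recognizer

config = 'configs/timesformer_config.py'        # hypothetical path; check this repo's configs
checkpoint = 'checkpoints/har/timeSformer.pth'  # checkpoint loaded by the multimodial demo

model = init_recognizer(config, checkpoint, device='cuda:0')
results = inference_recognizer(model, 'clip_7s.mp4')  # [(label_index, score), ...]
print(results[:5])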

Metrics

Accuracy:
    top1_acc: 0.5669
    top2_acc: 0.6834
    top3_acc: 0.7632
    top4_acc: 0.8096
    top5_acc: 0.8411

Training speed:
    Avg iter time: 0.3472 s/iter

Complexity:
    Flops: 100.96 GFLOPs
    Params: 121.27 M

Loss

[training loss plot]

Classes

All 17 annotations. See annotations.

Skeleton model - PoseC3D

The best results for a skeleton-based model are achieved by the CNN-based PoseC3D architecture. This model is also fast at inference (~3.3 s per 7 s clip).

Metrics

Accuracy:
    top1_acc: 0.8130
    top2_acc: 0.9191
    top3_acc: 0.9748

Training speed:
    Avg iter time: 0.8616 s/iter

Complexity:
    Flops: 17.83 GFLOPs
    Params: 2.0 M

Check the confusion matrix for a detailed overview of the performance.

Loss

[training loss plot]

Classes

6 annotations. See annotations.

Audio Model - Simple ResNet based on Audiovisual SlowFast

A simple ResNet-101 (with some small tweaks) was used. This model definitely needs to be swapped for a better architecture. It is very fast at inference (~0.05 s per 7 s audio clip).

Metrics

Accuracy:
    top1_acc: 0.6867
    top2_acc: 0.9038
    top3_acc: 0.9663

Training speed:
    Avg iter time: 0.2747 s/iter

Check the confusion matrix for a detailed overview of the performance.

Loss

[training loss plot]

Classes

4 annotations. See annotations.

Dataset

First things first, here is a list of definitions of the sex positions used in this project, in case there is any confusion. fondling, in addition to the meaning of the word, was also meant as a general placeholder, e.g. for when it is unclear which action is taking place. In practice, however, its ability to serve as a general placeholder is limited because I only got 48 minutes of data for this action.

The gathered dataset is very inclusive and consists of a variety of recordings such as POV, professionally filmed, amateur, with or without a dedicated camera person, etc. It also includes all kinds of environments, people, and camera angles. The problem would probably be much easier to solve if only professional recordings with a dedicated camera person were used, and hence this was avoided.

In general, a train/val split of 0.8/0.2 was used for all the datasets. The length of the clips in the training & validation sets is currently 7 seconds (the main motivation being to include the more ephemeral actions such as cumshot or kissing). In total, there were around 600 videos amounting to 2674 minutes of footage. Check out the annotation distribution in time (minutes) for each of the 17 classes for more information. The dataset was not perfectly annotated, but the number of wrong annotations should be small and hence the drop in performance should be minimal.
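For reference, cutting source videos into fixed-length 7-second clips can be done with moviepy (which the demo already uses for pre-processing). A simplified sketch, not the repository's actual data-preparation script:

import os
from moviepy.editor import VideoFileClip

def extract_subclips(path, out_dir='clips', subclip_len=7):
    """Split a source video into consecutive 7-second clips."""
    os.makedirs(out_dir, exist_ok=True)
    video = VideoFileClip(path)
    start, idx = 0, 0
    while start + subclip_len <= video.duration:
        clip = video.subclip(start, start + subclip_len)
        clip.write_videofile(os.path.join(out_dir, f'clip_{idx:04d}.mp4'))
        start += subclip_len
        idx += 1

extract_subclips('video.mp4')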

In general, it can be said that this is a small dataset. Normally, ~44 hours of footage would be enough for 17 actions. However, each position has a tremendous variety when it comes to camera perspectives, which makes the recognition task hard if there aren't enough samples. Ideally, this would also mean having the same amount of footage for each different perspective. However, labeling the dataset was already very time-consuming and I didn't keep track of this point.

A HAR model trained on 3D poses might be able to solve this camera-perspective problem. However, since 3D pose estimation is less accurate than 2D pose estimation, and I already noticed problems with the accuracy of the 2D poses (see here), this has not been tried (yet). Ideally, however, if the dataset is big enough, the camera-perspective problem should be solved naturally.

The dataset is also slightly imbalanced, which makes the RGB models slightly biased towards the positions (actions) that have more data.

If you'd like to help with doubling the current size of the dataset, please do open an issue.

RGB

In total there are ~17.6K training clips and ~4.9K validation clips. This plot shows the number of clips for each class. RGB is considered the core input modality, given that the audio modality only applies to four classes and that the skeleton modality is rather fickle because of the accuracy of 2D pose estimation. Various data augmentation techniques were applied, such as rescaling, cropping, flipping, color inversion, gaussian blur, elastic transformation, affine transformation, etc. This further improves the accuracy of the model.
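As an illustration of the kind of per-frame augmentation listed above (the actual pipeline is defined in this repo's MMAction2 configs, which are not reproduced here), a comparable set of transforms in torchvision could look like the following; ElasticTransform requires torchvision >= 0.13:

import torchvision.transforms as T

# illustrative only; applied per frame for simplicity
augment = T.Compose([
    T.RandomResizedCrop(224),                           # rescaling + cropping
    T.RandomHorizontalFlip(p=0.5),                      # flipping
    T.RandomInvert(p=0.1),                              # color inversion
    T.GaussianBlur(kernel_size=5),                      # gaussian blur
    T.ElasticTransform(alpha=50.0),                     # elastic transformation
    T.RandomAffine(degrees=10, translate=(0.1, 0.1)),   # affine transformation
])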

(2D) Pose

Due to the variety of positions and camera angles, which make pose estimation difficult as human bodies overlap and are too close to each other, it is only feasible to apply skeleton-based HAR to a few of the actions. The clips generated for the RGB dataset were filtered based on two criteria:

  1. The confidence of the pose information. A minimal confidence of 0.4 was chosen.

  2. The fraction of frames in a clip whose pose confidence is higher than the minimal confidence score. Here, a minimum rate of 0.4 was also used. In other words, if we have a 7s clip of 210 frames and only 70 frames have pose information with confidence higher than 0.4, then we exclude this clip from the pose dataset, because only 33% of the frames have a confidence higher than 0.4 and our minimum rate is 40%.
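Expressed as code, the filtering rule from the two criteria above looks roughly like this (a sketch of the rule only; pose_scores is assumed to hold one confidence value per frame):

def keep_for_pose_dataset(pose_scores, conf_thr=0.4, min_rate=0.4):
    """Keep a clip only if enough of its frames have confident pose estimates."""
    confident = sum(1 for s in pose_scores if s >= conf_thr)
    return confident / len(pose_scores) >= min_rate

# 70 confident frames out of 210 -> 0.33 < 0.4, so the clip is excluded
keep_for_pose_dataset([0.5] * 70 + [0.1] * 140)  # False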

As a result, the pose dataset is significantly smaller than the original RGB dataset. Whereas there are about 4.9K testing clips for the RGB dataset, the pose dataset has only 815 clips. Therefore a bigger dataset is a must here so that we are able to train the skeleton model on all 17 actions.

Audio

As a preliminary pre-processing step, audio tracks that are not loud enough were first pruned from the dataset. The best results were achieved by pruning the bottom 20% of the quietest audios.
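A simple way to realize this pruning, assuming RMS energy as the loudness proxy (the exact criterion used for the dataset may differ):

import numpy as np
import librosa

def rms_loudness(path):
    """Mean RMS energy of an audio file as a crude loudness measure."""
    samples, rate = librosa.load(path, sr=None)
    return float(np.mean(librosa.feature.rms(y=samples)))

def prune_quietest(audio_paths, drop_fraction=0.2):
    """Drop the quietest 20% of the audio clips."""
    ranked = sorted(audio_paths, key=rms_loudness)
    cutoff = int(len(ranked) * drop_fraction)
    return ranked[cutoff:]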

In total there are about 5.9K training clips & 1.5K validation clips.

Script Docs

Multimodial Demo

python src/demo/multimodial_demo.py ${VIDEO_FILE} ${OUTPUT_FILE} \
    [--det-config ${HUMAN_DETECTION_CONFIG_FILE}] \
    [--det-checkpoint ${HUMAN_DETECTION_CHECKPOINT}] \
    [--pose-config ${HUMAN_POSE_ESTIMATION_CONFIG_FILE}] \
    [--pose-checkpoint ${HUMAN_POSE_ESTIMATION_CHECKPOINT}] \
    [--skeleton-config ${SKELETON_BASED_ACTION_RECOGNITION_CONFIG_FILE}] \
    [--skeleton-checkpoint ${SKELETON_BASED_ACTION_RECOGNITION_CHECKPOINT}] \
    [--rgb-config ${RGB_BASED_ACTION_RECOGNITION_CONFIG_FILE}] \
    [--rgb-checkpoint ${RGB_BASED_ACTION_RECOGNITION_CHECKPOINT}] \
    [--audio-config ${AUDIO_BASED_ACTION_RECOGNITION_CONFIG_FILE}] \
    [--audio-checkpoint ${AUDIO_BASED_ACTION_RECOGNITION_CHECKPOINT}] \
    [--det-score-thr ${HUMAN_DETECTION_SCORE_THRE}] \
    [--label-maps ${LIST_OF_ACTION_ANNOTATION_FILES}] \
    [--num-processes ${NUM_PROC_USED_FOR_SUBCLIP_EXTRACTION}] \
    [--subclip-len ${PREDICTION_WINDOW}] \
    [--device ${DEVICE}] \
    [--coefficients ${COEFFICIENT_WEIGHTS}] \
    [--pose-score-thr ${POSE_ESTIMATION_SCORE_THRESHOLD}] \
    [--correct-rate ${RATE_OF_CORRECT_FRAMES_FOR_SKELETON_RECOGNITION}] \
    [--loudness-weights ${LOUDNESS_THRESHOLD_FOR_AUDIOS}] \
    [--topk ${TOP_K_ACCURACY}] \
    [--timestamps] \
    [--verbose]

Late Fusion

python src/top_tags.py ${JSON_FILE} \
    [--topk ${TOP_K_ACCURACY}] \
    [--label-map ${ANNOTATION_FILE}]

phar's People

Contributors

rlleshi

phar's Issues

Installation problem

Hi, I was interested in this project and wanted to try it out on my Windows machine. I am new to Python; I followed your manual installation tutorial and reached an endless chaos of issues. I think the README is either outdated or wrong.

Clone this repo and its submodules: git clone --recurse-submodules git@github.com:rlleshi/phar.git and then create an environment with Python 3.8+.

Install torch (of course, it is recommended that you have CUDA & CUDNN installed).

I cloned your project.
I installed Python (3.9.13).
I installed CUDA and cuDNN from NVIDIA.
I installed PyTorch (pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121).

Install the correct version of mmcv based on your CUDA & Torch, e.g. pip install mmcv-full==1.3.18 -f https://download.openmmlab.com/mmcv/dist/cu113/torch1.10.0/index.html

I navigated to the project's main directory (where src etc. are located).
I went to the link below and installed mmcv, mmdet, mmaction, etc.:
https://mmaction2.readthedocs.io/en/latest/get_started/installation.html#best-practices

Install extra dependencies: pip install -r requirements/extra.txt.

done

Now, when I run the project like below:
python src/demo/multimodial_demo.py video.mp4 demo.json --timestamps

I am getting errors.

There is a requirements.txt inside the requirements folder. Am I supposed to install that as well? The documentation only mentions extra.txt.

These are the errors I get; do you know why this happens?

Traceback (most recent call last):
  File "src/demo/multimodial_demo.py", line 20, in <module>
    from demo_skeleton import frame_extraction
  File "C:\zprojects\test\phar\src\demo\demo_skeleton.py", line 11, in <module>
    from mmcv import DictAction
ImportError: cannot import name 'DictAction' from 'mmcv' (C:\Users\admin\.conda\envs\openmmlab\lib\site-packages\mmcv\__init__.py)

Collaboration?

I saw a repo related to text2video and thought about customized porn generation. I made a plan consisting of 3 stages:
1. Make classifiers for ethnicity, age, actor attributes, position, and scenario to label videos with timestamps.
The data creation method I used in stage one was to download images from Google, Yandex, and Bing and label them with the search keyword. For example, I searched for cowgirl porn, downloaded the images, and labeled them accordingly (2000 images per class in total, without augmentation).
I am stuck at stage one because all of my models overfit no matter what I tried. Here is my email [email protected] for details.

Issues trying to get the demo working

I've been following the steps to install and spent several hours today trying to get the demo working but I'm getting errors. I've followed the steps in the instructions exactly. When doing a manual install, I get this error:
No CUDA runtime is found, using CUDA_HOME='C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.3\bin'
Traceback (most recent call last):
  File "c:\users\tyler\source\repos\phar\mmaction2\demo\demo_skeleton.py", line 16, in <module>
    from mmdet.apis import inference_detector, init_detector
  File "C:\Users\tyler\source\repos\venv\lib\site-packages\mmdet\apis\__init__.py", line 2, in <module>
    from .inference import (async_inference_detector, inference_detector,
  File "C:\Users\tyler\source\repos\venv\lib\site-packages\mmdet\apis\inference.py", line 8, in <module>
    from mmcv.ops import RoIPool
  File "C:\Users\tyler\source\repos\venv\lib\site-packages\mmcv\ops\__init__.py", line 2, in <module>
    from .active_rotated_filter import active_rotated_filter
  File "C:\Users\tyler\source\repos\venv\lib\site-packages\mmcv\ops\active_rotated_filter.py", line 8, in <module>
    ext_module = ext_loader.load_ext(
  File "C:\Users\tyler\source\repos\venv\lib\site-packages\mmcv\utils\ext_loader.py", line 13, in load_ext
    ext = importlib.import_module('mmcv.' + name)
  File "C:\Users\tyler\AppData\Local\Programs\Python\Python38\lib\importlib\__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
ImportError: DLL load failed while importing _ext: The specified module could not be found.

When trying to use Docker, I get an error about not having an NVIDIA driver (even though I do, with CUDA and cuDNN set up).

Any interest in collab?

I've been looking into this topic for a while, as well as similar topics: comprehensive automated tagging, personalized automated rating, and automated video scripting for interactive content. I'd be interested in chatting and seeing if we have any insights that might be useful to one another. My Discord is Skier23#9916.

Welcome update to OpenMMLab 2.0


I am Vansin, the technical operator of OpenMMLab. In September of last year, we announced the release of OpenMMLab 2.0 at the World Artificial Intelligence Conference in Shanghai. We invite you to upgrade your algorithm library to OpenMMLab 2.0 using MMEngine, which can be used for both research and commercial purposes. If you have any questions, please feel free to join us on the OpenMMLab Discord at https://discord.gg/amFNsyUBvm or add me on WeChat (van-sin) and I will invite you to the OpenMMLab WeChat group.

Here are the OpenMMLab 2.0 repos branches:

Project            OpenMMLab 1.0 branch    OpenMMLab 2.0 branch
MMEngine           -                       0.x
MMCV               1.x                     2.x
MMDetection        0.x, 1.x, 2.x           3.x
MMAction2          0.x                     1.x
MMClassification   0.x                     1.x
MMSegmentation     0.x                     1.x
MMDetection3D      0.x                     1.x
MMEditing          0.x                     1.x
MMPose             0.x                     1.x
MMDeploy           0.x                     1.x
MMTracking         0.x                     1.x
MMOCR              0.x                     1.x
MMRazor            0.x                     1.x
MMSelfSup          0.x                     1.x
MMRotate           1.x                     1.x
MMYOLO             -                       0.x

Attention: please create a new virtual environment for OpenMMLab 2.0.

Installation problem

I checked the other installation issues and this appears to be different. I installed the phar package dependencies locally and tried to run the demo, but I get this error:

Traceback (most recent call last):
  File "/home/user/projects/test/scripts/phar/src/demo/multimodial_demo.py", line 20, in <module>
    from demo.demo_skeleton import frame_extraction
ModuleNotFoundError: No module named 'demo.demo_skeleton'

I am not an expert with Python, but it appears that I cannot install the source code as a module without a setup.py file.

Docker won't run

Hi,

I can't get the Docker image to run; it gives me the following error:

docker run rlleshi/phar python src/demo/multimodial_demo.py video.mp4 demo.json --timestamps

Traceback (most recent call last):
  File "__init__.cython-30.pxd", line 984, in numpy.import_array
RuntimeError: module compiled against API version 0x10 but this version of numpy is 0xf . Check the section C-API incompatibility at the Troubleshooting ImportError section at https://numpy.org/devdocs/user/troubleshooting-importerror.html#c-api-incompatibility for indications on how to solve this problem .

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/workspace/phar/mmaction2/demo/demo_skeleton.py", line 23, in <module>
    from mmpose.apis import (inference_top_down_pose_model, init_pose_model,
  File "/workspace/phar/mmpose/mmpose/apis/__init__.py", line 2, in <module>
    from .inference import (inference_bottom_up_pose_model,
  File "/workspace/phar/mmpose/mmpose/apis/inference.py", line 14, in <module>
    from mmpose.datasets.dataset_info import DatasetInfo
  File "/workspace/phar/mmpose/mmpose/datasets/__init__.py", line 7, in <module>
    from .datasets import (  # isort:skip
  File "/workspace/phar/mmpose/mmpose/datasets/datasets/__init__.py", line 2, in <module>
    from ...deprecated import (TopDownFreiHandDataset, TopDownOneHand10KDataset,
  File "/workspace/phar/mmpose/mmpose/deprecated.py", line 5, in <module>
    from .datasets.datasets.base import Kpt2dSviewRgbImgTopDownDataset
  File "/workspace/phar/mmpose/mmpose/datasets/datasets/base/__init__.py", line 2, in <module>
    from .kpt_2d_sview_rgb_img_bottom_up_dataset import \
  File "/workspace/phar/mmpose/mmpose/datasets/datasets/base/kpt_2d_sview_rgb_img_bottom_up_dataset.py", line 8, in <module>
    from xtcocotools.coco import COCO
  File "/opt/conda/lib/python3.8/site-packages/xtcocotools/coco.py", line 58, in <module>
    from . import mask as maskUtils
  File "/opt/conda/lib/python3.8/site-packages/xtcocotools/mask.py", line 3, in <module>
    import xtcocotools._mask as _mask
  File "xtcocotools/_mask.pyx", line 23, in init xtcocotools._mask
  File "__init__.cython-30.pxd", line 986, in numpy.import_array
ImportError: numpy.core.multiarray failed to import

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "src/demo/multimodial_demo.py", line 20, in <module>
    from demo.demo_skeleton import frame_extraction
  File "/workspace/phar/mmaction2/demo/demo_skeleton.py", line 26, in <module>
    raise ImportError('Failed to import `inference_top_down_pose_model`, '
ImportError: Failed to import `inference_top_down_pose_model`, `init_pose_model`, and `vis_pose_result` form `mmpose.apis`. These apis are required in this demo! 

When I compile from source, I get almost the same error:

/ml/phar/mmcv/mmcv/cnn/bricks/transformer.py:28: UserWarning: Fail to import ``MultiScaleDeformableAttention`` from ``mmcv.ops.multi_scale_deform_attn``, You should install ``mmcv-full`` if you need this module. 
  warnings.warn('Fail to import ``MultiScaleDeformableAttention`` from '
Traceback (most recent call last):
  File "/ml/phar/src/demo/demo_skeleton.py", line 15, in <module>
    from mmdet.apis import inference_detector, init_detector
  File "/ml/phar/mmdet/mmdet/apis/__init__.py", line 1, in <module>
    from .inference import (async_inference_detector, inference_detector,
  File "/ml/phar/mmdet/mmdet/apis/inference.py", line 6, in <module>
    from mmcv.ops import RoIPool
  File "/ml/phar/mmcv/mmcv/ops/__init__.py", line 2, in <module>
    from .assign_score_withk import assign_score_withk
  File "/ml/phar/mmcv/mmcv/ops/assign_score_withk.py", line 5, in <module>
    ext_module = ext_loader.load_ext(
  File "/ml/phar/mmcv/mmcv/utils/ext_loader.py", line 13, in load_ext
    ext = importlib.import_module('mmcv.' + name)
  File "/home/firebug/miniconda3/envs/nsfw2/lib/python3.9/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
ModuleNotFoundError: No module named 'mmcv._ext'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/ml/phar/src/demo/demo_skeleton.py", line 17, in <module>
    raise ImportError('Failed to import `inference_detector` and '
ImportError: Failed to import `inference_detector` and `init_detector` form `mmdet.apis`. These apis are required in this demo! 

When compiling from source, I made sure I had the right versions of the libraries by compiling from the exact git checkpoints referenced by your code:

mmaction2 0.23.0 /ml/phar/mmaction2
mmcv 1.3.18 /ml/phar/mmcv
mmdet 2.12.0 /ml/phar/mmdet
mmpose 0.22.0 /ml/phar/mmpose

Any ideas? Thanks.

Issue with audio inference

I'm trying to test your model, but I ran into an issue with the audio inference; maybe you have some idea of what could be wrong?

docker run --gpus all rlleshi/phar python src/demo/multimodial_demo.py /mnt/videos/tr_87505_hd.mp4 /mnt/videos/demo.mp4
Resizing video for faster inference...
Moviepy - Building video temp/tr_87505_hd.mp4.
MoviePy - Writing audio in tr_87505_hdTEMP_MPY_wvf_snd.mp3
MoviePy - Done.
Moviepy - Writing video temp/tr_87505_hd.mp4

Moviepy - Done !
Moviepy - video ready temp/tr_87505_hd.mp4
load checkpoint from local path: checkpoints/har/timeSformer.pth
Performing RGB inference...
100%|██████████| 12/12 [00:17<00:00, 1.44s/it]
load checkpoint from local path: checkpoints/har/audio.pth
Performing audio inference...

  8%|▊         | 1/12 [00:14<02:43, 14.88s/it]
Traceback (most recent call last):
  File "src/demo/multimodial_demo.py", line 601, in <module>
    main()
  File "src/demo/multimodial_demo.py", line 562, in main
    audio_inference(clip, args.coefficients)
  File "src/demo/multimodial_demo.py", line 385, in audio_inference
    results = inference_recognizer(AUDIO_MODEL, out_feature)
  File "/workspace/phar/mmaction2/mmaction/apis/inference.py", line 99, in inference_recognizer
    raise RuntimeError('The type of argument video is not supported: '
RuntimeError: The type of argument video is not supported: <class 'str'>

Image processing

I feel that image processing would be a relatively easy proposition with your library, but before I attempt my own solution, I wanted to ask if you have (or can prepare) a demo that works on single images.
Thanks.

can't seem to get it to run

At first, I got "module demo not found". I removed the period in the "demo.demo_skeleton import frame_extraction" line, since multimodial_demo.py is in the same directory as demo_skeleton.py and I thought it was a path problem. Afterward, I get this:

An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:

        if __name__ == '__main__':
            freeze_support()
            ... 

I'm using Windows 10 and have been trying for a few days; can you help a noob out?

Issue with audio in inference-recognizer

Hello, I tried to test the app, but I got the same problem as the issue here:
#1

Stack Trace:
File "src/demo/multimodial_demo.py", line 601, in
main()
File "src/demo/multimodial_demo.py", line 562, in main
audio_inference(clip, args.coefficients)
File "src/demo/multimodial_demo.py", line 385, in audio_inference
results = inference_recognizer(AUDIO_MODEL, out_feature)
File "/workspace/phar/mmaction2/mmaction/apis/inference.py", line 99, in inference_recognizer
raise RuntimeError('The type of argument video is not supported: '
RuntimeError: The type of argument video is not supported: <class 'str'>

I checked that the code is up to date, and it is; lines 381 onwards read:

subprocess.run(
    ['python', AUDIO_FEATURE_SCRIPT, TEMP, TEMP, '--ext', 'wav'],
    capture_output=True)

Any ideas on what I'm doing wrong?
