Human identification using multi-modal models, trained on audio-image pairs and on each modality separately.
This project uses raw data from VoxCeleb, a large-scale audio-visual dataset of human speech. The following are the download links to the raw data used:
Data selection is done locally, using only the Windows file manager.
- A subset of the image data is created to match the available audio data and is named subSetFaces.
All pre-processing of the data is done in models.ipynb.
- Loading, cropping and saving image data from subSetFaces as a NumPy array of dtype np.uint8, with 100 images per person for 40 persons, and creating the labels.
- Loading and trimming audio data to equal length, with 100 audio files per person for 40 persons.
- Converting the equal-length audio files into spectrograms with a log frequency axis, and storing them as a NumPy array of dtype np.uint8.
- Mapping images and audio to audio-image paired data.
- Both images and audio are rescaled to the range 0 to 1 inside the models by a Rescaling layer. This saves memory, because np.uint8 uses 1 byte per value while float64 uses 8 bytes per value.
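The pre-processing steps above can be sketched with NumPy alone. This is a minimal illustration, not the code from models.ipynb: the helper names `trim_or_pad`, `to_uint8` and `pair_by_label`, the zero-padding of short clips, the min-max scaling, and the array shapes are all assumptions made for the example.

```python
import numpy as np


def trim_or_pad(signal, target_len):
    """Cut a 1-D signal to target_len samples; zero-pad if it is shorter (assumed policy)."""
    signal = np.asarray(signal)[:target_len]
    pad = target_len - signal.shape[0]
    return np.pad(signal, (0, pad)) if pad > 0 else signal


def to_uint8(array):
    """Min-max scale an array to 0..255 and store it as np.uint8."""
    array = np.asarray(array, dtype=np.float64)
    lo, hi = array.min(), array.max()
    if hi == lo:
        # Constant input: nothing to scale, return zeros of the same shape.
        return np.zeros(array.shape, dtype=np.uint8)
    return np.round((array - lo) / (hi - lo) * 255).astype(np.uint8)


def pair_by_label(image_labels, audio_labels):
    """Pair image and audio indices that share a person label, in order of appearance."""
    image_labels = np.asarray(image_labels)
    audio_labels = np.asarray(audio_labels)
    pairs = []
    for person in np.unique(image_labels):
        img_idx = np.flatnonzero(image_labels == person)
        aud_idx = np.flatnonzero(audio_labels == person)
        pairs.extend(zip(img_idx.tolist(), aud_idx.tolist()))
    return pairs


# Memory comparison for a hypothetical batch of 100 images of size 64x64x3.
stored = np.zeros((100, 64, 64, 3), dtype=np.uint8)
as_float = stored.astype(np.float64)
print(stored.nbytes, as_float.nbytes)  # float64 needs 8x the bytes of uint8
```

The division by 255 that maps the stored values back to the 0 to 1 range is left to the Rescaling layer inside each model, so the arrays held in memory stay in np.uint8.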
- CNN for Image Classification: This model consists of one preprocessing Rescaling layer, two Convolution2D layers with L2 regularization, two MaxPooling2D layers, one Flatten layer, two Dense layers and one Dropout layer.
- CNN for Audio Classification: This model consists of one preprocessing Rescaling layer, two Convolution2D layers with L2 regularization, two MaxPooling2D layers, one Flatten layer and two Dense layers.
- CRNN for Audio Classification: This model consists of one preprocessing Rescaling layer, two Convolution1D layers with L2 regularization, two MaxPooling1D layers, one LSTM layer and two Dense layers.
- CNN-CNN Parallel Multi-Modal Model: This model consists of the 1st and 2nd models in parallel, followed by a fusion model with two Dense layers and one Dropout layer.
- CNN-CRNN Parallel Multi-Modal Model: This model consists of the 1st and 3rd models in parallel, followed by a fusion model with two Dense layers and one Dropout layer.
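As a rough illustration of the CNN-CRNN parallel layout described above, the following Keras sketch builds an image branch and a spectrogram branch and fuses them. It assumes TensorFlow with Keras is installed. The filter counts, kernel sizes, unit counts, dropout rate, L2 factor, input shapes, the choice to fuse by concatenating one Dense feature vector per branch, and the names `image_branch`, `crnn_branch` and `build_cnn_crnn` are all assumptions for this example and are not taken from models.ipynb.

```python
from tensorflow import keras
from tensorflow.keras import layers, regularizers


def image_branch(inputs, l2=1e-4):
    """CNN branch: Rescaling, two Conv2D + MaxPooling2D stages, Flatten, Dense."""
    reg = regularizers.l2(l2)
    x = layers.Rescaling(1.0 / 255)(inputs)  # uint8 range 0..255 -> 0..1
    x = layers.Conv2D(16, 3, activation="relu", kernel_regularizer=reg)(x)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(32, 3, activation="relu", kernel_regularizer=reg)(x)
    x = layers.MaxPooling2D()(x)
    x = layers.Flatten()(x)
    return layers.Dense(64, activation="relu")(x)


def crnn_branch(inputs, l2=1e-4):
    """CRNN branch: Rescaling, two Conv1D + MaxPooling1D stages, LSTM, Dense."""
    reg = regularizers.l2(l2)
    x = layers.Rescaling(1.0 / 255)(inputs)
    x = layers.Conv1D(32, 3, activation="relu", kernel_regularizer=reg)(x)
    x = layers.MaxPooling1D()(x)
    x = layers.Conv1D(64, 3, activation="relu", kernel_regularizer=reg)(x)
    x = layers.MaxPooling1D()(x)
    x = layers.LSTM(64)(x)
    return layers.Dense(64, activation="relu")(x)


def build_cnn_crnn(image_shape=(64, 64, 3), audio_shape=(128, 64), num_classes=40):
    """Two parallel branches fused by two Dense layers and one Dropout layer."""
    image_in = keras.Input(shape=image_shape, name="image")
    audio_in = keras.Input(shape=audio_shape, name="spectrogram")
    merged = layers.Concatenate()([image_branch(image_in), crnn_branch(audio_in)])
    x = layers.Dense(128, activation="relu")(merged)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return keras.Model(inputs=[image_in, audio_in], outputs=outputs)


model = build_cnn_crnn()
model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
```

A model built this way is trained on paired inputs, for example `model.fit([images, spectrograms], labels)`, where the i-th image and the i-th spectrogram belong to the same person. Swapping `crnn_branch` for a second 2D CNN branch would give the CNN-CNN variant.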
- Python 3
- Matplotlib
- PIL
- numpy
- librosa
Model | Training Accuracy | Validation Accuracy |
---|---|---|
CNN Image Classification | 100.0% | 84.50% |
CNN Audio Classification | 100.0% | 79% |
CRNN Audio Classification | 79.75% | 71.50% |
CNN-CNN Multi Modal Model | 99.86% | 73.25% |
CNN-CRNN Multi Modal Model | 99.92% | 80.00% |
Run the models.ipynb file.
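Among the multi-modal models, the best result was achieved with the CNN-CRNN parallel model (80.00% validation accuracy). The image-only CNN reached the highest validation accuracy overall (84.50%).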