Humans can sense the emotional state of a communication partner through all of their available senses. Detecting emotion is natural for humans but very difficult for computers: although they can easily process content-based information, accessing the meaning behind that content is hard, and that is what speech emotion recognition (SER) sets out to do. SER is a system through which audio speech files are classified by a computer into emotions such as happy, sad, angry, and neutral. It can be used in areas such as medicine or customer call centers. With this project I hope to explore turning this model into an app that individuals with ASD, who may have difficulty reading others' emotions, can use when speaking to others to help guide conversation and to build and maintain healthy relationships.
Importance
Nowadays, more and more intelligent systems use emotion recognition models to improve their interaction with humans. This matters because such systems can adapt their responses and behavior to the user's emotions, making the interaction more natural.
In this project we detect a person's emotions from their voice alone, which opens up many AI-related applications. Detecting emotions is also one of the most important marketing strategies today: it allows products and content to be personalized to an individual's interests.
Aim
We address this problem by applying deep learning algorithms to audio/speech data in order to identify the emotion expressed in speech.
Data Summary
We have built a deep learning model that detects emotion in speech. For this purpose we used the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS, 2018). The dataset is described as follows:
- RAVDESS contains 1440 speech files (60 trials per actor x 24 actors) recorded by 24 professional actors (12 female, 12 male).
- The speech recordings cover 8 emotions: neutral, calm, happy, sad, angry, fearful, surprised, and disgusted.
- Filename identifiers (decoded in the short sketch after this list):
Modality (01 = full-AV, 02 = video-only, 03 = audio-only).
Vocal channel (01 = speech, 02 = song).
Emotion (01 = neutral, 02 = calm, 03 = happy, 04 = sad, 05 = angry, 06 = fearful, 07 = disgust, 08 = surprised).
Example filename: 03-01-06-01-02-01-12.wav = Audio-only (03), Speech (01), Fearful (06).
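The filename convention above can be decoded with a few lines of Python. The helper below is a small illustrative sketch (not code from the project) that extracts only the three identifiers listed above:

```python
# Decode the first three RAVDESS filename identifiers (modality, vocal
# channel, emotion). Emotion codes follow the convention listed above.
import os

EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}

def parse_ravdess_filename(path):
    """Return (modality, vocal_channel, emotion) for a RAVDESS file."""
    parts = os.path.basename(path).split(".")[0].split("-")
    return parts[0], parts[1], EMOTIONS[parts[2]]

print(parse_ravdess_filename("03-01-06-01-02-01-12.wav"))
# ('03', '01', 'fearful')
```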
Data Preprocessing
- Extracting features using Librosa, a Python package for music and audio analysis (see the sketch after this list).
- Converting audio files into waveforms and spectrograms
- Creating a data frame of the extracted features for all audio files.
- Data augmentation (noise addition, pitch shifting, time shifting).
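The project write-up does not include its preprocessing code, so the following is a minimal sketch of these steps using Librosa. The feature choice (time-averaged MFCCs), the augmentation functions, and all parameter values are illustrative assumptions rather than the project's exact pipeline:

```python
# Sketch of feature extraction and augmentation with Librosa.
import librosa
import numpy as np
import pandas as pd

def extract_features(path, n_mfcc=40):
    """Load an audio file and return a fixed-length MFCC feature vector."""
    y, sr = librosa.load(path)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return np.mean(mfcc, axis=1)          # average each coefficient over time

def add_noise(y, noise_factor=0.005):
    """Noise-injection augmentation: add low-level Gaussian noise."""
    return y + noise_factor * np.random.randn(len(y))

def pitch_shift(y, sr, n_steps=2):
    """Pitch-shifting augmentation."""
    return librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)

def time_shift(y, max_shift=1600):
    """Time-shifting augmentation: roll the waveform by a random offset."""
    return np.roll(y, np.random.randint(-max_shift, max_shift))

# Build a data frame of features for a (placeholder) list of file paths.
files = ["03-01-06-01-02-01-12.wav"]
features_df = pd.DataFrame([extract_features(f) for f in files])
```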
How many files are there for each emotion class in the audio dataset?
After various EDA and preprocessing steps, we first converted the audio files to numeric feature arrays and tried a convolutional model on them. We then converted the audio files to mel spectrogram images and applied different CNN and transfer learning models.
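As a rough illustration of the spectrogram-as-image step, the sketch below converts one clip into a mel spectrogram image using Librosa and matplotlib. The file path, figure size, and n_mels value are assumptions, not the project's settings:

```python
# Convert an audio clip into a mel spectrogram image for CNN input.
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

y, sr = librosa.load("03-01-06-01-02-01-12.wav")       # placeholder path
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
mel_db = librosa.power_to_db(mel, ref=np.max)          # power -> decibels

plt.figure(figsize=(3, 3))
librosa.display.specshow(mel_db, sr=sr)                # no axes: image only
plt.axis("off")
plt.savefig("mel_spectrogram.png", bbox_inches="tight", pad_inches=0)
plt.close()
```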
Model Implementation:
Best Model:
The best model, VGG19 with fine-tuning and data augmentation, reached 82% accuracy, the highest among all the models we trained.
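The write-up does not show the model code, so the following is a hedged sketch of what a VGG19 fine-tuning setup for the 8 RAVDESS emotion classes could look like, assuming Keras/TensorFlow. The number of unfrozen layers, the classification head, and the optimizer are illustrative choices, not the exact configuration behind the reported 82%:

```python
# Illustrative VGG19 transfer-learning setup for 8 emotion classes.
from tensorflow.keras.applications import VGG19
from tensorflow.keras import layers, models

base = VGG19(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = True
for layer in base.layers[:-4]:              # fine-tune only the last few layers
    layer.trainable = False

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(8, activation="softmax"),  # one output per RAVDESS emotion
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```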
Implementation video:
Challenges
• As this is not a stand-alone data science project, it required some audio domain knowledge: understanding MFCCs and the mel scale is very important for feature selection.
• One limitation is that we did not use feature selection to reduce the dimensionality of the inputs to our augmented CNN, which might have improved learning performance.
• During deployment on Heroku and Azure there were problems because some additional libraries needed to be installed; after installing them, some of the previous requirements had to be downgraded to remain compatible, which is why we deployed on GCP instead (a minimal serving sketch follows below).
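The serving code is not shown in this write-up; as one possible shape for the deployed app, here is a minimal Flask endpoint that loads a saved Keras model and returns a predicted emotion. The model filename, route, and expected input format are hypothetical:

```python
# Minimal prediction endpoint; assumes the client sends a preprocessed
# mel-spectrogram array and that the model was saved as an .h5 file.
import numpy as np
from flask import Flask, request, jsonify
from tensorflow.keras.models import load_model

app = Flask(__name__)
model = load_model("best_vgg19_model.h5")   # hypothetical saved model

EMOTIONS = ["neutral", "calm", "happy", "sad",
            "angry", "fearful", "disgust", "surprised"]

@app.route("/predict", methods=["POST"])
def predict():
    features = np.array(request.get_json()["features"], dtype="float32")
    probs = model.predict(features[np.newaxis, ...])[0]
    return jsonify({"emotion": EMOTIONS[int(np.argmax(probs))]})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```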
Conclusion
• VGG19 (fine-tuning + augmentation) gave the best accuracy score of 82% and mitigated problems such as overfitting to some extent.
• It is difficult to reach more than 90% accuracy because of the limited amount of data.
• To address the overfitting we saw in almost every model, we need more real-world data.
• Adding noise, pitch shifting, and time shifting to handle the imbalanced data helped achieve better results.
• The computational cost was quite high, resulting in several runtime crashes, but we were able to obtain our best model for deployment.