1. Prerequisite

Speech processing can be divided into two broad categories:
1. Speech Recognition: detecting the content of the speech audio (what was said)
2. Speaker Recognition: identifying the speakers in a conversation (who said it)

Speaker diarisation falls into the second category. It is the task of identifying the start and end time of each speaker's turns in an audio file.

To downsample a signal, you resample it at a lower rate, which means interpolating between the original samples. The audio should be downsampled and converted before running the steps that follow.

Simplified diagram of a speaker diarisation system

The system has three modules:
1. The voice activity detection (VAD) module
2. The feature extraction module
3. Clustering and segmentation framing

a) Segmentation (using VAD): We compute short-term energy with librosa.feature.rms. The VAD module is a hybrid of an energy-based detector and a model-based decoder. In the first step, the energy-based detector finds all low-energy segments, subject to a minimum segment duration. The energy threshold is set automatically so that enough non-speech segments are obtained.

b) Feature vector extraction and GMM training: Using librosa.feature.mfcc, we compute the Mel Frequency Cepstral Coefficients (MFCCs) together with their first and second derivatives, and stack them into a single feature vector per frame.

GMM Training:

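A Gaussian mixture model (GMM) is fitted with an expectation–maximization (EM) approach, which qualitatively does the following:
1. Choose starting guesses for the location and shape of each cluster
2. Repeat until converged:
   E-step: for each point, compute weights encoding its probability of membership in each cluster
   M-step: for each cluster, update its location, normalisation, and shape from all data points, using those weights

The outcome is that each cluster is associated not with a hard-edged sphere but with a smooth Gaussian model.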
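To make the "smooth" assignment concrete, here is a small sketch using scikit-learn's `GaussianMixture` on synthetic 2-D data (the data and the two-component setting are made up for illustration). `predict_proba` returns, for each point, a probability of belonging to each component rather than a single hard label.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Two synthetic blobs standing in for feature vectors from two speakers.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[-4.0, 0.0], scale=1.0, size=(200, 2)),
    rng.normal(loc=[4.0, 0.0], scale=1.0, size=(200, 2)),
])

# fit() runs EM: alternating E-steps (membership weights) and M-steps (parameter updates).
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)

probs = gmm.predict_proba(X)  # soft assignment: one probability per component
hard = gmm.predict(X)         # hard assignment: index of the most likely component
print(probs[:3].round(3))
```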

The next step is to find the optimal number of GMM components for the given audio file.

The optimal number of clusters is the value that minimises the AIC or the BIC, depending on which approximation we want to use. For this audio file, both AIC and BIC indicate that 20 or more components is the better choice.
Note that this choice measures how well the GMM works as a density estimator, not how well it works as a clustering algorithm.
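A sketch of that model-selection loop, using scikit-learn's `aic` and `bic` methods on synthetic data (three made-up blobs and a candidate range of 1 to 6, chosen only to keep the example fast; with real MFCC features you would scan a much wider range):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic data with three well-separated groups.
rng = np.random.default_rng(0)
centers = [(-6.0, 0.0), (0.0, 6.0), (6.0, 0.0)]
X = np.vstack([rng.normal(loc=c, scale=1.0, size=(300, 2)) for c in centers])

candidates = list(range(1, 7))
aic, bic = [], []
for n in candidates:
    gmm = GaussianMixture(n_components=n, covariance_type="full",
                          n_init=3, random_state=0).fit(X)
    aic.append(gmm.aic(X))  # lower is better
    bic.append(gmm.bic(X))  # lower is better; penalises extra components more heavily

best_aic = candidates[int(np.argmin(aic))]
best_bic = candidates[int(np.argmin(bic))]
print("AIC picks", best_aic, "components; BIC picks", best_bic)
```

Plotting `aic` and `bic` against `candidates` gives the kind of curve the choice above is read from.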

The next step is to write the function for GMM training. It takes the following parameters:

wavFile: full path of the audio file
frameRate: number of frames per second; I used 50
segLen: length of each segment in seconds; I used 3
vad: voice activity decisions at frame level
numMix: number of mixtures in the mixture model; I set numMix to 128 by looking at the bitrate per second

The GMM function returns an array of predicted probabilities, one row per segment.
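A rough sketch of such a function is shown below. It is not the original code: the name `segment_posteriors` is hypothetical, it takes a precomputed feature array instead of a wav path so that it runs without an audio file, and the diagonal covariance is my assumption.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def segment_posteriors(features, vad, frame_rate=50, seg_len=3, num_mix=128):
    """Fit a GMM on speech frames and return mean component posteriors per segment.

    features: (n_frames, n_dims) array, e.g. MFCCs with deltas
    vad:      boolean array with one decision per frame
    """
    vad = np.asarray(vad, dtype=bool)
    n = min(len(features), len(vad))   # guard against off-by-one frame counts
    speech = features[:n][vad[:n]]     # keep only frames marked as speech
    gmm = GaussianMixture(n_components=num_mix, covariance_type="diag",
                          random_state=0).fit(speech)
    seg_size = int(frame_rate * seg_len)  # frames per segment
    rows = []
    for start in range(0, len(speech), seg_size):
        seg = speech[start:start + seg_size]
        # Average the per-frame posteriors to get one vector per segment.
        rows.append(gmm.predict_proba(seg).mean(axis=0))
    return np.asarray(rows)


# Random features as a stand-in for real MFCCs: 600 frames = 12 s at 50 frames/s.
rng = np.random.default_rng(0)
features = rng.standard_normal((600, 20))
vad = np.ones(600, dtype=bool)

seg_probs = segment_posteriors(features, vad, num_mix=4)
print(seg_probs.shape)
```

Each row sums to 1 and acts as a fixed-length description of a segment, which is what the clustering step then groups by speaker.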

c) Clustering and segmentation framing:

Segmentation: After clustering, we have a speaker hypothesis for each segment. The next step is to convert these segment-level labels to frame level.
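The segment-to-frame conversion can be sketched as below, assuming the clustering step has already produced one integer label per segment. The function name `segments_to_frames` is hypothetical, and frames beyond the last full segment simply inherit the last label.

```python
import numpy as np


def segments_to_frames(seg_labels, seg_len, frame_rate, num_frames):
    """Expand one label per segment into one label per frame."""
    seg_labels = np.asarray(seg_labels)
    frames_per_seg = int(seg_len * frame_rate)
    frame_labels = np.repeat(seg_labels, frames_per_seg)
    if len(frame_labels) < num_frames:
        # Fill any trailing frames with the label of the last segment.
        pad = np.full(num_frames - len(frame_labels), seg_labels[-1])
        frame_labels = np.concatenate([frame_labels, pad])
    return frame_labels[:num_frames]


# Three 3-second segments at 50 frames/s, stretched over 400 frames.
frame_labels = segments_to_frames([0, 1, 1], seg_len=3, frame_rate=50, num_frames=400)
print(len(frame_labels), frame_labels[:3], frame_labels[-3:])
```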

Final step: Thank you for reading this far! We now label each speaker turn with its start and end time and display the diarisation result as a pandas DataFrame.
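A minimal sketch of this last step, assuming frame-level labels and a known frame rate (the function name `frames_to_turns` and the column names are my own choices):

```python
import numpy as np
import pandas as pd


def frames_to_turns(frame_labels, frame_rate):
    """Collapse frame-level labels into rows of (start, end, speaker)."""
    frame_labels = np.asarray(frame_labels)
    if len(frame_labels) == 0:
        return pd.DataFrame(columns=["start_s", "end_s", "speaker"])
    # Indices where the label differs from the previous frame.
    change = np.flatnonzero(frame_labels[1:] != frame_labels[:-1]) + 1
    starts = np.concatenate([[0], change])
    ends = np.concatenate([change, [len(frame_labels)]])
    return pd.DataFrame({
        "start_s": starts / frame_rate,
        "end_s": ends / frame_rate,
        "speaker": [f"Speaker {int(frame_labels[s])}" for s in starts],
    })


# 2 s of speaker 0, 1 s of speaker 1, 1 s of speaker 0, at 50 frames/s.
labels = [0] * 100 + [1] * 50 + [0] * 50
df = frames_to_turns(labels, frame_rate=50)
print(df)
```

Note that these times are relative to the frames passed in; if non-speech frames were dropped by the VAD, they need to be mapped back to positions in the original recording.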

Thank you for reading! If you liked the post, hit clap!
