Classifying Audio Files using Linear Discriminant Analysis, Principal Components Analysis and Singular Value Decomposition

Quan Lam

The Fourier Transform brings the content of a music file from the time domain to the frequency domain, which is easier to process, contrast and compare.


The Fourier Transform is based on the discovery that it is possible to take any periodic function of time f(t) and resolve it into an equivalent infinite summation of sine waves and cosine waves with frequencies that start at 0 and increase in integer multiples of a base frequency f_0 = 1/T, where T is the period of f(t). The resulting infinite series is called the Fourier Series:

f(t) = \frac{a_0}{2} + \sum_{n=1}^{\infty} \left[ a_n \cos(2\pi n f_0 t) + b_n \sin(2\pi n f_0 t) \right]

The job of a Fourier Transform is to determine all the a_n and b_n values that produce the Fourier Series, given the base frequency and the function f(t).
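
To make this concrete, here is a small sketch (not part of the project code) of how numpy's FFT recovers the frequency content of a simple test signal; the coefficient magnitudes play the role of the a_n and b_n terms above.

{{{id=50|
# Illustrative sketch: recover the frequency content of a one-second toy signal.
# The signal mixes a 3 Hz and an 8 Hz sine; the FFT peaks at those two frequencies.
import numpy
N = 256                                  # number of samples taken over one second
t = numpy.arange(N)/float(N)             # time axis, 0 <= t < 1
f = numpy.sin(2*numpy.pi*3*t) + numpy.sin(2*numpy.pi*8*t)/2
coeffs = numpy.fft.fft(f)                # frequency-domain content of f
# Since the signal lasts one second, bin k of the FFT corresponds to k Hz.
strength = numpy.abs(coeffs[:N//2])      # drop the mirrored negative frequencies
print numpy.argsort(-strength)[:2]       # the two strongest bins: [3 8]
///
}}}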




With that in mind, we are going to transform all the audio files we need to work on using the FFT (Fast Fourier Transform) to obtain our raw data. After that we can apply the LDA method to form a training set that we can use to classify other audio files later on. Linear discriminant analysis (LDA) and the related Fisher's linear discriminant are methods used in statistics, pattern recognition and machine learning to find a linear combination of features which characterize or separate two or more classes of objects or events. The resulting combination may be used as a linear classifier, or, more commonly, for dimensionality reduction before later classification.

As we can see in the charts below, which show the frequency content of the sound from the audio files over time, there is quite a difference in the distribution of sound frequencies among genres. This is mainly because each genre uses certain kinds of musical instruments, and each instrument has its own frequency range. That gives us a chance to run a classification.

From wikipedia.org:

Consider a set of observations  \vec x  (also called features, attributes, variables or measurements) for each sample of an object or event with known class y. This set of samples is called the training set. The classification problem is then to find a good predictor for the class y of any sample of the same distribution (not necessarily from the training set) given only an observation  \vec x .

LDA approaches the problem by assuming that the conditional probability density functions p(\vec x|y=0) and p(\vec x|y=1) are both normally distributed with mean and covariance parameters \left(\vec \mu_0, \Sigma_{y=0}\right) and \left(\vec \mu_1, \Sigma_{y=1}\right), respectively. Under this assumption, the Bayes optimal solution is to predict points as being from the second class if the ratio of the log-likelihoods is below some threshold T, so that:

 (\vec x- \vec \mu_0)^T \Sigma_{y=0}^{-1} ( \vec x- \vec \mu_0)\ +\ \mathrm{ln}|\Sigma_{y=0}|\ -\ (\vec x- \vec \mu_1)^T \Sigma_{y=1}^{-1} ( \vec x- \vec \mu_1)\ -\ \mathrm{ln}|\Sigma_{y=1}| \ < \ T

Without any further assumptions, the resulting classifier is referred to as QDA (quadratic discriminant analysis). LDA also makes the simplifying homoscedastic assumption (i.e. that the class covariances are identical, so \Sigma_{y=0} = \Sigma_{y=1} = \Sigma) and that the covariances have full rank. In this case, several terms cancel and the above decision criterion becomes a threshold on the dot product

 \vec w \cdot \vec x < c

for some threshold constant c, where

\vec w = \Sigma^{-1} (\vec \mu_1 - \vec \mu_0)

This means that the criterion of an input  \vec x  being in a class y is purely a function of this linear combination of the known observations.

It is often useful to see this conclusion in geometrical terms: the criterion of an input  \vec x  being in a class y is purely a function of projection of multidimensional-space point  \vec x  onto direction  \vec w . In other words, the observation belongs to y if corresponding  \vec x  is located on a certain side of a hyperplane perpendicular to  \vec w . The location of the plane is defined by the threshold c.
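
As a minimal numerical sketch of this rule (synthetic Gaussian data, not the project's classifier): estimate the two class means and a shared covariance, form \vec w = \Sigma^{-1}(\vec \mu_1 - \vec \mu_0), and compare the projection \vec w \cdot \vec x against a threshold placed between the projected class means.

{{{id=51|
# Illustrative LDA decision rule on synthetic 2-D data (homoscedastic case).
import numpy
numpy.random.seed(0)
cov = numpy.array([[1.0, 0.3], [0.3, 1.0]], dtype=float)   # shared class covariance
mu0 = numpy.array([0.0, 0.0], dtype=float)
mu1 = numpy.array([2.0, 1.0], dtype=float)
X0 = numpy.random.multivariate_normal(mu0, cov, 100)       # class 0 samples
X1 = numpy.random.multivariate_normal(mu1, cov, 100)       # class 1 samples
# Pooled covariance estimate and the LDA direction w = Sigma^-1 (mu1 - mu0)
pooled = (numpy.cov(X0.T) + numpy.cov(X1.T))/2
w = numpy.linalg.solve(pooled, X1.mean(axis=0) - X0.mean(axis=0))
# Threshold c halfway between the projected class means
c = (numpy.dot(w, X0.mean(axis=0)) + numpy.dot(w, X1.mean(axis=0)))/2
# Which side of the hyperplane w.x = c a new point falls on decides its class
x = numpy.array([1.8, 0.9], dtype=float)
print numpy.dot(w, x) > c                                   # True: closer to class 1
///
}}}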


The objective of LDA is to find the optimal transformation matrix so that the ratio of the between-class scatter matrix to the within-class scatter matrix reaches its maximum. However, since processing a high-quality audio file can take a lot of time, using a large sample size is not ideal. Using a small sample size raises the problem that the within-class scatter matrix tends to be singular, so the accuracy of the algorithm goes down. To overcome this problem, people have been using LDA in combination with SVD, the Singular Value Decomposition.
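
A toy illustration of the problem and the fix (this is not the project's rb_trainer routine): with far fewer samples than features, the scatter matrix cannot have full rank, so it is singular; after projecting onto a few leading singular directions from the SVD, the reduced scatter matrix is small and invertible.

{{{id=52|
# Sketch: 10 samples in 500 dimensions give a singular scatter matrix;
# projecting onto the top singular directions first makes it invertible.
import numpy
numpy.random.seed(1)
X = numpy.random.randn(10, 500)                # 10 samples, 500 features
Xc = X - X.mean(axis=0)                        # centre the data
scatter = numpy.dot(Xc.T, Xc)                  # 500x500, rank at most 9
print numpy.linalg.matrix_rank(scatter)        # 9, so the matrix is singular
U, s, Vt = numpy.linalg.svd(Xc, full_matrices=False)
Z = numpy.dot(Xc, Vt[:5].T)                    # keep only 5 singular directions
small_scatter = numpy.dot(Z.T, Z)              # 5x5 and invertible
print numpy.linalg.matrix_rank(small_scatter)  # 5, full rank
///
}}}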


 

Formally, the singular value decomposition of an m×n real or complex matrix M is a factorization of the form

 

M = U\Sigma V^*, \,

 

where U is an m×m real or complex unitary matrix, Σ is an m×n diagonal matrix with nonnegative real numbers on the diagonal, and V* (the conjugate transpose of V) is an n×n real or complex unitary matrix. The diagonal entries \Sigma_{i,i} of Σ are known as the singular values of M. The m columns of U and the n columns of V are called the left singular vectors and right singular vectors of M, respectively.

 

Singular value decomposition and eigendecomposition are closely related. Namely:

  • The left singular vectors of M are eigenvectors of MM^*.
  • The right singular vectors of M are eigenvectors of M^*M.
  • The non-zero singular values of M (the non-zero entries of Σ) are the square roots of the non-zero eigenvalues of M^*M or MM^*.

Applications which employ the SVD include computing the pseudoinverse, least squares fitting of data, matrix approximation, and determining the rank, range and null space of a matrix.
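
A quick numerical check of these relations (an illustrative snippet, not part of the project code):

{{{id=53|
# Check: singular values of M are the square roots of the eigenvalues of M*M,
# and the right singular vectors of M are eigenvectors of M*M (up to sign).
import numpy
numpy.random.seed(2)
M = numpy.random.randn(4, 3)
U, s, Vt = numpy.linalg.svd(M, full_matrices=False)
MstarM = numpy.dot(M.T, M)
eigvals, eigvecs = numpy.linalg.eigh(MstarM)          # ascending eigenvalues
eigvals = eigvals[::-1]
eigvecs = eigvecs[:, ::-1]                            # match the SVD's descending order
print numpy.allclose(s*s, eigvals)                    # True
print numpy.allclose(numpy.abs(numpy.dot(Vt, eigvecs)), numpy.eye(3))  # True
///
}}}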

 The key idea here is to diagonalize the covariance matrix so that all the redundancies are removed, and the largest variances of particular measurements are ordered. In other words, the system has been written in terms of its principal components, which is the main idea of PCA, Principal Component Analysis.
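
A minimal PCA sketch along those lines (toy data, for illustration only): mean-centre the measurements, take the SVD, and the covariance of the data rewritten in its principal components comes out diagonal, with the variances ordered from largest to smallest.

{{{id=54|
# PCA as covariance diagonalization via the SVD (synthetic, redundant data).
import numpy
numpy.random.seed(3)
X = numpy.random.randn(200, 3)
X[:, 1] = 2*X[:, 0] + X[:, 1]/10               # make two measurements redundant
Xc = X - X.mean(axis=0)                        # mean-centre each measurement
U, s, Vt = numpy.linalg.svd(Xc, full_matrices=False)
scores = numpy.dot(Xc, Vt.T)                   # the data in principal components
C = numpy.cov(scores.T)                        # covariance of the projected data
print numpy.round(C, 6)                        # diagonal, variances in decreasing order
print s*s/(len(X) - 1)                         # the same variances, from the singular values
///
}}}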

 


Reading in an audio file and turning it into an array of signal samples

{{{id=1|
#Input: the name of a song in the database
#Output: the frame data of the file and its sampling rate
def readAudioFile(name):
    import wave
    import array
    sample = wave.open(DATA+name,'r')
    #Converting the raw bytes to integer samples (assuming 16-bit frames)
    frames = array.array('h',sample.readframes(sample.getnframes()))
    f = sample.getframerate()
    sample.close()
    return [frames,f]
///
}}}
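
For example, assuming a file named sample1.wav has been uploaded to the worksheet's DATA directory (the name is only a placeholder), the function could be used as follows:

{{{id=55|
# Hypothetical usage of readAudioFile; "sample1.wav" is just a placeholder name.
frames, rate = readAudioFile("sample1.wav")
print rate                 # sampling rate of the file, e.g. 44100
print len(frames)          # total number of samples read from the file
///
}}}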

 

The original idea was that a signal can be divided up into 6 separate signals, each consisting of the frequency content of the original signal from a certain range. This has the general effect of separating “notes” from different instrument groups and allowing them to be analyzed separately. We can choose to follow the method outlined in Scheirer, 1998, breaking our signal into the bands 0-200Hz, 200-400Hz, 400-800Hz, 800-1600Hz, 1600-3200Hz, and finally above 3200Hz. This gives us a total of 6 bands to work with, and the choice of these bands also makes it easy to adapt to different types of music. We could do separate classification on each band and combine the results for a final conclusion; however, this has not been successfully implemented.
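
As a rough sketch of what such a split could look like (this is not the worksheet's filterFreq1 routine), the FFT coefficients of a chunk can be bucketed by the frequency of each bin using the band edges listed above:

{{{id=56|
# Illustrative band split: group the FFT coefficients of a chunk into the 6 bands above.
import numpy
def splitIntoBands(signal, sampleRate):
    dft = numpy.fft.fft(signal)
    # Frequency (in Hz) represented by each FFT bin
    freqs = numpy.abs(numpy.fft.fftfreq(len(signal), d=1.0/sampleRate))
    edges = [0, 200, 400, 800, 1600, 3200]
    bands = []
    for i in range(len(edges)):
        lo = edges[i]
        hi = edges[i+1] if i+1 < len(edges) else float('inf')
        keep = (freqs >= lo) & (freqs < hi)
        bands.append(numpy.where(keep, dft, 0))   # zero out everything outside the band
    return bands                                  # 6 arrays, one per frequency band
///
}}}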

 


 

 

{{{id=15|
#Attempt to break a signal into 6 smaller band-limited samples
import numpy
#fS: frequency of the sample
#tS: number of sample
def filterFreq1(data, fS, tS):
    smallT = 0.05
    md = data.shape[0]
    nd = data.shape[1]
    k = 1
    d1 = data[0]
    L = len(numpy.fft.fft(d1[:int(smallT*fS)]))
    #Sliding window boundaries, one window every smallT seconds
    tslide = range(int(tS*fS))
    dT = smallT*fS
    for i in range(int(tS*fS)-1):
        tslide[i+1] = tslide[i]+dT
    x1 = numpy.zeros((L,md)); x2 = numpy.zeros((L,md)); x3 = numpy.zeros((L,md))
    x4 = numpy.zeros((L,md)); x5 = numpy.zeros((L,md)); x6 = numpy.zeros((L,md))
    for j in range(md):
        dj = data[j]
        count = 0
        signal_bands = numpy.zeros((L,6))
        s1 = range(L); s2 = range(L); s3 = range(L)
        s4 = range(L); s5 = range(L); s6 = range(L)
        for jt in range(len(tslide)-1):
            signal = dj[int(tslide[jt]):int(tslide[jt+1])-1]
            if len(signal)==0:
                break
            temp = filterSignal(signal,fS)
            if signal_bands.shape == temp.shape:
                signal_bands = signal_bands + temp
            count = count + 1
            #Average every 5 windows into one set of band samples
            if count == 5:
                signal_bands = signal_bands/5
                s1 = signal_bands[:,0]; s2 = signal_bands[:,1]; s3 = signal_bands[:,2]
                s4 = signal_bands[:,3]; s5 = signal_bands[:,4]; s6 = signal_bands[:,5]
                count = 0
                k = k + 1
                signal_bands = numpy.zeros((L,6))
        x1[:,j] = s1; x2[:,j] = s2; x3[:,j] = s3
        x4[:,j] = s4; x5[:,j] = s5; x6[:,j] = s6
    return [x1,x2,x3,x4,x5,x6]
///
}}}

Run FFT on a given signal and return the result

{{{id=34|
import numpy
#Run an FFT on a given signal and return the result
def filterSignal(sig, maxfreq):
    dft = numpy.fft.fft(sig)
    return dft
///
}}}

 


 

Returns a list of frequency-domain content for a given class of song.

 

 

{{{id=31|
import numpy
#fS: frequency of the sample
#tS: number of sample
def filterFreq(data, fS, tS):
    md = data.shape[0]
    nd = data.shape[1]
    processedData = []
    #Run an FFT on each sample of this class
    for j in range(md):
        processedData.append(filterSignal(data[j],fS))
    return processedData
///
}}}



 

Performs the PCA and LDA and produces the matrix U for projection and the threshold value for comparison.

 

 

{{{id=16|
import numpy
import copy
import scipy as Sci
import scipy.linalg
def rb_trainer(type1data, type2data, feature):
    n1 = type1data.shape[0]   #number of songs of the first type
    n2 = type2data.shape[0]   #number of songs of the second type
    #Stack the two classes and put one song per column
    data = numpy.transpose(numpy.concatenate((type1data,type2data),0))
    #Running SVD (the PCA step)
    u, s, v = numpy.linalg.svd(data,full_matrices=False)
    #Write the songs in terms of the principal components and keep 'feature' of them
    songs = numpy.dot(numpy.diag(s),v)
    type1 = songs[0:feature,0:n1]
    type2 = songs[0:feature,n1:n1+n2]
    #Class means in the reduced space
    m1 = numpy.mean(type1,1).reshape(-1,1)
    m2 = numpy.mean(type2,1).reshape(-1,1)
    #Within class scatter matrix
    sw = numpy.zeros((type1.shape[0],type1.shape[0]))
    for j in range(n1):
        t1 = type1[:,j].reshape(-1,1) - m1
        sw = sw + numpy.dot(t1,numpy.transpose(t1))
    for j in range(n2):
        t2 = type2[:,j].reshape(-1,1) - m2
        sw = sw + numpy.dot(t2,numpy.transpose(t2))
    #Between class scatter matrix
    m = m1 - m2
    sb = numpy.dot(m,numpy.transpose(m))
    #Linear discriminant analysis: take the largest generalized eigenvector of (sb, sw)
    eigvals, eigvecs = Sci.linalg.eig(sb,sw)
    index = numpy.argmax(eigvals.real)
    w = eigvecs[:,index].real
    w = w/numpy.linalg.norm(w,2)
    #Project both classes onto the discriminant direction
    vt1 = numpy.dot(w,type1)
    vt2 = numpy.dot(w,type2)
    result = numpy.concatenate((vt1,vt2))
    #Orient w so that type1 projects below type2
    if numpy.mean(vt1) > numpy.mean(vt2):
        w = -w; vt1 = -vt1; vt2 = -vt2
    #type1 < threshold < type2: place the threshold between the sorted projections
    sort1 = numpy.sort(vt1)
    sort2 = numpy.sort(vt2)
    t1 = len(sort1)-1
    t2 = 0
    while t1 > 0 and t2 < len(sort2)-1 and sort1[t1] > sort2[t2]:
        t1 = t1-1
        t2 = t2+1
    th = (sort1[t1]+sort2[t2])/2
    return [result,w,u,s,v,th]
///
}}}

{{{id=23|
import numpy
def sampling(song_names, timeSampling, frequency):
    #Read the first timeSampling seconds of each song into one row of yData
    nSamples = int(timeSampling*frequency)
    yData = numpy.zeros((len(song_names),nSamples))
    for j in range(len(song_names)):
        data = readAudioFile(song_names[j])
        for jj in range(nSamples):
            yData[j,jj] = data[0][jj]
    return yData
///
}}}

{{{id=21|
import numpy
import copy
freqSampling = 44100
timeSampling = 0.8
feature = 40
classical_songs = ["sample1.wav"]
rock_songs = ["sample2.wav"]
blue_songs = ["sample3.wav"]
test_songs = ["sample2.wav"]
typeName = ['classical', 'rock', 'blue']
rockData = sampling(rock_songs, timeSampling, freqSampling)
blueData = sampling(blue_songs, timeSampling, freqSampling)
classicalData = sampling(classical_songs, timeSampling, freqSampling)
print "Training"
#Earlier attempt: train on the frequency-domain band data instead of the raw samples
#rProcessedData = filterFreq(rockData, freqSampling, timeSampling)
#bProcessedData = filterFreq(blueData, freqSampling, timeSampling)
#cProcessedData = filterFreq(classicalData, freqSampling, timeSampling)
#rtS = rProcessedData[1]; btS = bProcessedData[1]; ctS = cProcessedData[1]
#LDA
#resultArray1 = rb_trainer(ctS, rtS, feature)
#resultArray2 = rb_trainer(ctS, btS, feature)
#resultArray3 = rb_trainer(rtS, btS, feature)
resultArray1 = rb_trainer(classicalData, rockData, feature)
resultArray2 = rb_trainer(classicalData, blueData, feature)
resultArray3 = rb_trainer(rockData, blueData, feature)
print "Testing"
listID = numpy.zeros(len(test_songs))
for j in range(len(test_songs)):
    aSong = sampling([test_songs[j]], timeSampling, freqSampling)
    songTS = aSong[0]
    #Earlier attempt (never finished): average the projection over 3 frequency bands
    #for jj in range(3):
    #    testProcessedData = filterFreq(aSong[jj,:], freqSampling, timeSampling)
    #    songTS[jj] = copy.deepcopy(testProcessedData[3])
    #classical vs rock
    rep = numpy.dot(numpy.transpose(resultArray1[2]), songTS)[:len(resultArray1[1])]
    pval = numpy.dot(resultArray1[1], rep)
    if pval < resultArray1[5]:   #threshold 1
        listID[j] = 1
    else:
        #classical vs blue
        rep = numpy.dot(numpy.transpose(resultArray2[2]), songTS)[:len(resultArray2[1])]
        pval = numpy.dot(resultArray2[1], rep)
        if pval < resultArray2[5]:   #threshold 2
            listID[j] = 1
        else:
            #rock vs blue
            rep = numpy.dot(numpy.transpose(resultArray3[2]), songTS)[:len(resultArray3[1])]
            pval = numpy.dot(resultArray3[1], rep)
            if pval < resultArray3[5]:   #threshold 3
                listID[j] = 2
            else:
                listID[j] = 3
print "Done"
print "Summary:"
for i in range(len(test_songs)):
    print test_songs[i]
    print listID[i]
///
Training
}}}

The Results:

- I haven't been able to optimize the algorithm to a satisfying level. Some attempts at incorporating R have failed, mostly because I had difficulty making R talk to Sage effectively about procedures and data. The base version written with numpy ran extremely slowly and returned unreliable results.

- The numpy library, even though it interacts well with Sage, doesn't support heavy matrix processing very well. Its SVD function can fail sometimes even though in theory it should not. I tried using the matlab/octave libraries instead, but I had difficulty integrating them into the project.

Potential Development:

- The algorithm can be further optimized by rewriting many of its components using other libraries to exploit their strengths.

- Instead of processing a file as a whole, we can divide it into many smaller, randomly picked chunks for faster processing with the same result and accuracy, and we can also revisit the idea of dividing the data into different frequency bands to target the distinct properties of different music genres more accurately.
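
A possible shape for the chunking idea (purely a sketch of the suggestion above; chunk length and count are hypothetical choices):

{{{id=57|
# Sketch: pick a few random chunks out of a long signal instead of processing all of it.
import numpy
def randomChunks(frames, chunkLen, nChunks):
    # Random starting positions, leaving room for a full chunk at the end
    starts = numpy.random.randint(0, len(frames) - chunkLen, nChunks)
    return [numpy.asarray(frames[s:s+chunkLen]) for s in starts]
# e.g. five chunks of 0.8 seconds each from a 44100 Hz file:
# chunks = randomChunks(frames, int(0.8*44100), 5)
///
}}}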

References

“Audio Topics: The Frequencies of Music.” PSB Speakers International, 633 Granite Court. http://www.psbspeakers.com/audioTopics.php?fpId=8&page_num=1&start=0

Cheng, Kileen. Nazer, Bobak. Uppuluri, Jyoti. Verret, Ryan. “Beat This: A Synchronization Project.” http://www.owlnet.rice.edu/~elec301/Projects01/beat_sync/beatalgo.html

Jimaa, Shihab. Krishnan, Sridhar. Umapathy, Karthikeyan. “Multigroup Classification of Audio Signals Using Time-Frequency Parameters.” http://ieeexplore.ieee.org/iel5/6046/30529/01407903.pdf?tp=&arnumber=1407903&isnumber=30529


Neeta Nain, Prashant Gour, Nitish Agarwal, Rakes P Talawar, Subhash Chandra "Face recognization using LDA, PCA and SVD". citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.148.9268