Perceptual Evaluation of Speech Quality (PESQ)

(Revised 02-11-04.  All revisions in italics)

A Discussion

Paul Ordas and Brian Fox 
Microtronix Systems Ltd.

In recent years a great deal of effort has been expended to develop methods that determine the Quality of Service (QoS) of networks through the use of comparative algorithms. These methods are designed to calculate a quality index that correlates with the mean opinion score given by human subjects in evaluation sessions. Typically these methods use a recorded or simulated speech stimulus. The stimulus is sent through the system under test and the output signal is compared to the original.

There are a number of methods available, but this article is restricted to one of the more modern ones, PESQ (Perceptual Evaluation of Speech Quality). The PESQ algorithm is designed to predict subjective opinion scores of a degraded audio sample. PESQ returns a score from -0.5 to 4.5, with higher scores indicating better quality.

PESQ is designed to analyze specific parameters of audio, including time warping, variable delays, transcoding, and noise. It is primarily intended for applications in codec evaluation and network testing.

The idea of PESQ is very appealing because it would seem to provide a set of automated "golden ears" that can evaluate any type of audio system and give a useful indication of the "quality" of the system. Make no mistake: PESQ works very well when used as intended, but some big surprises await those who attempt to replace traditional telephone evaluation methods with PESQ.

At Microtronix we have been evaluating PESQ with a view to applying it to VoIP Telephone Testing. A number of people have requested this, so we decided to determine how well it works for that purpose. What we found made it clear to us that although PESQ is a useful method to incorporate into a Telephone Testing System, it must be used as an adjunct to traditional methods. This is because PESQ was not designed to evaluate some of the factors that determine the "quality" of a Telephone. For example, PESQ does not take into account frequency response and loudness, two very important factors that affect the perceived quality of a telephone terminal.

To demonstrate this we have placed three files on this web page. The first file, OR272.WAV, is an original file of speech in the Dutch language (Nederlands) that is provided with the ITU specification document for PESQ.

The next file, DG001.WAV, is a degraded version of the original file. It has been degraded by mixing a low level of white noise with the OR272.WAV file. This file is not audibly different from the original when heard at normal listening levels.
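As a rough illustration of this kind of degradation, the following Python sketch mixes Gaussian white noise into a signal at a chosen signal-to-noise ratio. It is not the procedure used to create DG001.WAV: the 50 dB SNR, the 440 Hz test tone standing in for the speech file, and the function names are all assumptions made for the example. Only the 8000 Hz sample rate is taken from the results reported for these files.

```python
import math
import random


def mix_white_noise(samples, snr_db, seed=0):
    """Return samples plus Gaussian white noise at the requested SNR (dB)."""
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    signal_power = sum(s * s for s in samples) / len(samples)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in samples]


def measured_snr_db(clean, noisy):
    """Measure the SNR actually achieved by comparing noisy to clean."""
    signal_power = sum(s * s for s in clean)
    noise_power = sum((n - s) ** 2 for s, n in zip(clean, noisy))
    return 10 * math.log10(signal_power / noise_power)


# Assumption: one second of a 440 Hz tone stands in for the speech file.
fs = 8000
tone = [0.5 * math.sin(2 * math.pi * 440 * n / fs) for n in range(fs)]

# Assumption: 50 dB SNR as an example of a "low level" of noise.
noisy = mix_white_noise(tone, snr_db=50)
print(round(measured_snr_db(tone, noisy), 1))
```

With real material, the samples would be read from the original WAV file instead of being generated, and the noise level would be chosen by listening.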

The third file, DG002.WAV, is equalized so that it has far less low-frequency and high-frequency energy than the original file. The degradation is clearly audible when you listen to it, yet PESQ reports the same quality for DG001.WAV and DG002.WAV!
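To make the idea of a band-limiting equalization concrete, here is a minimal Python sketch that cascades a first-order high-pass and a first-order low-pass filter and then measures how much tones at three frequencies are attenuated. The equalizer actually applied to DG002.WAV is not described here, so the filter type, the 700 Hz and 1500 Hz cutoff frequencies, and all function names are assumptions chosen only for illustration.

```python
import math


def band_limit(samples, fs, low_cut_hz, high_cut_hz):
    """Cascade a one-pole high-pass and a one-pole low-pass filter."""
    dt = 1.0 / fs
    rc_hp = 1.0 / (2 * math.pi * low_cut_hz)
    a_hp = rc_hp / (rc_hp + dt)   # high-pass coefficient
    rc_lp = 1.0 / (2 * math.pi * high_cut_hz)
    a_lp = dt / (rc_lp + dt)      # low-pass coefficient

    out = []
    prev_x = 0.0
    hp = 0.0
    lp = 0.0
    for x in samples:
        hp = a_hp * (hp + x - prev_x)  # removes low-frequency energy
        prev_x = x
        lp = lp + a_lp * (hp - lp)     # removes high-frequency energy
        out.append(lp)
    return out


def rms(samples):
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def tone_gain(freq_hz, fs=8000, low_cut_hz=700.0, high_cut_hz=1500.0):
    """Ratio of output RMS to input RMS for a one-second test tone."""
    tone = [math.sin(2 * math.pi * freq_hz * n / fs) for n in range(fs)]
    return rms(band_limit(tone, fs, low_cut_hz, high_cut_hz)) / rms(tone)


# Low and high frequencies are attenuated more than the mid band.
for f in (100, 1000, 3400):
    print(f, round(tone_gain(f), 2))
```

A first-order cascade like this is far gentler than a typical equalizer, but it shows the shape of the effect: energy at the edges of the telephone band is reduced relative to the middle.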

Below are the reported results given by PESQ when comparing these files to OR272.WAV.

DEGRADED    PESQMOS  SUBJMOS  COND  SAMPLE_FREQ  CRUDE_DELAY
dg001.wav   4.431    0.000    0     8000         -0.3600
dg002.wav   4.431    0.000    0     8000          0.1360


Both degraded files have a PESQ score of 4.431 but the file degraded by white noise is virtually indistinguishable from the original, while the file degraded by poor frequency response is audibly of lower quality.
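When many files are evaluated, a comparison like this is easier to automate than to read by eye. The Python sketch below parses results text in the whitespace-separated layout shown above and checks whether the reported scores are equal. The helper names and the 0.001 tolerance are assumptions for the example; the data is copied from the results above.

```python
RESULTS = """\
DEGRADED PESQMOS SUBJMOS COND SAMPLE_FREQ CRUDE_DELAY
dg001.wav 4.431 0.000 0 8000 -0.3600
dg002.wav 4.431 0.000 0 8000 0.1360
"""


def parse_results(text):
    """Turn the whitespace-separated results into a list of dicts."""
    lines = [line.split() for line in text.strip().splitlines()]
    header, rows = lines[0], lines[1:]
    return [dict(zip(header, row)) for row in rows]


def same_score(rows, tolerance=0.001):
    """True if all PESQMOS values agree to within the tolerance."""
    scores = [float(row["PESQMOS"]) for row in rows]
    return max(scores) - min(scores) <= tolerance


rows = parse_results(RESULTS)
for row in rows:
    print(row["DEGRADED"], row["PESQMOS"])
print(same_score(rows))
```

Identical scores for files that sound different are a cue to run the traditional measurements as well, since the score alone cannot separate the two cases.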

This discussion should not be interpreted to imply that there is any flaw with PESQ.  PESQ does not attempt to define what 'quality' is; the purpose of the PESQ algorithm is to objectively predict the subjective mean opinion scores in a P.800 listening setup.  We believe that PESQ does what it was intended to do, but users of PESQ must understand the scope of ITU-T P.862.  The PESQ scope does not include effects of loudness loss (ITU-T P.862 Table 2), nor frequency response variations of less than 20 dB (ITU-T P.862 10.2.6), and it is not validated for acoustic terminal testing (ITU-T P.862 Table 3).

Listen to the Files here:

OR272.wav - Original File

DG001.wav - File Degraded with low level White Noise

DG002.wav - File Degraded by narrow band frequency response

Conclusion

PESQ can be used in addition to other methods when evaluating the performance of a telephone terminal, but PESQ alone cannot ensure good telephone quality. To fully evaluate a telephone it is important to use methods like those called for in the TIA/EIA-810-A standard. Frequency response, loudness ratings and other traditional telephone measurements, used in conjunction with PESQ, can guarantee that VoIP telephones provide a quality of service that is equal to or better than conventional POTS telephones.

 


Contact Microtronix or your local representative for details.