Lecture 7: Audio Compression And Coding

Transcription

EE E6820: Speech & Audio Processing & Recognition
Lecture 7: Audio compression and coding
Dan Ellis (dpwe@ee.columbia.edu), Michael Mandel (mim@ee.columbia.edu)
Columbia University Dept. of Electrical Engineering
http://www.ee.columbia.edu/~dpwe/e6820
March 31, 2009

Outline
  1. Information, Compression & Quantization
  2. Speech coding
  3. Wide-Bandwidth Audio Coding

Compression & Quantization
How big is audio data? What is the bitrate?
  - Fs frames/second (e.g. 8000 or 44100)
  - x C samples/frame (e.g. 1 or 2 channels)
  - x B bits/sample (e.g. 8 or 16)
  - -> Fs · C · B bits/second (e.g. 64 Kbps or 1.4 Mbps)
[Figure: bits / frame (8 to 32) vs. frames / sec (8000 to 44100), marking Telephony 64 Kbps, Mobile ~13 Kbps, and CD Audio 1.4 Mbps]
How to reduce?
  - lower sampling rate -> less bandwidth (muffled)
  - lower channel count -> no stereo image
  - lower sample size -> quantization noise
Or: use data compression
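The bitrate product above can be checked with a few lines of Python. This is only an arithmetic sketch; the function name `pcm_bitrate` is made up for illustration.

```python
def pcm_bitrate(fs_hz, channels, bits_per_sample):
    """Raw PCM bitrate in bits/second: Fs * C * B."""
    return fs_hz * channels * bits_per_sample

# Telephony: 8 kHz, mono, 8 bits/sample
telephony = pcm_bitrate(8000, 1, 8)
# CD audio: 44.1 kHz, stereo, 16 bits/sample
cd_audio = pcm_bitrate(44100, 2, 16)

print(telephony / 1e3, "Kbps")   # 64.0 Kbps
print(cd_audio / 1e6, "Mbps")    # 1.4112 Mbps
```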

Data compression: Redundancy vs. Irrelevance
Two main principles in compression:
  - remove redundant information
  - remove irrelevant information
Redundant information is implicit in remainder
  - e.g. signal bandlimited to 20 kHz, but sampled at 80 kHz -> can recover every other sample by interpolation
[Figure: samples vs. time. In a bandlimited signal, the red samples can be exactly recovered by interpolating the blue samples]
Irrelevant info is unique but unnecessary
  - e.g. recording a microphone signal at 80 kHz sampling rate

Irrelevant data in audio coding
For coding of audio signals, irrelevant means perceptually insignificant
  - an empirical property
Compact Disc standard is adequate:
  - 44 kHz sampling for 20 kHz bandwidth
  - 16 bit linear samples for ~96 dB peak SNR
Reflect limits of human sensitivity:
  - 20 kHz bandwidth, 100 dB intensity
  - sinusoid phase, detail of noise structure
  - dynamic properties - hard to characterize
Problem: separating salient & irrelevant

Quantization
Represent waveform with discrete levels
[Figure: staircase quantizer characteristic Q[x] vs. x; waveforms x[n] and Q[x[n]], and the error e[n] = x[n] - Q[x[n]]]
Equivalent to adding error e[n]:
  $x[n] = Q[x[n]] + e[n]$
e[n] ~ uncorrelated, uniform white noise
  - p(e[n]) is uniform on $-D/2 \ldots +D/2$
  - -> variance $\sigma_e^2 = D^2/12$

Quantization noise (Q-noise)
Uncorrelated noise has flat spectrum
With a B bit word and a quantization step D:
  - max signal range (x) = $-(2^{B-1}) \cdot D \ldots (2^{B-1} - 1) \cdot D$
  - quantization noise (e) = $-D/2 \ldots D/2$
-> Best signal-to-noise ratio (power):
  $\mathrm{SNR} = E[x^2]/E[e^2] = (2^B)^2$
  or, in dB, $20 \cdot \log_{10} 2 \cdot B \approx 6 \cdot B$ dB
[Figure: spectrum of a signal quantized at 7 bits; level / dB vs. freq / Hz]
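The 6 dB per bit rule can be checked numerically. The sketch below is an illustration, not part of the lecture: it assumes a full-scale signal that is uniformly distributed over [-1, 1) and a mid-rise uniform quantizer, and it measures the power ratio directly. The function name and parameters are hypothetical.

```python
import math
import random

def quantization_snr_db(bits, n=50000, seed=0):
    """Measured SNR (dB) of a B-bit uniform quantizer on a full-scale uniform signal."""
    rng = random.Random(seed)
    step = 2.0 / (2 ** bits)          # step D when 2^B levels span [-1, 1)
    sig_pow = 0.0
    err_pow = 0.0
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        q = (math.floor(x / step) + 0.5) * step   # mid-rise quantizer output
        sig_pow += x * x
        err_pow += (x - q) ** 2
    return 10.0 * math.log10(sig_pow / err_pow)

for b in (7, 8, 16):
    print(b, "bits:", round(quantization_snr_db(b), 1), "dB; 6.02*B =", round(6.02 * b, 1))
```

Signals that do not fill the quantizer range, or that have a peakier distribution, will measure a lower SNR than this best case.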

Redundant information
Redundancy removal is lossless
Signal correlation implies redundant information
  - e.g. if $x[n] = x[n-1] + v[n]$, x[n] has a greater amplitude range -> uses more bits than v[n]
  - sending $v[n] = x[n] - x[n-1]$ can reduce amplitude, hence bitrate
[Figure: waveform of x[n] - x[n-1]]
  - but: ‘white noise’ sequence has no redundancy...
Problem: separating unique & redundant
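A small sketch of the first-difference idea, under assumptions that are not from the slide: the test signal is a 100 Hz sine sampled at 8 kHz and scaled to roughly 16-bit range, and `bits_needed` is a hypothetical helper that sizes a two's-complement word from the peak magnitude.

```python
import math

def bits_needed(values):
    """Two's-complement word size (conservative) that covers the peak magnitude."""
    peak = max(abs(v) for v in values)
    return peak.bit_length() + 1      # magnitude bits plus one sign bit

# Assumed test signal: slowly varying, so adjacent samples are highly correlated
x = [round(0.9 * 32767 * math.sin(2 * math.pi * 100 * n / 8000)) for n in range(800)]

# v[n] = x[n] - x[n-1], taking x[-1] = 0
v = [x[0]] + [x[n] - x[n - 1] for n in range(1, len(x))]

# Lossless: a running sum of v[n] gives back x[n] exactly
rebuilt = []
acc = 0
for d in v:
    acc += d
    rebuilt.append(acc)

bits_x = bits_needed(x)
bits_v = bits_needed(v)
print(bits_x, bits_v, rebuilt == x)
```

The saving depends entirely on the correlation between neighboring samples; for a white-noise input the difference signal would need at least as many bits as the original.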

Optimal coding
Shannon information: An unlikely occurrence is more ‘informative’
  - p(A) = 0.5, p(B) = 0.5: ABBBBAAABBABBABBABB (A, B equiprobable)
  - p(A) = 0.9, p(B) = 0.1: AAAAABBAAAAAABAAAAB (A is expected; B is ‘big news’)
Information in bits $I = -\log_2(\text{probability})$
  - clearly works when all possibilities equiprobable
Optimal bitrate -> av. token length = entropy $H = E[I]$
  - ... equal-length tokens are equally likely
How to achieve this?
  - transform signal to have uniform pdf
  - nonuniform quantization for equiprobable tokens
  - variable-length tokens -> Huffman coding
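The entropy of the two example sources above can be computed directly from H = E[I]. This is a minimal sketch; `entropy_bits` is an illustrative name.

```python
import math

def entropy_bits(probs):
    """Entropy H = E[-log2 p] in bits per token; zero-probability tokens contribute nothing."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

h_equal = entropy_bits([0.5, 0.5])    # equiprobable A, B
h_skewed = entropy_bits([0.9, 0.1])   # A expected, B is 'big news'

print(h_equal)              # 1.0 bit per token
print(round(h_skewed, 3))   # 0.469 bits per token
```

So the skewed source could in principle be coded in under half a bit per token, which a one-bit-per-token code cannot reach without grouping tokens.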

Quantization for optimum bitrate
Quantization should reflect pdf of signal:
[Figure: pdf p(x = x0) and cumulative pdf p(x < x0) over x0, mapped to x']
  - cumulative pdf p(x < x0) maps to uniform x'
Or, codeword length per Shannon $-\log_2(p(x))$:
[Figure: p(x) and Shannon info / bits over x, with example codewords]
Huffman coding: tree-structured decoder
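A compact Huffman construction, as one way to get variable-length tokens whose lengths follow -log2(p). This is a sketch using the standard library `heapq`; the four-symbol source and its probabilities are made-up example values.

```python
import heapq

def huffman_code(probs):
    """Build a prefix code from {symbol: probability}; returns {symbol: bit string}.

    A single-symbol source gets the empty string as its code.
    """
    # Heap entries: (probability, tie_breaker, {symbol: partial code})
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)   # two least probable subtrees
        p1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c0.items()}
        merged.update({s: "1" + c for s, c in c1.items()})
        heapq.heappush(heap, (p0 + p1, counter, merged))
        counter += 1
    return heap[0][2]

probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}   # assumed example source
code = huffman_code(probs)
avg_len = sum(probs[s] * len(code[s]) for s in probs)
print(code, avg_len)
```

For this dyadic example the average length equals the entropy (1.75 bits); for other distributions Huffman coding comes within one bit per token of it.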

Vector Quantization
Quantize mutually dependent values in joint space:
[Figure: scatter of (x1, x2) pairs in the joint space]
May help even if values are largely independent
  - larger space x1, x2 is easier for Huffman

Compression & Representation
As always, success depends on representation
Appropriate domain may be ‘naturally’ bandlimited
  - e.g. vocal-tract-shape coefficients
  - can reduce sampling rate without data loss
In right domain, irrelevance is easier to ‘get at’
  - e.g. STFT to separate magnitude and phase

Aside: Coding standards
Coding only useful if recipient knows the code!
Standardization efforts are important
Federal Standards: Low bit-rate secure voice:
  - FS1015e: LPC-10 2.4 Kbps
  - FS1016: 4.8 Kbps CELP
ITU G.x series (also H.x for video)
  - G.726 ADPCM
  - G.729 Low delay CELP
MPEG
  - MPEG-Audio layers 1, 2, 3 (mp3)
  - MPEG 2 Advanced Audio Codec (AAC)
  - MPEG 4 Synthetic-Natural Hybrid Codec
More recent ‘standards’
  - proprietary: WMA, Skype...
  - Speex...


Speech coding
Standard voice channel:
  - analog: 4 kHz slot (~40 dB SNR)
  - digital: 64 Kbps = 8 bit µ-law x 8 kHz
How to compress?
Redundant
  - signal assumed to be a single voice, not any possible waveform
Irrelevant
  - need code only enough for intelligibility, speaker identification (c/w analog channel)
Specifically, source-filter decomposition
  - vocal tract & f0 change slowly
Applications:
  - live communications
  - offline storage

Channel Vocoder (1940s-1960s)
Basic source-filter decomposition
  - filterbank breaks into spectral bands
  - transmit slowly-changing energy in each band
[Figure: encoder: input -> bandpass filters 1..N -> smoothed energy -> downsample & encode (E1..EN), plus voicing analysis (V/UV, pitch); decoder: pulse generator / noise source -> bandpass filters 1..N scaled by E1..EN -> output]
  - 10-20 bands, perceptually spaced
Downsampling?
Excitation?
  - pitch / noise model
  - or: baseband + ‘flattening’...

LPC encoding
The classic source-filter model
[Figure: encoder: input s[n] -> LPC analysis -> filter coefficients {a_i} (spectral envelope |1/A(e^{jω})| vs. f) and residual e[n] (vs. t), each passed to "represent & encode"; decoder: excitation generator -> ê[n] -> all-pole filter H(z) -> output ŝ[n]]
  $H(z) = \frac{1}{1 - \sum_i a_i z^{-i}}$
Compression gains:
  - filter parameters are slowly changing
  - excitation can be represented many ways
[Figure: timeline with filter parameters updated every 20 ms and excitation/pitch parameters every 5 ms]

Encoding LPC filter parameters
For ‘communications quality’:
  - 8 kHz sampling (4 kHz bandwidth)
  - 10th order LPC (up to 5 pole pairs)
  - update every 20-30 ms -> 300 - 500 param/s
Representation & quantization
  - {a_i} - poor distribution, can’t interpolate
  - reflection coefficients {k_i}: guaranteed stable
  - LSPs - lovely!
[Figure: a_i, k_i, and LSP frequencies f_Li; frequency axis 0 to 0.5]
Bit allocation (filter):
  - GSM (13 kbps): 8 LARs x 3-6 bits / 20 ms = 1.8 Kbps
  - FS1016 (4.8 kbps): 10 LSPs x 3-4 bits / 30 ms = 1.1 Kbps
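The parameter and bit rates on this slide follow from simple division, sketched below. The 34-bit LSP total for FS1016 is taken from the CELP bit table later in the lecture; the GSM bits per frame are back-computed from the quoted 1.8 Kbps rather than taken from the GSM specification.

```python
ORDER = 10   # LPC order, i.e. parameters per update

# Parameter rate for 30 ms and 20 ms update intervals
params_per_s_slow = ORDER * 1000 / 30    # about 333 param/s
params_per_s_fast = ORDER * 1000 / 20    # 500 param/s

# FS1016: 10 LSPs at 3-4 bits each (34 bits total) every 30 ms
fs1016_filter_bps = 34 * 1000 / 30       # about 1133 bits/s, i.e. 1.1 Kbps

# GSM: 1.8 Kbps with 20 ms frames implies this many filter bits per frame
gsm_bits_per_frame = 1800 * 20 / 1000    # 36 bits over 8 LARs, 4.5 bits on average

print(round(params_per_s_slow), params_per_s_fast)
print(round(fs1016_filter_bps), gsm_bits_per_frame)
```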

Line Spectral Pairs (LSPs)
LSPs encode LPC filter by a set of frequencies
Excellent for quantization & interpolation
Definition: zeros of
  $P(z) = A(z) + z^{-p-1} \cdot A(z^{-1})$
  $Q(z) = A(z) - z^{-p-1} \cdot A(z^{-1})$
  - $z = e^{j\omega} \rightarrow z^{-1} = e^{-j\omega} \rightarrow |A(z)| = |A(z^{-1})|$ on the unit circle
  - P(z), Q(z) have (interleaved) zeros when $\angle\{A(z)\} = \pm\angle\{z^{-p-1} A(z^{-1})\}$
  - reconstruct $P(z), Q(z) = \prod_i (1 - \zeta_i z^{-1})$ etc.
  - $A(z) = [P(z) + Q(z)]/2$
[Figure: z-plane plot marking the roots A(z^{-1}) = 0, A(z) = 0, Q(z) = 0, and P(z) = 0]
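The definition of P(z) and Q(z) maps directly onto coefficient lists. The sketch below assumes `a` holds the coefficients of A(z) itself in powers of z^-1 (so `a[0]` is 1); it stops at forming the two polynomials and does not find their roots, which would be the actual LSP frequencies. The example coefficients are arbitrary.

```python
def lsp_polynomials(a):
    """Return coefficient lists of P(z) and Q(z) for A(z) = sum_k a[k] z^-k, order p = len(a) - 1."""
    ext = list(a) + [0.0]            # A(z), padded up to degree p+1
    rev = [0.0] + list(a)[::-1]      # z^-(p+1) * A(z^-1): coefficients reversed and delayed
    p_poly = [x + y for x, y in zip(ext, rev)]
    q_poly = [x - y for x, y in zip(ext, rev)]
    return p_poly, q_poly

a = [1.0, -0.5, 0.25]                # assumed example A(z), order p = 2
P, Q = lsp_polynomials(a)
print(P)   # symmetric:      [1.0, -0.25, -0.25, 1.0]
print(Q)   # antisymmetric:  [1.0, -0.75, 0.75, -1.0]

# A(z) = [P(z) + Q(z)] / 2 recovers the original coefficients
recovered = [(x + y) / 2 for x, y in zip(P, Q)][:-1]
print(recovered)
```

The symmetry of P and antisymmetry of Q are what force their zeros onto the unit circle, so each zero is described by a single frequency.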

Encoding LPC excitation
Excitation already better than raw signal:
[Figure: original signal (range about ±5000) and LPC residual (range about ±100) vs. time / s]
  - save several bits/sample, but still > 32 Kbps
Crude model: U/V flag + pitch period
  - ~7 bits / 5 ms = 1.4 Kbps -> LPC10 @ 2.4 Kbps
[Figure: pitch period values over 16 ms frame boundaries vs. time / s]
Band-limit then re-extend (RELP)

Encoding excitation
Something between full-quality residual (32 Kbps) and pitch parameters (1.4 kbps)?
‘Analysis by synthesis’ loop:
[Figure: input s[n] -> LPC analysis -> filter coefficients A(z); excitation encoding (with a b·z^{-NL} block) -> x[n] -> LPC filter 1/A(z) -> x[n]*h_a[n] - s[n] -> ‘perceptual’ weighting $W(z) = A(z)/A(z/\gamma)$ -> MS error minimization -> back to excitation encoding]
‘Perceptual’ weighting discounts peaks:
[Figure: gain / dB vs. freq / Hz for 1/A(z), 1/A(z/γ), and W(z) = A(z)/A(z/γ)]
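The weighting filter W(z) = A(z)/A(z/γ) is cheap to form because substituting z/γ for z just scales the k-th coefficient by γ^k. A minimal sketch, assuming `a` lists the coefficients of A(z) in powers of z^-1; the example values and γ = 0.5 are arbitrary and chosen only so the arithmetic is exact.

```python
def bandwidth_expand(a, gamma):
    """Coefficients of A(z/gamma), given A(z) = sum_k a[k] z^-k."""
    return [coef * gamma ** k for k, coef in enumerate(a)]

def perceptual_weighting(a, gamma):
    """Return (numerator, denominator) coefficient lists of W(z) = A(z) / A(z/gamma)."""
    return list(a), bandwidth_expand(a, gamma)

a = [1.0, -0.5, 0.25]                     # assumed example A(z)
num, den = perceptual_weighting(a, 0.5)   # gamma picked for exact arithmetic only
print(num, den)                           # [1.0, -0.5, 0.25] [1.0, -0.25, 0.0625]
```

With γ below 1 the poles of 1/A(z/γ) move toward the origin, which broadens the formant peaks; dividing A(z) by it therefore de-emphasizes error near the peaks, where the signal itself masks more noise.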

Multi-Pulse Excitation (MPE-LPC)
Stylize excitation as N discrete pulses
[Figure: original excitation vs. multi-pulse stylization; time / samps]
  - encode as N (t_i, m_i) pairs
Greedy algorithm places one pulse at a time:
  $E_{pcp} = \frac{A(z)}{A(z/\gamma)} \left[ \frac{X(z)}{A(z)} - S(z) \right] = \frac{X(z)}{A(z/\gamma)} - \frac{R(z)}{A(z/\gamma)}$
  - R(z) is residual of target waveform after inverse-filtering
  - cross-correlate $h_\gamma$ and $r * h_\gamma$, iterate

CELP
Represent excitation with codebook
  - e.g. 512 sparse excitation vectors
[Figure: codebook excitation vectors; excitation vs. time / samples]
  - linear search for minimum weighted error?
FS1016 4.8 Kbps CELP (30 ms frame = 144 bits):
  10 LSPs         4x4 + 6x3 bits = 34 bits
  Pitch delay     4 x 7 bits     = 28 bits
  Pitch gain      4 x 5 bits     = 20 bits
  Codebk index    4 x 9 bits     = 36 bits
  Codebk gain     4 x 5 bits     = 20 bits
  Total                            138 bits
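The bit budget in the table can be summed up as a quick check. The field names are just labels for the rows above; the slide does not itemize the bits left over between the 138-bit total and the 144-bit frame, so the sketch only reports their count.

```python
# Bits per 30 ms frame, row by row from the table
fields = {
    "LSPs": 4 * 4 + 6 * 3,
    "pitch delay": 4 * 7,
    "pitch gain": 4 * 5,
    "codebook index": 4 * 9,
    "codebook gain": 4 * 5,
}

total_bits = sum(fields.values())       # 138
frame_bits = 144
frame_ms = 30

bitrate_bps = frame_bits * 1000 / frame_ms    # 4800.0 bits/s
unlisted_bits = frame_bits - total_bits       # 6 bits not itemized on the slide
codebook_size = 2 ** 9                        # a 9-bit index addresses 512 vectors

print(total_bits, bitrate_bps, unlisted_bits, codebook_size)
```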

Aside: CELP for nonspeech?
CELP is sometimes called a ‘hybrid’ coder:
  - originally based on source-filter voice model
  - CELP residual is waveform coding (no model)
[Figure: spectrograms (freq / kHz vs. time / sec) of Original (mrZebra-8k), 5 kbps buzz-hiss, and 4.8 kbps CELP]
CELP does not break with multiple voices etc.
  - just does the best it can
LPC filter models vocal tract; also matches auditory system?
  - i.e. the ‘source-filter’ separation is good for relevance as well as redundancy?


Wide-Bandwidth Audio Coding
Goals:
  - transparent coding i.e. no perceptible effect
  - general purpose - handles any signal
Simple approaches (redundancy removal)
  - Adaptive Differential PCM (ADPCM)
[Figure: ADPCM loop with $D[n] = X[n] - X_p[n]$; adaptive quantizer $C[n] = Q[D[n]]$; dequantizer $D'[n] = Q^{-1}[C[n]]$; $X'[n] = X_p[n] + D'[n]$; predictor $X_p[n] = F[X'[n-i]]$]
  - as prediction gets smarter, becomes LPC
  - e.g. shorten - lossless LPC encoding
Larger compression gains need irrelevance
  - hide quantization noise with psychoacoustic masking
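The differential coding loop can be sketched in a few lines. To keep it short this version is deliberately simplified and is not true ADPCM: the quantizer step is fixed rather than adaptive, and the predictor F is just the previous reconstructed sample. Function names, the test signal, and the step size are illustrative assumptions.

```python
import math

def dpcm_encode(x, step):
    """Encode x as integer codes C[n] = Q[D[n]], predicting from the reconstruction."""
    codes = []
    recon_prev = 0.0
    for sample in x:
        pred = recon_prev                 # Xp[n] = F[X'[n-1]]
        diff = sample - pred              # D[n] = X[n] - Xp[n]
        code = round(diff / step)         # C[n] = Q[D[n]]
        codes.append(code)
        recon_prev = pred + code * step   # X'[n] = Xp[n] + D'[n], same as the decoder
    return codes

def dpcm_decode(codes, step):
    """Rebuild X'[n] by running the same predictor on the dequantized differences."""
    out = []
    recon_prev = 0.0
    for code in codes:
        recon_prev = recon_prev + code * step
        out.append(recon_prev)
    return out

step = 0.05
x = [math.sin(2 * math.pi * n / 50) for n in range(200)]   # assumed test signal
codes = dpcm_encode(x, step)
y = dpcm_decode(codes, step)
max_err = max(abs(a - b) for a, b in zip(x, y))
print(max(abs(c) for c in codes), max_err)
```

Predicting from the reconstructed signal rather than the clean input is the important design point: the encoder and decoder then stay in lockstep and the quantization error does not accumulate.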

Noise shaping
Plain Q-noise sounds like added white noise
  - actually, not all that disturbing...
  - but worst-case for exploiting masking
Have Q-noise scale with signal level (mu-law quantization)
  - i.e. quantizer step gets larger with amplitude
  - minimum distortion for some center-heavy pdf
[Figure: mu-law quantization; mu(x) and x - mu(x) for x from 0 to 1]
Or: put Q-noise around peaks in spectrum
  - key to getting benefit of perceptual masking
[Figure: level / dB vs. freq / Hz for the signal, linear Q-noise, and transform Q-noise]
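A sketch of mu-law companding using the continuous logarithmic formula with µ = 255. This is the textbook curve, offered as an illustration; deployed telephony codecs use a piecewise-linear approximation of it. It shows how uniform steps in the companded domain become amplitude-dependent steps in the signal domain.

```python
import math

MU = 255.0

def mu_compress(x, mu=MU):
    """Map x in [-1, 1] to [-1, 1]: sign(x) * ln(1 + mu*|x|) / ln(1 + mu)."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)

def mu_expand(y, mu=MU):
    """Inverse mapping: sign(y) * ((1 + mu)**|y| - 1) / mu."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)

# One uniform step (1/128 of the companded range) seen in the signal domain
step_small = mu_expand(1 / 128) - mu_expand(0.0)      # near zero amplitude
step_large = mu_expand(1.0) - mu_expand(127 / 128)    # near full scale

print(step_small, step_large, step_large / step_small)
```

The step near full scale comes out a couple of hundred times larger than the step near zero, so the quantization noise tracks the signal level instead of staying fixed.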

Subband coding
Idea: Quantize separately in separate bands
[Figure: input -> analysis filters (M channels) -> downsample by M -> quantize Q[ ] -> Q^{-1}[ ] -> upsample by M -> reconstruction filters (M channels) -> output]
  - Q-noise stays within band, gets masked
‘Critical sampling’ -> 1/M of spectrum per band
[Figure: filter gain / dB vs. normalized freq, showing alias energy beyond the band edge]
  - some aliasing inevitable
Trick is to cancel with alias of adjacent band -> ‘quadrature-mirror’ filters

MPEG-Audio (layer I, II)
Basic idea: Subband coding plus psychoacoustic masking model to choose dynamic Q-levels in subbands
[Figure: input -> 32 band polyphase filterbank -> scale normalize -> quantize -> format & pack bitstream (data, scale indices) -> bitstream; psychoacoustic masking model supplies per-band masking margins as control & scales; 32 chans x 36 samples per 24 ms]
  - 22 kHz ÷ 32 equal bands = 690 Hz bandwidth
  - 8 / 24 ms frames = 12 / 36 subband samples
  - fixed bitrates 32 - 256 Kbps/chan (1-6 bits/samp)
  - scale factors are like LPC envelope?

MPEG Psychoacoustic model
Based on simultaneous masking experiments
Difficulties:
  - noise energy masks ~10 dB better than tones
  - masking level nonlinear in frequency & intensity
  - complex, dynamic sounds not well understood
Procedure
  - pick ‘tonal peaks’ in NB FFT spectrum
  - remaining energy -> ‘noisy’ peaks
  - apply nonlinear ‘spreading function’
  - sum all masking & threshold in power domain
[Figure: three panels of SPL / dB vs. freq / Bark (1 to 21) illustrating the procedure, ending with the resultant masking]

MPEG Bit allocation
Result of psychoacoustic model is maximum tolerable noise per subband
[Figure: level vs. freq for subband N, showing the masking tone, masked threshold, safe noise level, and quantization noise; SNR ~ 6·B]
  - safe noise level -> required SNR -> bits B
Bit allocation procedure (fixed bit rate):
  - pick channel with worst noise-masker ratio
  - improve its quantization by one step
  - repeat while more bits available for this frame
Bands with no signal above masking curve can be skipped
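The greedy procedure above can be sketched as a loop over a fixed bit budget. This is an illustration under simplifying assumptions, not the standard's actual allocation tables: each extra bit is taken to lower a band's noise by exactly 6 dB, a band with zero bits is treated as having noise equal to its full signal level, and the per-band levels are made-up numbers.

```python
def allocate_bits(signal_db, mask_db, total_bits, db_per_bit=6.0):
    """Greedy allocation: repeatedly give one bit to the band with the worst noise-to-mask ratio."""
    n_bands = len(signal_db)
    bits = [0] * n_bands

    def nmr(i):
        # Estimated noise level (signal minus ~6 dB per bit) relative to the masked threshold
        return (signal_db[i] - db_per_bit * bits[i]) - mask_db[i]

    for _ in range(total_bits):
        worst = max(range(n_bands), key=nmr)
        if nmr(worst) <= 0:
            break                 # all noise is already at or below the mask
        bits[worst] += 1
    return bits

signal_db = [60.0, 40.0, 20.0]    # assumed per-band signal levels
mask_db = [30.0, 35.0, 25.0]      # assumed per-band masked thresholds

print(allocate_bits(signal_db, mask_db, total_bits=10))   # [5, 1, 0]
print(allocate_bits(signal_db, mask_db, total_bits=3))    # [3, 0, 0]
```

The third band never receives bits because its signal is already below the masking curve, which is the "can be skipped" case; with a tight budget, all bits go to the band where noise would be most audible.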

MPEG Audio Layer III
‘Transform coder’ on top of ‘subband coder’
[Figure: Layer III encoder: digital audio signal (PCM, 768 kbit/s) -> filterbank (32 subbands, 0 to 31) -> MDCT with window switching (lines 0 to 575) -> distortion control loop / nonuniform quantization / rate control loop -> Huffman encoding -> bitstream formatting with CRC check -> coded audio signal (32 to 192 kbit/s); a 1024-point FFT feeds the psychoacoustic model; coding of side information; external control]
Blocks of 36 subband time-domain samples become 18 pairs of frequency-domain samples
  - more redundancy in spectral domain
  - finer control e.g. of aliasing, masking
  - scale factors now in band-blocks
Fixed Huffman tables optimized for audio data
Power-law quantizer

Adaptive time window
Time window relies on temporal masking
  - single quantization level over 8-24 ms window
‘Nightmare’ scenario:
[Figure: pre-echo distortion]
  - ‘backward masking’ saves in most cases
Adaptive switching of time window:
[Figure: window level vs. time / ms (0 to 100), showing normal, transition, and short windows]

The effects of MP3
Before & after:
[Figure: spectrograms (freq / kHz, 0 to 20, vs. time / sec, 0 to 10): "Josie - direct from CD"; "After MP3 encode (160 kbps) and decode"; "Residual (after aligning 1148 sample delay)"]
  - chop off high frequency (above 16 kHz)
  - occasional other time-frequency ‘holes’
  - quantization noise under signal

MP3 & Beyond
MP3 is ‘transparent’ at 128 Kbps for stereo (11x smaller than 1.4 Mbps CD rate)
  - only decoder is standardized: better psychological models -> better encoders
MPEG2 AAC
  - rebuild of MP3 without backwards compatibility
  - 30% better (stereo at 96 Kbps?)
  - multichannel etc.
MPEG4-Audio
  - wide range of component encodings
  - MPEG Audio, LSPs, ...
SAOL
  - ‘synthetic’ component of MPEG-4 Audio
  - complete DSP/computer music language!
  - how to encode into it?

Summary
For coding, every bit counts
  - take care over quantization domain & effects
  - Shannon limits...
Speech coding
  - LPC modeling is old but good
  - CELP residual modeling can go beyond speech
Wide-band coding
  - noise shaping ‘hides’ quantization noise
  - detailed psychoacoustic models are key

