Pohlmann (Ken) Principles of Digital Audio Summary

Digital Audio CD and Other Selected Digital Technologies
Based on Principles of Digital Audio, 4th Ed.
by Ken C. Pohlmann and Other Sources
Summary by Michael McGoodwin, prepared 2000


I prepared this summary primarily for my own benefit to improve my understanding of these important technologies. You may find it useful particularly in searching for relevant technical terms and acronyms in order to find their meaning in context. I have not attempted to present a comprehensive summary, only a listing of selected key, confusing, or complex points, specifications, design features and shortcomings, etc. which are of interest to me. I would be pleased to be notified of any errors.

I highly recommend the Pohlmann text to any technically-minded individual interested in improving his or her knowledge about these current and emerging technologies. Although the emphasis of the book is on audio technologies, there is much useful information also about video, cinefilm audio, DVD (Digital Versatile Disk), and other digital technologies. Page and chapter references are to that text unless otherwise indicated. I have attempted to translate measurements to a uniform set of units and have corrected several errors found in the text.

Chapter 1: Sound and Numbers
[Brief review of physical principles of sound and relevant mathematics]

Reviews fundamental sound principles including nodes, antinodes, frequency, wavelength, bandwidth, diffraction, sound pressure level, intensity level IL of sound in decibels dB (= 10 times the log of the power ratio or 20 times the log of the pressure ratio), reference sound pressure level (defined to be that at the threshold of hearing or 0.0002 dyne/cm², used as the denominator in dB calculations of sound levels), fundamental frequency and harmonics, periodic and aperiodic waveforms, timbre and overtones, Fourier theorem.

Binary number system invented by Gottfried von Leibniz, 1679. Fractional values can be expressed with a binary point just like a decimal point. Methods of designating negative numbers include signed magnitude (easy for humans to understand, but flawed by having both plus and minus zeros) and two's complement (more efficient for computers by allowing subtraction to be computed by an addition of the two's complement of the subtrahend). Binary coded decimal BCD "8-4-2-1 code" (in which each decimal digit is coded as 4 binary bits having weights or values of 8, 4, 2 and 1). Boolean algebra (George Boole 1854) and symbolic circuit representation of AND, OR, NOT (complement), XOR, NAND, NOR.
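The two's complement subtraction and the 8-4-2-1 BCD coding described above can be sketched as follows. This is only an illustration: the function names and the 8-bit default word size are assumptions made here, not taken from the text.

```python
def twos_complement(value: int, bits: int = 8) -> int:
    """Two's complement of value within a fixed word size (invert all bits, add 1)."""
    mask = (1 << bits) - 1
    return (~value + 1) & mask


def subtract_via_addition(a: int, b: int, bits: int = 8) -> int:
    """Compute a - b by adding the two's complement of the subtrahend b."""
    mask = (1 << bits) - 1
    return (a + twos_complement(b, bits)) & mask


def to_bcd(number: int) -> str:
    """Code each decimal digit as 4 bits with weights 8, 4, 2, 1."""
    return " ".join(format(int(digit), "04b") for digit in str(number))


if __name__ == "__main__":
    print(subtract_via_addition(9, 5))   # 4
    print(subtract_via_addition(5, 9))   # 252, the 8-bit pattern for -4
    print(to_bcd(259))                   # 0010 0101 1001
```

Note that a negative result appears as its unsigned bit pattern (252 = 11111100), which is how an 8-bit word represents -4.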

Chapter 2: Fundamentals of Digital Audio
[Review of basic overall principles]

In analog recording systems, the continuously varying amplitude of the sound waveform is translated to a continuously varying level of magnetism, LP groove amplitude, etc. In contrast, with digital audio encoding, discrete (noncontinuous) time sampling and amplitude quantization are required, a process that breaks the originally continuous and smooth waveform into a staircase or other pattern of pulses.

Sampling: Sampling theorem of Harry Nyquist 1928 (and his precursors) states that a continuous band-limited signal may be exactly represented (without any loss of data) by amplitude samples made at a sampling frequency S of at least twice the highest signal frequency component. In other words, the Nyquist (half-sampling) frequency S/2 is the highest frequency that can be accurately represented by a sampling frequency of S. The reconstruction of the original input analog signal is accomplished by an output lowpass filter which interpolates the staircase waveform pattern of the digital signal. Sampling frequency determines the audio bandwidth of the system. CDs use S = 44.1 kHz, and DVDs and other newer audio systems may use 48 kHz or higher.

Lowpass filter: If the signal contains frequencies higher than S/2, aliasing ("foldover") distortion will result, in which false frequencies lower than S/2 appear. Since it is not usually possible to have a filter cut off precisely ("brick wall" type) at S/2, it is necessary to have a guard band extending below S/2. Prevention of aliasing requires the analog input signal to be passed through an analog lowpass antialiasing filter before sampling to ensure that no higher frequencies are present.
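A small sketch of where an unfiltered tone lands after sampling may make the foldover idea concrete. The function name is hypothetical, and the example tones are chosen here for illustration only.

```python
def alias_frequency(signal_hz: float, sampling_hz: float) -> float:
    """Apparent frequency of a sampled tone, folded into the range 0..S/2."""
    folded = signal_hz % sampling_hz
    # Components in the upper half of each sampling interval mirror around S/2.
    if folded > sampling_hz / 2:
        return sampling_hz - folded
    return folded


if __name__ == "__main__":
    S = 44_100.0
    for tone in (10_000.0, 25_000.0, 43_100.0):
        print(f"{tone:8.0f} Hz sampled at {S:.0f} Hz appears at {alias_frequency(tone, S):.0f} Hz")
```

A 25 kHz tone sampled at 44.1 kHz therefore shows up as a false 19.1 kHz tone, which is why it has to be removed before sampling.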

Quantization (A/D conversion): Audio signals (as for CDs) have usually been coded linearly rather than, say, logarithmically. Digitization can never perfectly encode a signal; it is only an approximation due to measurement error (and moreover, the precise capabilities of human hearing are not known). 16 bits were originally thought to suffice for audio applications while minimizing storage requirements, but 20 bits of measurement range provides better (and probably adequate) measurement accuracy; DVDs can code up to 24-bit words. Quantization error is limited to 1/2 the LSB (least significant bit, i.e., the smallest possible quantized signal increment corresponding to a 1-bit interval). In digital systems, the intrinsic signal/error ratio (S/E, where S is the maximum expressible/recordable signal) is used analogously to the S/N ratio and can be computed as ~92 dB for 15-bit, ~98 dB for 16-bit, and ~146 dB for 24-bit quantization (each bit adds ~6 dB). [Note that S/E is a property deriving from the mathematics of digitization and depends only on the bit depth, not the specific circuitry etc.] At lower signal levels, quantization error is "correlated" (undesirable) with the signal and termed quantization distortion rather than noise (which is uncorrelated or random). The percentage of distortion increases as signal strength decreases... Low level signal quantization can induce aliased components ("granulation noise" and beat tones or "birdies") which may be unpleasantly audible. Dither must be added to prevent these problems (or sigma delta modulation and noise shaping used). Dithering adds, prior to sampling, a small amount of noise that is uncorrelated with the signal. This increases total noise in the form of white noise but reduces distortion. Dither effectively permits encoding below the LSB level (e.g., by pulse width modulation, in which data is encoded in the width of the pulses). The best dither probability density function (pdf) for listening purposes is a triangular pdf extending from -1 to +1 LSB, which increases total noise by 4.77 dB (Gaussian and rectangular pdfs are also used)... The word dither derives from ME "didderen" meaning "to tremble"; it was found that WWII airplane bomb sights performed better when vibrating than when still, similar to the improvement in accuracy gained from tapping a meter.
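The S/E figures and the 4.77 dB dither penalty quoted above can be reproduced with a short calculation. This sketch assumes the usual textbook model (a full-scale sine wave and quantization error uniformly distributed over one LSB); the function names are made up for illustration.

```python
import math


def signal_to_error_db(bits: int) -> float:
    """Theoretical S/E in dB for an ideal quantizer, about 6.02 * bits + 1.76."""
    # rms of a full-scale sine divided by rms of uniform quantization error
    # works out to 2**bits * sqrt(1.5).
    return 20 * math.log10(2 ** bits * math.sqrt(1.5))


def triangular_dither_penalty_db() -> float:
    """Noise increase from triangular-pdf dither spanning -1 to +1 LSB."""
    quantization_power = 1 / 12   # in LSB squared
    dither_power = 1 / 6          # variance of the triangular pdf, in LSB squared
    return 10 * math.log10((quantization_power + dither_power) / quantization_power)


if __name__ == "__main__":
    for n in (15, 16, 20):
        print(f"{n}-bit: {signal_to_error_db(n):.1f} dB")
    print(f"triangular dither penalty: {triangular_dither_penalty_db():.2f} dB")
```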

Chapter 3: Digital Audio Recording
[Overview of steps involved in digital recording]

In the analog realm, modulation types include AM and FM. Types of digital modulation include PWM pulse width modulation (pulses of variable width, uniform height), PPM pulse position modulation (pulse onset is variable), PAM pulse amplitude modulation (pulses have variable amplitude, uniform timing and width), PNM pulse number modulation (very narrow pulses firing at a variable rate, uniform pulse height and width), and PCM pulse code modulation {p. 50}.

PCM is the most commonly used (usually linear PCM or LPCM, as with CDs), devised 1937 by Alec Reeves for telephony. The quantized signal is coded with a look up table of optimized binary codes and various further modulation methods employed (such as EFM/NRZ/NRZI for CDs).

A typical recording system includes input amplifier, dither generator, input lowpass filter, sample-and-hold circuit, analog to digital converter ADC (newer generation designs usually combine mild lowpass filtering, oversampling, and decimation processing), multiplexer (to combine channels), digital processing and modulation circuit (channel coding), and a storage medium.

The input lowpass filter passes the passband frequencies and blocks the stopband; the border between the two is called the guard band, which is positioned in the region of S/2 (the Nyquist or half-sampling frequency). Filters may introduce phase distortion. Many are described by their mathematical polynomials: Bessel, Butterworth, Chebyshev, etc. Filters may have several cascaded stages or orders which improve the approximation to brick-wall filtering but worsen phase shift. No filter is ideal in all respects. Oversampling and sigma delta conversion allow a more gradual cutoff slope and have mostly replaced analog brick wall filtering.

The sample-and-hold (S/H) circuit takes samples of the analog signal at precise time intervals and holds them as in a capacitor until read by the ADC. Its output is a discrete PAM staircase (i.e., with contiguous pulses). Any variation in absolute timing comprises "jitter", which adds noise and distortion. Jitter is worst for high amplitude high frequency signals. Jitter must be less than 200 ps (picoseconds) for 16 bit sampling of a 20 kHz full amplitude sinewave and less than 100 ps for 20 bit in order to keep resultant noise below the quantization noise floor. Problems also arise if the hold signal "droops" before it is read due to capacitor leakage etc (the value must be held to within 1 mV during conversion). Typical circuits employ a junction field effect transistor JFET operational amplifier combined with a capacitor and require accurate high speed switching and acquisition time.

The analog to digital converter ADC is the most critical component. It must have a precision of 15 parts per million for 16 bit (65536 intervals) and 1 ppm for 20 bit. Speed and accuracy are required. Ideally, error should be no more than plus or minus 1/2 LSB to meet its nominal specification. Conversion time must be less than one sampling period T. Settling time or propagation errors can occur. Integral monotonic linearity is required, i.e., guaranteeing ± 1/2 LSB over the entire range with no missing output codes. Some drift with temperature may occur and a regulated power supply is needed. Code width ("quantum") is the range of analog input signal over which a given output value will occur; ideally it should be 1 LSB... With current technology, true 24 bit A/D conversion is difficult or impossible to achieve: this would correspond to a quantization error floor of -145 dBFS (dB full scale) [which is consistent with the ~146 dB S/E that 6.02n + 1.76 dB gives for n = 24]. If a full scale signal of 2 volts rms were assumed, the finest resolution would be ~0.1 µV, which is the approximate level of thermal noise. (Despite this limitation, internal processing may require longer word lengths.) ADC types include the successive approximation register SAR type, but these have largely been replaced by oversampling systems.
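The precision figures above follow from the number of quantization intervals. A minimal sketch, with hypothetical function names, and treating the 2 V figure simply as the full-scale span (an assumption made here to match the ~0.1 µV estimate):

```python
def resolution_ppm(bits: int) -> float:
    """One quantization interval as parts per million of full scale."""
    return 1e6 / 2 ** bits


def lsb_volts(full_scale_volts: float, bits: int) -> float:
    """Size of one LSB step for a given full-scale span."""
    return full_scale_volts / 2 ** bits


if __name__ == "__main__":
    print(f"16-bit: {resolution_ppm(16):.2f} ppm")                  # about 15 ppm
    print(f"20-bit: {resolution_ppm(20):.2f} ppm")                  # about 1 ppm
    print(f"24-bit LSB at 2 V: {lsb_volts(2.0, 24) * 1e6:.3f} uV")  # about 0.1 uV
```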

Oversampling uses a high oversampling rate R, say 72 (e.g., 72 x 44.1 kHz = 3.1752 MHz), and a low output word size of one or a few bits. Oversampling converters obviate brick wall filters and provide increased resolution. Sigma delta modulation SDM conversion is employed, along with a decimation filter to downsample the oversampled stream (i.e., keep only a fraction of the input pulses).

Steps in recording include preemphasis equalization (boosting) of high frequencies prior to storage to improve their S/N ratio (e.g., up to a 10 dB boost maximum, with time constants of 50 µsec and 15 µsec providing a 6 dB/octave slope from 3183 to 10610 Hz; a complementary deemphasis is used on playback). Further steps are multiplexing (to combine multiple channels in a single serial stream), adding redundant parity data for error detection and correction, interleaving (to spread the data widely in the bitstream in order to improve error recoverability), grouping data into frames (which include synchronization and data headers [e.g., for searching]), and channel coding.
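The 3183 Hz and 10610 Hz figures in the preemphasis description are the corner frequencies belonging to time constants of 50 µs and 15 µs, via f = 1/(2πτ). A brief sketch under that assumption (function names are illustrative):

```python
import math


def corner_frequency_hz(time_constant_s: float) -> float:
    """Corner frequency of a first-order network with the given time constant."""
    return 1.0 / (2.0 * math.pi * time_constant_s)


def max_boost_db(tau_low_s: float, tau_high_s: float) -> float:
    """Asymptotic high-frequency boost of a shelving network with two time constants."""
    return 20.0 * math.log10(tau_low_s / tau_high_s)


if __name__ == "__main__":
    print(f"{corner_frequency_hz(50e-6):.0f} Hz")     # lower corner
    print(f"{corner_frequency_hz(15e-6):.0f} Hz")     # upper corner
    print(f"{max_boost_db(50e-6, 15e-6):.1f} dB")     # roughly the 10 dB maximum boost
```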

Channel coding (channel modulation) creates the actual modulated signal (channel code) which is stored on the media. The A/D converter's output 0's and 1's are usually not directly stored but are converted to a more efficiently encoded form which also incorporates error detection and correction capability, etc. Channel codes for audio are usually combined with a clock pulse to make the encoded waveform self-clocking (i.e., the time sequence of the pulses can be reconstructed independent of the playback rate); adding clocking data of course increases the final channel bit rate and reduces bit density. T_min is the minimum allowable time interval between transitions (i.e., 0 to 1 or 1 to 0) in the modulated (final) recorded channel code and is often determined by the storage medium. It determines the highest frequency that must be stored (a longer T_min lowers the highest frequency in the channel code, so more data can be packed into a given bandwidth). Code data storage efficiency decreases with the addition of increasingly accurate (and therefore frequent) clocking data so the relative requirement for each must be weighed. The density ratio DR is the ratio between T_min and the length of a single bit period T (the shortest time interval between transitions of input data to be modulated), i.e. DR = T_min/T. T_max is the longest interval of time between transitions allowed in the modulated signal; increasing clocking accuracy lowers T_max. The window margin T_w (phase margin or jitter margin) reflects the tolerance for errors in locating the transition (or resistance to jitter, higher is better). The product of DR and T_w is the Figure of Merit FoM (higher is better). For efficiency, transitions between values are usually stored rather than absolute PCM values. Coding must restrict DC content in many applications (which can be monitored by the digital sum value DSV). DC content reduces S/N ratio and can cause clock synch problems, degrading bandwidth.

Modulation techniques vary in clock rate, DC content, DR, FoM, etc. and include {p. 72, 77}:

Simple codes [each information transition is coded separately]. These include nonreturn to zero NRZ, nonreturn to zero inverted NRZI, binary frequency modulation FM, phase modulation PM (also called phase encoding, biphase level mod., or Manchester code), and modified freq. modulation MFM (also called delay modulation or Miller code, used in hard drives).
Group codes [wherein a code translation or lookup table is used to convert groups of m input data bits into output words of n bits (which incorporate parity error correction code etc.)]. The code rate m/n, which also equals T_w in units of T, is a measure of the coding efficiency (higher is more efficient). Group codes represent a type of run length limited RLL coding (and therefore the spacing of transitions may be a multiple of T). They specify a minimum d and a maximum k of 0's between successive 1's, expressed as (d,k). Group coding methods include Group Coded Recording GCR, Three-Position Modulation 3PM (2,7), Eight to Fourteen Modulation EFM (8 data bits to 14 channel bits, with run lengths (2,10)) [used in CDs], Zero Modulation ZM (1,3), HDM-1 [used in professional digital tape recorders], and 8/10 [used with DAT] etc.
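The channel-code parameters defined earlier can be derived from m, n, d and k. The sketch below assumes the common run-length-limited relationships T_min = (d+1)(m/n)T, T_max = (k+1)(m/n)T and T_w = (m/n)T, and takes the figure of merit as DR x T_w; these formulas and the function name are assumptions added here. For EFM it uses 17 channel bits per 8 data bits (14 plus 3 merging bits) with (d,k) = (2,10).

```python
from fractions import Fraction


def rll_parameters(m: int, n: int, d: int, k: int) -> dict:
    """Channel-code timing figures, all expressed in units of the data bit period T."""
    rate = Fraction(m, n)          # code rate, also the window margin T_w
    t_min = (d + 1) * rate         # shortest spacing between transitions
    t_max = (k + 1) * rate         # longest spacing between transitions
    density_ratio = t_min          # DR = T_min / T
    return {
        "T_min": float(t_min),
        "T_max": float(t_max),
        "T_w": float(rate),
        "DR": float(density_ratio),
        "FoM": float(density_ratio * rate),
    }


if __name__ == "__main__":
    print("EFM:", rll_parameters(m=8, n=17, d=2, k=10))
    print("MFM:", rll_parameters(m=1, n=2, d=1, k=3))
```

On these assumptions EFM gives DR of about 1.41 and FoM of about 0.66, against DR = 1 and FoM = 0.5 for MFM.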

Chapter 4: Digital Audio Reproduction
[Overview of steps involved in digital reproduction or playback]

Reproduction (i.e., playback) typically entails the following steps prior to D/A conversion:

Waveform shaping of the degraded recorded coded signal to reconstruct the recorded code,
Extraction of time based code to synchronize individual frames (data is held in a buffer and phase locked loops are used to reconstruct the exact timing),
Demodulation of the group code typically to NRZ code (i.e., conversion of group code to a simple readable code)
Demultiplexing to restore parallel structure of multiple channels etc.
Extraction and processing of error code, and error correction algorithms applied
Deinterleaving

Digital to Analog Converter D/A or DAC: These demand great precision. With 16 bits and a ± 10V scale, the voltage steps to be outputted are ~0.0003 V and are ~0.000001V with 24 bits. Problems include differential nonlinearity (wide or narrow codes causing missing codes, which are worst for low-level signals and can result in audible harmonic and intermodulation distortion) and nonmonotonicity. D/A converters must have greater dynamic range (bit depth) than the audio signal itself, a fast settling time etc. Testing should use test frequencies that are not correlated with the sampling freq. Types include weighted-resistor DAC (not efficient for manufacturing), R-2R ladder DAC (easier to manufacture). Zero-cross distortion (arising where the signal goes from positive to negative) can be audible and is aggravated by dithering. Some manufacturers use 18- or 20-bit conversion to convert 16-bit PCM code in order to improve fidelity, reasoning that signal digitization and processing steps should have a greater dynamic range than the final recording. Oversampling does not ultimately create new information—it makes better use of existing information. DAC low level distortion is best measured by the dB ratio of maximal signal to the broadband noise (0 to 20 kHz) when reproducing a -60 dB signal—this can allow the computation of the effective number of bits ENOB, which might be 14.7 for a nominally 16-bit DAC. In practice today, sigma-delta converters and high oversampling rates are used.
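The step sizes and the effective number of bits mentioned above can be checked numerically. This is a rough sketch: it assumes the common conversion ENOB = (dynamic range in dB - 1.76) / 6.02, and the 90.25 dB example value is invented here to show how a 14.7-bit figure could arise.

```python
def dac_step_volts(span_volts: float, bits: int) -> float:
    """Voltage of one output step for a converter covering span_volts."""
    return span_volts / 2 ** bits


def effective_number_of_bits(dynamic_range_db: float) -> float:
    """Effective bits implied by a measured signal-to-noise-and-distortion figure."""
    return (dynamic_range_db - 1.76) / 6.02


if __name__ == "__main__":
    # A scale of +/-10 V is a 20 V span.
    print(f"16-bit step: {dac_step_volts(20.0, 16):.6f} V")
    print(f"24-bit step: {dac_step_volts(20.0, 24):.7f} V")
    print(f"ENOB at 90.25 dB: {effective_number_of_bits(90.25):.1f}")
```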

Output Sample-and-Hold S/H Circuit (Aperture circuit): This samples and holds the analog output from the DAC until it is stable, in order to allow the removal of switching glitches (transient oscillations arising as the various bits switch at different times etc.) and compensation for a frequency response anomaly called aperture error. The latter results in attenuation of high frequencies and arises from the finite pulse width. The output should be a perfect analog PAM staircase signal but instead it follows a lowpass sin(x)/x function [see below], in which the sharp corners of the staircase have been rounded. Aperture error can be minimized by making the pulse width (hold time) equal to 1/4 of the sample period 1/f_s; this yields an attenuation of 0.2 dB at the Nyquist frequency. Alternatively, the digital lowpass filter can employ frequency compensation for aperture error and can prevent glitches, obviating the use of an S/H circuit in creating a proper PAM staircase function.
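The 0.2 dB figure can be checked from the sin(x)/x response of a rectangular hold pulse. In this sketch x = pi * f * (hold time), with the hold time given as a fraction of the sample period; the function name is hypothetical.

```python
import math


def aperture_attenuation_db(freq_hz: float, sampling_hz: float, hold_fraction: float) -> float:
    """Attenuation in dB caused by holding each sample for hold_fraction of a period."""
    x = math.pi * freq_hz * hold_fraction / sampling_hz
    if x == 0:
        return 0.0
    return -20.0 * math.log10(math.sin(x) / x)


if __name__ == "__main__":
    fs = 44_100.0
    nyquist = fs / 2
    print(f"full-period hold:    {aperture_attenuation_db(nyquist, fs, 1.0):.2f} dB")
    print(f"quarter-period hold: {aperture_attenuation_db(nyquist, fs, 0.25):.2f} dB")
```

Holding for the full sample period costs nearly 4 dB at the Nyquist frequency, while a quarter-period hold costs only about 0.2 dB.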

Output Lowpass (Anti-imaging or smoothing) Filter: This stage is analogous to the input anti-aliasing filter used with recording [but has been supplanted by digital filtering prior to DAC]. This filter must remove all frequency content above the half-sampling frequency, thereby converting the now analog PAM staircase waveform into a smoothly continuous final waveform. Such filters must have a flat passband and highly attenuated stopband (including even the extreme high frequencies, where artefacts arising from digital processing are found), with a steep cutoff slope. Phase shifts must be minimized or corrected. Oversampling techniques have eliminated the need for brick-wall filters. Even though stopband frequencies for high-quality audio systems are inaudible, they must still be filtered out to prevent downstream "modulation" into audible frequencies.

The impulse response h of a filter (or of any other component or system) is the output resulting when an infinitesimally short maximal intensity impulse is input, and can fully characterize the filter's behavior [presumably it is analogous to the line spread function used in imaging]. It can be characterized in the time domain or frequency domain, related through the Fourier transform. In the frequency domain, the filter's graph is the transfer function TF [modulation transfer function MTF] and would ideally appear as a rectangular function, maximal below the filter cutoff frequency f_c and zero above that value. Multiplying the frequency spectrum of a signal by a filter's transfer function in the frequency domain is equivalent to convolving the input time-domain signal with the filter's impulse response. [For more info on the fast Fourier transform FFT, convolution etc., you might wish to Google these terms or search Wikipedia.] Continuous convolving, notated as x * h, is defined as:

$$ (x * h)(t) = \int_{-\infty}^{\infty} x(\tau)\, h(t - \tau)\, d\tau $$

but in practice discrete finite rather than infinite continuous techniques are employed (using a limited number of samples) to allow a finite number of calculations. An ideal brick wall filter in the frequency domain [i.e., 1 up to the cutoff frequency f_c and 0 for higher frequencies] has a sin(x)/x curve in the time domain (this function when appropriately scaled on the horizontal axis has a value of 1 at 0 and 0 at 1/f_s).
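The discrete, finite form of convolution mentioned above can be sketched as a pair of loops. This is a plain illustration (the function name is chosen here), not an efficient implementation.

```python
def convolve(signal, impulse_response):
    """Discrete convolution: each input sample launches a scaled copy of h."""
    output = [0.0] * (len(signal) + len(impulse_response) - 1)
    for n, x_n in enumerate(signal):
        for k, h_k in enumerate(impulse_response):
            output[n + k] += x_n * h_k
    return output


if __name__ == "__main__":
    # A unit impulse returns the impulse response itself.
    print(convolve([1, 0, 0], [0.5, 0.25]))   # [0.5, 0.25, 0.0, 0.0]
    # A two-point running sum applied to a short ramp.
    print(convolve([1, 2, 3], [1, 1]))        # [1.0, 3.0, 5.0, 3.0]
```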

Digital Filter: These are used prior to D/A conversion and have supplanted analog brick-wall filters. After D/A conversion, a gentle low order analog filter removes remaining high frequency images. Usually these are finite impulse response FIR filters and can be simple in design when oversampling is used. For example, a 4x filter (oversampling ratio R=4) adds 3 interpolated intermediate sample points for every original value and employs a sampling frequency f_a of 176.4 kHz. Each intermediate sample is multiplied by the appropriate sin(x)/x coefficient so that ... in operation the effect is the same as a brick wall filter. Spectral images (sidebands) can be removed with a gentle analog filter because they appear only at high frequencies, near the oversampling frequency f_a and above. A typical digital filter utilizes a digital transversal filter with tapped delay lines, multipliers, and an adder. The multiplication coefficients are 12-bit values {p. 100}. FIR filters use a limited range of the sin(x)/x function centered about 0 and are therefore "finite" (the range over which samples are included is selected to maintain the error below the system's overall resolution, typically c. 100 coefficients, or 50 on each side of 0 on the x axis). An oversampling rate of 8x is the upper limit of usefulness with traditional D/A converters. Digital filters are not affected by temperature. Oversampling also allows spectral noise shaping because the noise power is shifted to higher frequencies. Noise shaping is done by requantizing the oversampled signal prior to D/A conversion. E.g., 28 bit words from the filter are rounded off and dithered to create the most significant 16 bit words. The 12 least significant bits are typically delayed 1 sampling period and subtracted from the next data word. This noise shaping decreases the noise floor by 7 dB in the audio band (though it increases noise in higher bands).
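The requantization step with error feedback can be sketched as follows. This is a simplified first-order illustration: it rounds rather than truncates, omits the dither, and the function name and the sign convention for the fed-back error are assumptions made here rather than details from the text.

```python
def requantize_with_noise_shaping(samples, drop_bits: int = 12):
    """Shorten integer words by drop_bits, feeding each rounding error into the next word."""
    half_step = 1 << (drop_bits - 1)
    shortened = []
    error = 0
    for word in samples:
        corrected = word - error                          # subtract the delayed error
        quantized = (corrected + half_step) >> drop_bits  # round to the nearest step
        error = (quantized << drop_bits) - corrected      # error made on this word
        shortened.append(quantized)
    return shortened


if __name__ == "__main__":
    # A constant 28-bit input of 1000 is well below half of one 16-bit step (2048).
    steady = [1000] * 4096
    plain = [(w + 2048) >> 12 for w in steady]
    shaped = requantize_with_noise_shaping(steady)
    print(sum(plain), sum(shaped))   # plain rounding loses the signal; shaping keeps its average
```

Plain rounding turns this low-level input into all zeros, whereas the shaped output occasionally emits a 1 so that its long-term average still equals the input.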

Alternate coding and quantizing methods: Though most systems use linear PCM (LPCM) coding, some use companding (dynamic compression and decompression employing quantization intervals that differ between the low amplitudes and the high amplitudes). These may be floating point or logarithmic etc., and may be adaptive and change their method in response to the character of the signal. Block floating point systems allow data reduction. S/N may increase for small signals. Examples of companding systems are the µ-law (e.g., µ-255 for speech applications, defined by the CCITT [International Telegraph and Telephone Consultative Committee] for telephony etc.) and A-law companding (logarithmic, also defined by the CCITT, used for speech; these use 8 kHz sampling with 8-bit words, so the final bitrate is 64 kbps [kilobits per second]). Differential PCM... Delta modulation is a true 1-bit method using a very high sampling frequency but fails to perform well in high fidelity applications (though sigma-delta modulation works very well and is used with the SACD). Other methods include adaptive delta modulation ADM (including continuously variable slope delta modulation CVSDM), companded predictive delta modulation CPDM, and adaptive differential pulse-code modulation ADPCM (used by the CD-ROM/XA and widely in telecommunications, in QuickTime and Windows, in .aiff and .wav files, and can incorporate perceptual coding).
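The µ-law idea can be illustrated with the continuous companding curve y = sgn(x) ln(1 + µ|x|) / ln(1 + µ). This sketch shows only that curve, not the segmented 8-bit code tables used in actual telephony equipment; the function names are illustrative.

```python
import math


def mu_law_compress(x: float, mu: float = 255.0) -> float:
    """Compress a sample in the range -1..1 with the continuous mu-law curve."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


def mu_law_expand(y: float, mu: float = 255.0) -> float:
    """Inverse of mu_law_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)


if __name__ == "__main__":
    for sample in (0.001, 0.01, 0.1, 1.0):
        compressed = mu_law_compress(sample)
        print(f"{sample:6.3f} -> {compressed:.3f} -> {mu_law_expand(compressed):.3f}")
    print("bit rate:", 8_000 * 8, "bits per second")   # 8 kHz sampling, 8-bit words
```

Small amplitudes occupy a disproportionately large share of the output range, which is what gives quiet signals finer quantization.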

Timebase correction: Uses phase-locked loops to resynchronize a receiver with the transmitted channel code's clock data. Any variation can be characterized as jitter. Jitter manifests as broadband noise (from random jitter) or a single spectral line (from periodic jitter). The oscilloscopic eye pattern displays successive transitions and can demonstrate jitter. Jitter can arise from the interface as well as from sampling, and in the storage media itself... Sigma delta converters can be very sensitive to jitter, but it is not usually a problem in well designed audio equipment.

Chapter 5: Error Correction
[Mathematical and Physical Technologies for Error Detection, Correction, and Prevention]

Errors are inevitable but by means of robust error correction systems, CD and DVD can have uncorrectable error rates as low as that specified for computers, i.e., 10^-12 (one uncorrectable error in one trillion). Audio applications do not require this degree of accuracy.

Sources of error: Include dropouts from the media (oxide wear, fingerprints, scratches), signal degradation (reflection, intersymbol interference, impedance mismatches, RF interference).

Measures of error: The burst length is the maximum number of adjacent erroneous bits that can be fully corrected. The bit-error rate BER is the number of error bits per total bits. Optical disk systems can handle BERs of 1:100000 to 1:10000 (10^-5 to 10^-4). The block error rate BLER is the number of blocks or frames per second having at least one incorrect bit. The burst error length BEL is the number of consecutive blocks in error.

Methods of correction: Goal is to introduce redundancy to permit validity checking and error detection, error correction code ECC to replace errors with calculated valid data, and error concealment to substitute approximate data for uncorrectable invalid data. Redundancy includes repeating the data, adding single-bit parity bits (to check if odd or even), checksums (e.g., weighted checksums computed modulo 11), and cyclic redundancy check code CRCC.

CRCC uses a parity check word obtained by dividing a k-bit data block by a fixed number (generation polynomial g); the check word is appended to the data block to create the transmission polynomial v. When the data u is received, it is divided by the same g, and the result subtracted from the original checksum to yield the syndrome c: a zero syndrome indicates no error. Error correction can be accomplished using mathematical manipulation and modulo arithmetic {p. 138}... Polynomial notation is the standard terminology in the field: e.g., the fixed number 1001011 (MSB leading) is represented as 1·x⁶ + 0·x⁵ + 0·x⁴ + 1·x³ + 0·x² + 1·x¹ + 1·x⁰ or x⁶ + x³ + x + 1. CRCC is typically used as an error pointer and other methods are used for correction.
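The divide-and-append procedure can be sketched with modulo-2 (XOR) long division on bit strings. The generator 1001011 is the example number from the text; the data word and the function names are made up for illustration, and real systems use standardized generator polynomials.

```python
def mod2_remainder(bits: str, generator: str) -> str:
    """Remainder of modulo-2 long division of bits by generator."""
    work = [int(b) for b in bits]
    gen = [int(b) for b in generator]
    for i in range(len(work) - len(gen) + 1):
        if work[i]:                          # divide only where the leading bit is 1
            for j, g in enumerate(gen):
                work[i + j] ^= g             # subtraction modulo 2 is XOR
    return "".join(str(b) for b in work[-(len(gen) - 1):])


def crc_encode(data: str, generator: str) -> str:
    """Append the check word so that the whole block divides evenly by the generator."""
    padding = "0" * (len(generator) - 1)
    return data + mod2_remainder(data + padding, generator)


def syndrome(received: str, generator: str) -> str:
    """All zeros means no detected error."""
    return mod2_remainder(received, generator)


if __name__ == "__main__":
    g = "1001011"
    block = crc_encode("110100111011", g)
    print(block, syndrome(block, g))
    corrupted = block[:4] + ("1" if block[4] == "0" else "0") + block[5:]
    print(corrupted, syndrome(corrupted, g))   # nonzero syndrome flags the error
```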

Error correction techniques employ block codes having row and column parity (CRCC are a subclass of linear block codes), convolutional or recurrent codes (which introduce a delay), and interleaving including cross-interleaving.

Reed-Solomon R-S codes (Irving Reed and Gustave Solomon 1960) employ polynomials derived from Galois fields to encode and decode block data. They are a subclass of q-ary BCH codes which are a subclass of Hamming codes {p. 153}. They are especially effective in correcting burst errors and are widely used in audio, CD, DAT, DVD, direct broadcast satellite, and other apps. Cross-Interleave Reed-Solomon Code CIRC is used in CDs. It includes the use of C2 then C1 encoders (C1 then C2 on decoding). The C1 level of CIRC is meant to correct small, random errors. The C2 level corrects larger errors and burst errors. Interleaving is used between the C2 (28,24) and C1 (32,28) encoders and deinterleaving is needed on decoding. (28,24) means that 28 8-bit words are output for each 24 8-bit words of input; the final output is 32 8-bit words, of which 8 are for "parity" and 24 are actual data. The cross-interleaving stores one C2 word in 28 different blocks spanning a distance of 109 blocks using delay lines etc., crossing the data array in two directions (thus "cross"). With audio CDs, CIRC can correct burst errors up to 3874 consecutive erroneous bits (2.5 mm track length) and can well conceal 13,282 error bits (8.7 mm) and marginally conceal 15,500 bits. The CD standard requires a block error rate BLER [the number of data blocks that have any bad symbols at the initial C1 error correction stage] of less than 220 per second averaged over 10 seconds (50 would be typical). There are 7350 blocks/sec on a CD (a block or frame, derived from 24x8=192 bits input data, is 32x8=256 bits output to modulator). The resulting CD data rate = 1.4112 Mbps (input data rate, not including parity bits added by CIRC and EFM), so the maximum Red Book BLER of 220/sec (averaged over 10 sec) allows 3% of the blocks to be erroneous. E12 is the rate of single symbol errors at the C2 decoder, which are correctable. E22 expresses the rate of double symbol errors at the C2 decoder; these are the worst but still correctable errors [the first number is always the number of errors and the second number is always the decoder level]. E32 errors are triple symbol errors at C2 and are uncorrectable and require interpolation; they should not appear in a new CD and are unacceptable in a CD-ROM. Other measures of error are the E11, E21, E31. The burst error count BST combines E21 & E22 and expresses the number of consecutive C2 block errors that occur in excess of a threshold value such as 7. A new CD might typically have a raw bit error rate of 1E-5 to 1E-6, BLER = 5, E11 = 5, E22 = 0 and E31 = 0 and should never have E32 uncorrectable errors. Digital audio data can be copied with high reliability.
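The frame and data-rate figures above are tied together by simple arithmetic, sketched here. The function and key names are hypothetical; the input values are the ones given in the text.

```python
def cd_frame_figures(sample_rate_hz: int = 44_100, bits_per_sample: int = 16,
                     channels: int = 2, data_words_per_frame: int = 24,
                     coded_words_per_frame: int = 32, max_bler_per_sec: int = 220) -> dict:
    """Derive CD frame rate and related figures from the basic audio parameters."""
    audio_bits_per_sec = sample_rate_hz * bits_per_sample * channels
    frames_per_sec = audio_bits_per_sec // (data_words_per_frame * 8)
    return {
        "audio_bits_per_sec": audio_bits_per_sec,
        "frames_per_sec": frames_per_sec,
        # Bits per second leaving the CIRC encoder, before EFM modulation.
        "coded_bits_per_sec": frames_per_sec * coded_words_per_frame * 8,
        "max_bad_frame_percent": 100.0 * max_bler_per_sec / frames_per_sec,
    }


if __name__ == "__main__":
    for name, value in cd_frame_figures().items():
        print(name, value)
```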

Error concealment includes interpolation (may be low or high order, zero order simply holds the last good value) and muting.

Chapter 6: Magnetic Tape Storage

Digital tape recording requires a much higher bandwidth than analog. Saturation recording is used. Base thickness can be less than for analog tape because print-through is not a problem: typically 20 µm, and oxide thickness of 5 µm. Magnetic particles of higher magnetic energy levels are used, allowing higher packing densities. DAT tapes have a coercivity of 1500 oersteds. The binder is much more durable than that of analog tape, thus minimizing dropout. Intersymbol interference ultimately limits the maximum recording density. With longitudinal recording, the areal density in kilobits per in² is critical and increases as individual track width decreases and as particle length decreases ... Hard drive areal densities of 10 Gbpi² have been achieved and 40-100 are anticipated. A tape can consist of multiple data tracks. Vertical magnetic recording will play a larger part in the future. Heads can be stationary or rotating.

DASH is digital audio stationary head, used for professional 24- and 48-track recorders with a 1/4 or 1/2 inch tape width. Note that the tracks recorded may not correspond to the channels of data (e.g., there may be 12 tracks per channel and as few as 2 channels). The linear packing density per track is 38.4 kbpi using 25.6 thousand flux reversals per inch and HDM-1 modulation code. Tape speeds can be 30 ips, 15 ips, and 7.5 ips for 48 kHz sampling and for 44.1 kHz sampling they can be 70.01 cm/s, 35 cm/s, and 17.5 cm/s.

Rotary head recorders allow a higher recording rate but are difficult to use for editing. Analog videotape recorders are used to record standard NTSC (National Television Systems Committee, 30 [29.97 color] frames per second of 525 lines interlaced, scan rate of 15.75 kHz) and PAL/SECAM (Phase-alternation line/sequential and memory standard [European] 25 frames/sec at 625 lines) signals. Because of interlacing in NTSC, two fields make up one video frame. In VHS, analog FM modulation is used and luminance (brightness) and chrominance (color) are recorded... Professional digital audio processors can record digital audio tape masters using NTSC format and the professional U-matic format... Professional digital video formats... Tascam modular digital multitrack...

Chapter 7: Digital Audio Tape DAT
[Summary of Audio and Tape backup uses]

DAT was the first mass-market digital tape recorder. R-DAT (rotary as opposed to stationary head DAT) has become the standard, introduced March 1987. The 30-mm diameter head rotates with a track angle of c. 6.5 degrees wrt horizontal. DAT supports sampling frequencies of 32, 44.1, and 48 kHz (some support 96 kHz) with 2 or 4 channels and 12- or 16-bit quantization. With a sampling freq. of 48 kHz and 16-bit quantization 2-channel stereo, the audio data rate is 1.536 Mbps (subcode, ECC, and automatic track following ATF code raise the overall maximum transmission rate to 2.77 Mbps). The DAT cassette has 3.81 mm ("4 mm") tape with total thickness 13 µm and oxide thickness 3 µm. DATs can play 120 minutes of audio (16-bit 44.1 kHz stereo) on 60 meters of tape, thus "120M": this equals 120x60x44100x16x2/8 or 1.27 GBytes of audio data per tape or 2.2 GBytes overall information per tape including overhead. [There seems to be some confusion as to whether commercial tapes are listing their length in minutes or meters.] Thinner tapes etc. can lengthen recording capability. Each channel bit occupies 0.67 µm, and overall areal density is 114 Mbits/inch², with track widths of only 13.6 µm. The drum rotates at 2000 rpm. Tape deterioration occurs after about 200 uses. Tape speed is 8.15 mm/sec (c. 1/4 inch/sec) but the relative speed of tape to the head is much higher at 3 m/s. 8-10 modulation is used with Double Reed Solomon code, and final data are written to tape with NRZI modulation.
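The DAT data rate and capacity figures above come from straightforward multiplication, sketched here with hypothetical function names.

```python
def linear_pcm_bits_per_sec(sample_rate_hz: int, bits_per_sample: int, channels: int) -> int:
    """Raw audio bit rate of linear PCM, excluding subcode and error correction."""
    return sample_rate_hz * bits_per_sample * channels


def audio_capacity_bytes(minutes: int, sample_rate_hz: int,
                         bits_per_sample: int, channels: int) -> int:
    """Bytes of audio data in a recording of the given length."""
    bits = minutes * 60 * linear_pcm_bits_per_sec(sample_rate_hz, bits_per_sample, channels)
    return bits // 8


if __name__ == "__main__":
    print(linear_pcm_bits_per_sec(48_000, 16, 2))     # 1536000, i.e. 1.536 Mbps
    print(audio_capacity_bytes(120, 44_100, 16, 2))   # 1270080000, i.e. about 1.27 GBytes
```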

Consumer audio DAT units incorporate a Serial Copy Management System SCMS circuit preventing multigenerational copies carried on an S/PDIF interface (Sony/Philips digital interface for consumer connections) if the copy-inhibit flag is set in the subcode bit stream. The professional AES3 [or AES/EBU Audio Engineering Society/European Broadcasting Union] connector for professional DAT does not carry SCMS data so there is no copy protection.

The use of DAT for data storage led to the Digital Data Storage DDS format. The same transport is used and 1300 [uncompressed] MB can be stored on one [DDS120] tape with an error rate of 10^-15 (DDS utilizes a third layer of error correction). It can be rapidly searched using ID code in the subcode. [This text does not discuss the DDS-2 format which can store 4 GB uncompressed data on a 120m ?meter/minute tape].

Chapter 8: Optical Disk Storage
[Summary of physics of light and overview of optical storage techniques including cinema]

Light theory: Light provides high bandwidth. Visible wavelengths range from c. 400 nm (violet, f = 7.5E14 Hz) to c. 750 nm (red, f = 4E14 Hz), but optical storage uses from c. 600 to 900 nm and fiber optics use c. 850 to 1600 nm. Speed of light c = 3E8 m/sec (and f = c/wavelength). Refraction (deflection) occurs when light passes between media with differing index of refraction (measured for a given wavelength as the ratio of its speed in a vacuum to its speed in the medium). Refraction is smallest for red light, greatest for violet. For light travelling in the denser medium and striking an interface at or beyond the critical angle (measured from the normal and defined by Snell's law), there is total reflection.
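
As a quick check of the relations in this paragraph, the sketch below computes f = c/wavelength for a few wavelengths and the critical angle from Snell's law. The index 1.55 is the polycarbonate value quoted later in this summary; the function names are my own, and the angle is assumed to be measured from the normal.

```python
import math

C = 3e8  # speed of light in m/s, rounded as in the text

def frequency_hz(wavelength_nm):
    """Frequency of light of the given vacuum wavelength (f = c / wavelength)."""
    return C / (wavelength_nm * 1e-9)

def critical_angle_deg(n_dense, n_rare=1.0):
    """Critical angle from Snell's law: n_dense * sin(theta_c) = n_rare * sin(90 deg)."""
    return math.degrees(math.asin(n_rare / n_dense))

for nm in (400, 750, 780):
    print(f"{nm} nm -> {frequency_hz(nm):.2e} Hz")

# Polycarbonate (n = 1.55) to air
print(f"critical angle: {critical_angle_deg(1.55):.1f} degrees from the normal")
```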

Diffraction of light occurs when it passes through a narrow aperture small in relation to the wavelength (e.g., one half the wavelength). First observed by Joseph von Fraunhofer. A wavefront passing through a narrow circular aperture produces an Airy pattern or disk {named for George Airy 1835, p. 223} in which the zero-order maximum (containing 83% of the total light) appears straight ahead on the original axis and is surrounded by circular rings of higher order maxima and minima produced by constructive and destructive interference.

Digital Audio: Airy pattern

As the aperture through which the light passes decreases, the angle subtended by the first-order maximum cone increases. The effect of this phenomenon is that to optimally view a small object, its reflected [or transmitted in the case of microscopy] and diffracted light wavefronts must be collected with a wider angle lens system. The numerical aperture of a lens system in part expresses how wide the viewing angle of a system is. [The N.A. = i sin θ where i is the index of refraction of the medium in which the lens is working, and θ is one half of the angular aperture of the lens, i.e., the angle extending in both directions from the central axis to the limit of light gathering.]

When attempting to focus a laser beam to a narrow spot, the result is an Airy pattern whose size is determined by the light wavelength (smaller wavelength yields smaller spot) and the numerical aperture of the focusing lens (larger NA yields smaller spot). Laser pickup optics are diffraction-limited. An optical microscope cannot resolve the 3-D details of the pits of a CD; a scanning electron microscope must be used. The optical resolution of a system is the spacing between two small objects producing Airy patterns that can just be resolved (occurs when the central peak of one coincides with the 1st-order minimum of the other, the Rayleigh criterion). The resolution of a lens is determined by its numerical aperture. A lens acts as a lowpass spatial filter since its MTF (modulation transfer function) causes spatial frequencies above a certain cutoff frequency to be attenuated. The spot size d of a pickup (read) beam is defined as the half-intensity diameter (i.e., its full width at half maximum FWHM), or 0.61 x wavelength/NA. Tolerances for deviations are highly dependent on NA; for example, the disk thickness tolerance is proportional to NA^-4. For a CD, wavelength lambda is 780 nm (infrared or possibly deep red), NA is 0.45 (a compromise balancing needs for resolution versus need for tolerances), and spot size [i.e., width] is therefore c. 1 µm (1000 nm, measured at FWHM, or 2.1 µm using the Rayleigh criterion). For DVD, the wavelength is 635 or 650 nm (red-orangish), NA is 0.6, and spot size is c. 600 nm [presumably FWHM].
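
The spot sizes quoted above follow directly from the wavelength and NA. This is a minimal sketch: the 0.61 factor is the one given in the text, while the 1.22 factor (the usual diameter of the Airy disk out to its first minimum) is my assumption for how the 2.1 µm Rayleigh figure was obtained.

```python
def spot_fwhm_nm(wavelength_nm, na):
    """Half-intensity spot diameter, 0.61 * wavelength / NA, as given in the text."""
    return 0.61 * wavelength_nm / na

def airy_diameter_nm(wavelength_nm, na):
    """Diameter to the first Airy minimum, 1.22 * wavelength / NA (assumed formula)."""
    return 1.22 * wavelength_nm / na

cd_fwhm = spot_fwhm_nm(780, 0.45)
cd_airy = airy_diameter_nm(780, 0.45)
dvd_fwhm_650 = spot_fwhm_nm(650, 0.60)
dvd_fwhm_635 = spot_fwhm_nm(635, 0.60)

print(f"CD  FWHM spot: {cd_fwhm:.0f} nm, Airy diameter: {cd_airy:.0f} nm")
print(f"DVD FWHM spot: {dvd_fwhm_635:.0f} to {dvd_fwhm_650:.0f} nm")
```

With these formulas the DVD spot comes out nearer 650 nm than 600 nm, so the "c. 600 nm" figure should be read as approximate.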

NA and wavelength also determine track pitch (average distance between tracks: 1600 nm for CD and 740 nm for DVD), cutoff frequency, etc. and ultimately disk playing time. The spatial cutoff frequency f_co (= 2·NA/wavelength) for CDs is 1.15 line pairs per µm, and more closely spaced lines cannot be resolved. In CDs, the shortest allowable length for a pit or a land (area between pits) is 0.833 µm (833 nm; thus the maximal spatial frequency that might be recorded is 0.6 lp per µm). Assuming a track velocity of 1.2 m/s, the cutoff temporal frequency for CDs is 2v·NA/wavelength or 1.38 MHz [it is unclear to me how this correlates with the stated channel bit rate of 4.3218 Mbps].
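
The cutoff figures can be reproduced as below. The last two lines are my own reading rather than something stated in the text: since the shortest pit and the shortest land each span 3 channel bit periods, one full cycle of the densest recorded pattern spans 6 channel bits, so the highest signal frequency is the channel bit rate divided by 6, which would sit comfortably below the optical cutoff.

```python
# CD pickup figures quoted in the text
NA = 0.45
WAVELENGTH_UM = 0.78            # 780 nm
TRACK_VELOCITY_M_S = 1.2
MIN_PIT_UM = 0.833              # shortest pit or land (3T)
CHANNEL_BIT_RATE_HZ = 4.3218e6

# Optical cutoff, spatial and temporal
spatial_cutoff_lp_per_um = 2 * NA / WAVELENGTH_UM
temporal_cutoff_hz = spatial_cutoff_lp_per_um * 1e6 * TRACK_VELOCITY_M_S

# One full cycle of the densest pattern is a shortest pit plus a shortest land.
max_recorded_lp_per_um = 1 / (2 * MIN_PIT_UM)
max_recorded_hz = max_recorded_lp_per_um * 1e6 * TRACK_VELOCITY_M_S

# Same quantity from the channel clock: 3T pit + 3T land = 6 channel bits.
highest_signal_hz = CHANNEL_BIT_RATE_HZ / 6

print(f"spatial cutoff:  {spatial_cutoff_lp_per_um:.2f} lp/um")
print(f"temporal cutoff: {temporal_cutoff_hz / 1e6:.2f} MHz")
print(f"densest pattern: {max_recorded_lp_per_um:.2f} lp/um, {max_recorded_hz / 1e3:.0f} kHz")
print(f"bit rate / 6:    {highest_signal_hz / 1e3:.0f} kHz")
```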

A light beam propagating along the transmission axis (the Z-axis direction) can be polarized or unpolarized (the electrical field component is usually described, but the concepts also apply to the magnetic field component). In linear or plane polarization, E varies in only one plane [the plane defined in the XYZ space projects as a line in the XY plane] and the Ex and Ey component oscillations are therefore in phase.

The property of birefringence is due to unequal index of refraction (and velocity of propagation) in different planes; such a material is anisotropic. The optic axis is the direction along which no birefringence occurs. Light passing through a birefringent material is doubly refracted and emerges as the ordinary ray and a displaced extraordinary ray. Birefringence can be an undesired property of the plastic of disk substrates and must be minimized, but it can be put to good use in optical pickups. Linearly polarized light passing through a quarter wave plate QWP made from a birefringent material causes exiting light (arising from the ordinary and extraordinary rays) to have Ex and Ey components that are 90 degrees out of phase ("phase quadrature"), which produces circular or helical polarization. The QWP is designed for a specified wavelength {p. 228}. The E vector rotates in a helix about the Z-axis. Phase differences other than 90 degrees produce elliptical polarization. Linearly polarized light striking a QWP will be circularly or elliptically polarized depending on the angle between the plane of polarization of the incident light and the orientation of the optical axis of the QWP. The optic axis of a QWP parallels its faces, and if the incident beam linear polarization plane makes an angle of 45 degrees with the optic axis of the QWP, circular polarization results.

Most optical media use a spiral groove (some systems other than CD and DVD use concentric tracks). There is no contact with the media by the pickup system. The medium must present 2 states for binary data, represented as phase changes, polarization changes, or altered intensity of reflected light. Pits diffract the beam, reducing its intensity. The disk system requires sophisticated servo mechanism to maintain position. There must be focus tolerance of c. 1 µm.

Optical disk performance: The recording bit density of optical media is typically about 100 times that of the same size magnetic media. They have longer life expectancy, are less susceptible to heat, humidity, magnetic fields, and head crashes. A CD should last for 100 years [but the life expectancy of CD-R is less well established]. CDs and DVDs are 120 mm in diameter. CDs hold 650 MBytes of formatted data. Single-layer single-sided DVDs hold 4.7 Gbytes, transfer data at 10 Mbps. Raw DVD error rate is c. 10^-6; uncorrectable error rate is 10^-13 (compared to 10^-6 for a floppy diskette).

Playback-only commercially-recorded CDs and DVDs:
Commercially distributed playback-only CDs and DVDs have permanently formed pits cut in the metalized reflective data layer to a depth calculated to optimally decrease the intensity of the laser light (one quarter wavelength). The data layer is sandwiched between the transparent plastic substrate and a thin top protective layer. Because surface scratches are out-of-focus with respect to the data layer, they have less effect than if they were in the data layer. CD and DVD act as reflective phase gratings similar to a diffraction grating. Light diffracted by the grating consists of a single zero-order and multiple first-order beams, which partly overlap. Light returning from the pit is a half wavelength out of phase wrt light from the adjacent land and therefore destructively interferes with it and diminishes the intensity of the returning light. Rays diffracted by the pits also at least partially fall outside the lens aperture and thereby also reduce the intensity of light gathered by the lens.

Write-Once WO Optical Storage:
WO systems, including CD-R and DVD-R, use a variety of ways to produce the necessary alterations in effective reflectivity {p. 232}. CD-Rs use a dye-polymer which is absorptive at the wavelength of the recording laser. Heating by the laser produces a depression in the dye layer and a deformation of the reflective layer beneath (i.e., further away from the laser). Other WO systems use an irreversible phase change (crystalline high reflectivity changing to amorphous low reflectivity), bubble creation, or a texture change. Pit ablation requires a laser power of c. 10 mW.

Magneto-optical MO Erasable Optical Storage:
This includes magneto-optical MO recording {p. 234}. With this technique, magnetization and demagnetization of a vertically oriented magnetic medium (which provides high particle density) by a weak magnetic field are assisted by laser heating to the Curie temperature. This greatly reduces the local magnetic coercivity (a measure of the magnetic field strength required to induce a change in magnetization), thus allowing only the heated area to be magnetized or demagnetized. Reading is done with a lower power laser (which does not heat to the Curie temperature). It utilizes the Kerr effect, in which polarized light is slightly rotated by a magnetic field. The angle of rotation of a read laser beam is monitored. MO disks can be rewritten millions of times and should be more stable and reliable than conventional magnetic media... The MiniDisc format uses a small MO disk providing 74 minutes of recording time on a 2.5 inch disk.

Phase change and Dye-polymer Erasable Optical Storage:
CD-RW utilizes phase-changes between amorphous (low reflectivity) and crystal (high reflectivity). Depending on the intensity of the recording laser beam, the media is either melted and rapidly cooled to the amorphous state or more gradually heated to below the melting point and cooled to the crystalline state... Dye-polymer methods are also in use...

Optical storage for Digital audio for Cinema:
Old-style purely analog cinema consists of optical frames, a dual row of optical (no longer magnetic) stereo variable area SVA analog sound adjacent to the image frames, and an enclosing row of sprocket holes. Digital audio signal can be added in several ways on film to allow for a variety of playback capabilities while preserving the analog SVA for compatibility.

Dolby Digital AC-3 is positioned between the perforation sprocket holes and has 6 channels (left, right, center, left surround, right surround, and subwoofer or Low Frequency Effects LFE 3-125 Hz).

In the Digital Theater System DTS system, DTS timecode tracks are placed on the film adjacent to the analog stereo SVA, and a separate external CD-ROM provides the audio synchronized to the timecode. It provides "5.1" data-compressed audio channels (left, right, center, left surround, right surround, and subwoofer for 20-80 Hz). The compression algorithm is apt-X100...

In the Sony Dynamic Digital Sound SDDS system, data is placed outside the rows of sprocket holes. It encodes 8 channels (left, left center, right, right center, center, left surround, right surround, and subwoofer for 20-80 Hz). The ATRAC compression algorithm is used.

Chapter 9: The Compact Disk
[Details of CD-Audio and Other CD Formats]

The CD was invented jointly by Philips Corporation in the Netherlands (optical disk technology) and Sony (error correction techniques)—they proposed the format in 1980, introduced it into Europe and Japan in 1982 and the USA in 1983. This originally introduced format was CD-Audio (CD-A, CD-DA or "Red Book" CD, as specified in the International Electrotechnical Commission IEC 908 standard available from the American National Standards Institute ANSI). Additional CD formats were subsequently introduced and often named for the color of their book: CD-ROM 1984, CD-i 1986, CD-WO [CD-R] 1988, Video-CD 1994, and CD-RW 1996. The following comments apply to commercially manufactured CD-A or to CD formats in general, unless otherwise indicated.

Physical dimensions and specifications:
A CD disk is 120 mm in diameter (60 mm radius), with a hole 15 mm diameter (7.5 mm radius) and 1.2 mm thick {p. 247}. Starting at the hole edge at 7.5 mm radius, there is a clamping area extending from 7.5 to 23 mm radius [this is partly clear and partly metalized, and may include a visible inscription stamped by the manufacturer], then a 2 mm wide lead-in area extending from radius 23 to 25 mm (containing non-audio information used to control the player—digital silence in the main channel plus the Table of Contents TOC in the subcode's Q-channel), then the 33 or 33.5 mm wide data area (program area) extending from radius 25 to a maximum of c. 58, a lead-out area (which contains digital silence or zero data) of width 0.5 - 1 mm from radius starting maximally at c. 58 mm, and finally a c. 1 mm unused area extending to the outer edge. With CD-R {p. 282 and below}, the relevant inner and outer radii are as follows: clamping area 15 - 22.35 mm, PCA & PMA 22.35 - 23 mm, Lead-In 23 - 25 mm, Recorded area 25 - 58 mm maximum, Lead out ends at 59 mm maximum.

In commercial pressed CD-A disks, pits are impressed by injection molding into the top surface of the plastic polycarbonate substrate (which has a high index of refraction of 1.55). This data layer is then coated with a 50-100 nm metalized layer to provide reflectivity, a 1000-3000 nm plastic or lacquer protective layer is added, and finally the c. 5000 nm label is printed. Pits appear as bumps to the laser pickup beam. Pits are c. 600 nm wide and a typical disk might hold 2 billion of them. Pit depth is approximately one quarter of the wavelength of the pickup light beam in the substrate, thus c. 500/4 = 125 nm (ranges from c. 110 - 150 nm). Pit (and land) lengths vary from a minimum of 830 nm - 970 nm (representing 3T or 3 times the minimum channel bit period) to a maximum of 3000 to 3600 nm (representing 11T), the actual length depending on the track linear velocity which can vary from 1.2 to 1.4 m/s. Pits are placed along a spiral track pattern (there is no groove) starting on the inside diameter of the disk and spiralling outward. The data spiral is about 3.6 miles (5800 m) long, extends across as much as 35 - 35.5 mm of signal surface radially over 22,188 revolutions, and can extend to within [2 - ] 3 mm of the outer edge of the CD. (Errors tend to be greatest in the outer portions, and since not all recordings occupy the full disk, recording was designed to begin from the inside). Adjacent "tracks" are spaced 1600 nm apart center to center. (See p. 270-275 to learn more specifics about manufacturing and replicating audio CDs).
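
Several of the dimensions in this paragraph can be cross-checked from the wavelength, refractive index, track pitch, and radii. This is a rough sketch only: it treats the spiral as a stack of concentric rings, and it assumes the signal surface runs from the 23 mm lead-in radius to 58.5 mm (a 35.5 mm span). The helper names are hypothetical.

```python
import math

WAVELENGTH_AIR_NM = 780
N_POLYCARBONATE = 1.55
TRACK_PITCH_MM = 0.0016  # 1600 nm

# Quarter-wave pit depth uses the wavelength inside the substrate.
wavelength_in_substrate_nm = WAVELENGTH_AIR_NM / N_POLYCARBONATE
pit_depth_nm = wavelength_in_substrate_nm / 4

def revolutions(r_inner_mm, r_outer_mm):
    """Number of turns of the spiral between two radii."""
    return (r_outer_mm - r_inner_mm) / TRACK_PITCH_MM

def spiral_length_m(r_inner_mm, r_outer_mm):
    """Approximate spiral length: annulus area divided by track pitch."""
    area_mm2 = math.pi * (r_outer_mm**2 - r_inner_mm**2)
    return area_mm2 / TRACK_PITCH_MM / 1000

print(f"wavelength in substrate: {wavelength_in_substrate_nm:.0f} nm")
print(f"quarter-wave pit depth:  {pit_depth_nm:.0f} nm")
print(f"revolutions:             {revolutions(23, 58.5):.0f}")
print(f"spiral length:           {spiral_length_m(23, 58.5):.0f} m")
```

Under these assumptions the spiral comes out at roughly 5.7 km, in the same range as the 5800 m quoted above.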

The optical pickup includes a laser (typically a 0.5 mW AlGaAs laser diode, which irradiates from below through the clear plastic substrate) and a lens system to detect reflected light. Near-infrared laser light of 770 to 830 nm wavelength is used, typically 780 nm wavelength in air (c. 500 nm within the polycarbonate substrate). The beam width is c. 800,000 nm at the bottom edge of the substrate (thus rendering the beam relatively insensitive to small surface imperfections) but becomes focused by refraction to c. 1000 nm width FWHM at the level of the pits, which is thus a little larger than the pit width of 600 nm. CD-A pits are read at a constant linear velocity CLV (thus the angular velocity of a CD must change continuously as the track diameter increases, from c. 500 rpm at the inside to c. 200 at the outside). A particular CD must use a fixed CLV, typically 1.3 m/s, but different disks can use from 1.2 to 1.4 m/s (lower CLV provides higher density and longer playing times). In any case, the channel bit rate must stay constant for CD-A. Light encountering a pit is reflected with a phase difference of c. 1/2 wavelength and therefore destructively interferes with light reflecting from a land, causing a decrease in reflected light intensity. The overall intensity of light returning from a long pit is c. 25%, which includes the reduction due to diffraction of some of the light outside the lens aperture.
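
The rotation rates quoted for CLV playback follow from the linear velocity and the track radius. A minimal sketch, assuming the program area runs from 25 mm to 58 mm radius as described earlier (the function name is my own):

```python
import math

def rpm_at_radius(linear_velocity_m_s, radius_mm):
    """Rotation rate needed to keep a constant linear velocity at a given radius."""
    circumference_m = 2 * math.pi * radius_mm / 1000
    return linear_velocity_m_s / circumference_m * 60

inner_rpm = rpm_at_radius(1.3, 25)
outer_rpm = rpm_at_radius(1.3, 58)

print(f"inner edge of program area: {inner_rpm:.0f} rpm")
print(f"outer edge of program area: {outer_rpm:.0f} rpm")
```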

Logical and Data Specifications: CD-A stores two 16-bit data words sampled at 44.1 kHz (PCM data), for an audio data rate of 1.41 million bit/sec. Additional error correction, synchronization, and modulation bring the channel bit rate (actual rate of bits recorded) to 4.3218 Mbps. The maximum standard capacity for CD-A is 74 minutes 33 seconds or 6.3 Gbits audio data or 783 million bytes audio data (maximum playing time can be increased somewhat to c. 80 minutes by modifying the standards slightly). Each pit edge (i.e., all transitions from land to pit and pit to land) represents a binary 1—the intervening lands and the internal areas of the pits represent 0's.
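
The audio data rate and capacity figures can be recomputed as follows. This is a sketch using only the sampling parameters stated above; the function name is hypothetical. Note that 783 million bytes corresponds to almost exactly 74 minutes, while 74 minutes 33 seconds works out to about 789 million bytes, so the figures in the paragraph appear to be rounded slightly differently.

```python
SAMPLE_RATE_HZ = 44_100
CHANNELS = 2
BITS_PER_SAMPLE = 16

audio_rate_bps = SAMPLE_RATE_HZ * CHANNELS * BITS_PER_SAMPLE
bytes_per_second = audio_rate_bps // 8

def audio_bytes(minutes, seconds=0):
    """Bytes of PCM audio for a given playing time."""
    return (minutes * 60 + seconds) * bytes_per_second

print(f"audio rate: {audio_rate_bps / 1e6:.4f} Mbps")
print(f"74:00 -> {audio_bytes(74) / 1e6:.1f} million bytes")
print(f"74:33 -> {audio_bytes(74, 33) / 1e6:.1f} million bytes, "
      f"{audio_bytes(74, 33) * 8 / 1e9:.2f} Gbits")
```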

Music Performance: Standard CD-A (PCM 16-bit per stereo channel 44.1 kHz sampling) yields in typical players a frequency response from 5 Hz to 20 kHz of ±0.2 dB. Dynamic range exceeds 100 dB, signal/noise ratio (S/N) exceeds 100 dB, and channel separation exceeds 100 dB at 1 kHz. Harmonic distortion at 1 kHz is < 0.002%, and wow and flutter are essentially unmeasurable. With digital filtering, phase shifts are less than 0.5 degrees. Linearity is within 0.5 dB at -90 dB. The performance of CD players can be measured by the AES17 specification.

Frame Encoding: The CD-A PCM data stream is encoded into frames which contain additional information. Six 32-bit stereo PCM samples (i.e., 16 bits for each of 2 audio channels, totalling 192 audio data bits or 24 8-bit words) are encoded by Cross-Interleave Reed-Solomon Code CIRC and combined with the resulting 64 CIRC parity bits (8 words), 8 subcode bits (P, Q, R, S, T, U, V, W), and 24 synchronization bits (used to indicate the start of a frame) to produce a 288-bit frame. This frame is presented for EFM modulation (all but the synch word are EFM modulated). As a result of interleaving, the 6 audio samples included in a frame have arisen from different (discontinuous) times as discussed above. The P and Q bits of the subcode data specify the number of selections or "tracks" on the disk (99 maximum), their beginning and ending points or times, and index points (up to 100 within each track or selection). The frame rate for an audio CD is 44,100/6 = 7350 frames/sec. The subcode block rate (98 frames per block) is 7350/98 = 75 blocks/sec (see below).

Modulation: CDs use Eight-to-Fourteen Modulation EFM, which provides a theoretical density ratio of 1.41 and in actual practice about a 25% greater data storage capacity [compared to what?]. To each 14-bit EFM output word are added 3 merging bits (which maintain proper run length, suppress DC content, and aid clock synchronization), so 8 frame bits become 17 channel bits (excluding the 24 synch bits). More precisely, 288-bit frames are modulated into 588 channel bits as follows: The fixed synch word of 24 bits is unchanged, the 8-bit subcode word becomes a 14-bit EFM word, the 24 8-bit audio data words (192 bits total) become 14x24=336 EFM bits, the 8 parity 8-bit words become 8x14=112 EFM bits, and 34x3=102 merging bits (required to merge the 33 14-bit words and the synch word) are added. This EFM code is modulated to NRZ code (a pulse for each 1 in the bitstream) and this NRZ is converted to NRZI (in which there is a transition for each NRZ pulse) {p. 252}. The pits and lands are created from these NRZI transitions, and by design are no shorter than 3 and no longer than 11 channel bit periods T. Each 16-bit audio sample has become 49 channel bits, so the original audio data rate of 1.4112 Mbps translates to a channel bit rate of 1.4112x49/16=4.3218 Mbps (this is the actual rate of channel bits recorded on the CD). EFM modulation makes the system relatively tolerant to jitter (50 nsec).
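
The bit accounting described above can be tallied in a few lines, and the NRZ to NRZI step is simple enough to sketch as well. The constants come from the text; the function name and the sample bit pattern are my own illustrations (the pattern merely obeys the rule of at least two 0s between 1s and is not claimed to be a real EFM codeword).

```python
SYNC_BITS = 24
SYMBOLS = 1 + 24 + 8  # subcode + audio + parity words, 8 bits each

# Frame before modulation: sync pattern plus 33 eight-bit words
frame_bits = SYNC_BITS + 8 * SYMBOLS

# After EFM: each word becomes 14 bits, and 3 merging bits follow
# each of the 33 words and the sync pattern (34 groups in all).
channel_bits = SYNC_BITS + 14 * SYMBOLS + 3 * (SYMBOLS + 1)

frames_per_second = 44_100 // 6  # six stereo samples per frame
channel_bit_rate = channel_bits * frames_per_second

def nrz_to_nrzi(bits):
    """Toggle the output level at every 1 and hold it at every 0."""
    level, levels = 0, []
    for bit in bits:
        if bit == "1":
            level ^= 1
        levels.append(level)
    return levels

print(frame_bits, channel_bits, frames_per_second, channel_bit_rate)
print(nrz_to_nrzi("1001000100"))
```

In the NRZI output each run of identical levels corresponds to one pit or one land, so the spacing between 1s in the channel bitstream sets the pit and land lengths.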

Subcode: For CD-A, the P bit or channel is an on/off flag which indicates the start and end of each track as well as the lead-in and lead-out areas and was intended for simple audio players which did not have full Q-channel decoding. Q-channel data contains the track number and index numbers, timecodes, the TOC (in the lead-in area), and other nonaudio data. Specifically, subcode Q-bit data from successive frames is pooled: 98 frames are pooled to make up a subcode block. In the usual audio CD Mode 1, the 98 resulting Q-bits in a subcode block encode the following:

The mode (1, 2, or 3) of the "Q-data" (which begins with the track number, below)
The number of audio channels (2 or 4)
Audio vs. data content flag
Copy protection flag
Preemphasis flag
Track number TNO (at least in the program area, but 0 in the lead-in area indicating the TOC, and AA in the lead-out)
Index value (X or point): This is 0 in intertrack pauses or pregaps, 1 as the first value within a track (with optional in-track index "points" up to 100) and 1 in the lead-out. Pregaps or delays are at least 2 seconds. In the TOC area, the TOC is assembled from the Index X and gives track numbers and their absolute starting times (points) in min:sec:frames, the times of a multiple disk set, and the number of the first and last tracks on the disk. The TOC is repeated continuously in the lead-in area.
Program time P-time elapsed within the track (in min:sec:frame—these counters also count down the remaining time during a pregap pause)
Absolute time A-time since the beginning of the disk (min:sec:frame, also shown as amin:asec:aframe)
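
The min:sec:frame timecodes listed above count subcode blocks at 75 per second, so converting between a timecode and an absolute block count is straightforward. The sketch below assumes that "frame" in min:sec:frame means one subcode block (1/75 sec); the function names are hypothetical.

```python
BLOCKS_PER_SECOND = 75  # subcode blocks ("frames" in min:sec:frame) per second

def msf_to_blocks(minutes, seconds, frames):
    """Convert a min:sec:frame timecode to a count of subcode blocks."""
    return (minutes * 60 + seconds) * BLOCKS_PER_SECOND + frames

def blocks_to_msf(blocks):
    """Convert a count of subcode blocks back to (min, sec, frame)."""
    minutes, remainder = divmod(blocks, 60 * BLOCKS_PER_SECOND)
    seconds, frames = divmod(remainder, BLOCKS_PER_SECOND)
    return minutes, seconds, frames

print(msf_to_blocks(0, 2, 0))      # length of a minimum 2-second pregap
print(msf_to_blocks(74, 33, 0))    # blocks in 74 min 33 sec
print(blocks_to_msf(335_475))
```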

You might wish to search Google etc. for more information on this topic (such as was formerly found at http://www.disctronics.co.uk/). This would include using the originally unused subcode storage areas (R-W subchannels) for inclusion of text and graphics—see the textbook regarding the other 2 modes (which can provide the catalog number such as the UPC/EAN codes and the International Standard Recording Code ISRC resp.).

Player Optical Design: Optical pickups (for reading the CD) include the "three-beam" and "one-beam" designs. The three-beam design is briefly summarized as follows. The beam from the laser diode passes through a diffraction grating, which divides the beam into a center beam (for reading data) and side beams (the first order of which are used for tracking; these 3 major beam components comprise the "3" beams). This light passes through a collimator lens to a polarization beam splitter PBS (which is at this point transparent to this beam), then a QWP, a mirror, and an objective lens with NA = 0.45 which is positioned to maintain up/down focus and lateral tracking on the data layer. Reflected/diffracted light returning from the disk passes again through the objective lens and the QWP so that it is now polarized at 90 degrees relative to the original beam, and this allows it to be reflected into collective and cylindrical lenses rather than be transmitted back through the PBS. The light then passes into the four-quadrant detection photodiode. Signals from the quadrants of this photodiode are summed and compared to compute the correction signals for the servo-driven autofocus system (up/down) as well as to provide the desired modulated signal. The cylindrical lens intentionally introduces astigmatism into the system: the beam produces uneven quadrant illumination except at the precisely desired up/down focal distance. The autotracking system utilizes the first order side lobes of the central beam to provide tracking information. The side lobes return equal signal to the left and the right quadrants of the photodiode when well centered and therefore falling equally on each side of the pits, and asymmetrical light intensities when off-track. The pickup mechanism is mounted on a sled which moves radially (one-beam systems are mounted on a pivoting arm).

Player Electrical Design: The photodiode signal is applied to a phase-locked loop to recover the timebase information. The EFM code is recovered from the radiofrequency RF signal from the photodiode by detecting the NRZI signal, and converting it to NRZ and thus to EFM. Synch words are identified and merging bits discarded. The EFM code is demodulated with a lookup table and the data stored in a buffer memory. Clocking data is used to ensure a uniform data rate even if the incoming data has an irregular rate due to fluctuations in rotation rate, etc. The CLV of the CD is varied to keep the amount of data in the buffer at an appropriate level. The demodulated signal is sent to the CIRC decoder. Error correction is performed where possible and if necessary interpolation and error concealment added. A digital antialiasing filter, oversampling typically at 8x, and sigma-delta converter are commonly used to optimize D/A conversion. The subcode data is interpreted to extract the start of a track, etc. (see above).

Overview of Other CD Formats: These comprise a family with complex interrelationships {p. 276} and include CD-ROM (Yellow book), CD-ROM/XA (extended architecture: audio, video, graphics and computer data), CD-i (interactive, like CD-ROM), Photo CD, CD-R (recordable, Orange Book), CD-RW (Rewritable), Video CD (white book), enhanced music CD (blue book), CD-Plus (multi-session or CD-Extra, blue book), Hybrid CD-R (multisession), CD+G (audio with some graphics), CD-MIDI (audio plus MIDI), CD-i Bridge (green book), Mixed mode (audio plus CD-ROM), CD-MO (magneto-optical, orange book) etc.

CD-ROM (Yellow book): This format, introduced in 1983, was derived from the CD-A standard, but defines a general data storage format not tied to a specific application such as audio and has a different data format. As with CD-A, 98 frames (each storing 24 data bytes) are grouped in one data block having 24x98=2352 overall data bytes. Mode-1 CD-ROM holds up to 682 million bytes of actual user data (333,000 2048-byte blocks, or 650 MB where 1 MB = 2²⁰ bytes). Therefore, CD-ROM stores a smaller amount of actual data than CD-A (783 million audio bytes = 746 MB). With mode 1 (which provides the greatest error correction), the overall block size of 2352 data bytes includes 2048 bytes of actual user data, synch pattern (12 bytes), time and address headers (3 bytes), mode (1 byte, 1 or 2), extended error detection EDC (4 bytes) and ECC (276 bytes), and 8 bytes spare {p. 277}. Addresses are stored as playing times as in CD-A (min:sec:block). Mode 1 has extended error correction compared to CD-A, providing an uncorrectable error rate of 1E-15. The preferred standardized file structure of CD-ROM (recommended by the "High Sierra Group") was adopted by ISO (International Organization for Standardization) as ISO/DIS 9660. Level 1 ISO 9660 is limited to DOS style 8.3 filenames but Level 2 ISO 9660 allows long file names. ISO 9660 extensions for other operating systems include the Rock Ridge extension for Unix, Joliet (a Microsoft multiplatform standard), Apple, HFS (intrinsic Apple Hierarchical Filing System), etc. CD-ROM players typically include a D/A converter to allow playing of CD-audio. Some but not all are capable of interpreting multisession recordings (see below).
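
The Mode 1 block layout and capacity figures above can be verified by simple addition. This sketch uses only the byte counts given in the paragraph; the dictionary keys are my own labels.

```python
# Mode 1 block layout in bytes, as listed in the paragraph above
MODE1_LAYOUT = {
    "synch pattern": 12,
    "time and address header": 3,
    "mode": 1,
    "user data": 2048,
    "EDC": 4,
    "spare": 8,
    "ECC": 276,
}

block_bytes = sum(MODE1_LAYOUT.values())
frames_times_payload = 98 * 24  # 98 frames of 24 data bytes each

BLOCKS = 333_000
user_bytes = BLOCKS * MODE1_LAYOUT["user data"]
user_mb = user_bytes / 2**20
overhead_fraction = 1 - MODE1_LAYOUT["user data"] / block_bytes

print(f"block size: {block_bytes} bytes (98 x 24 = {frames_times_payload})")
print(f"user data:  {user_bytes / 1e6:.0f} million bytes = {user_mb:.1f} MB")
print(f"overhead:   {overhead_fraction:.1%} of each block")
```

The roughly 13% of each block spent on headers and extra error correction accounts for the gap between the CD-ROM and CD-A capacities.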

CD-ROM/XA (extended architecture, a yellow book extension defined in the white book) defines a single track format allowing data, compressed audio, compressed video, and images. The Photo CD and Video CD are subtypes of this...{p. 278}. Not all CD-ROM drives can read CD-ROM/XA.

Hybrid/Mixed CD formats: A confusing array of hybrid variants exist and seem to have inconsistent definitions. Toshiba defines Mixed Mode CDs as containing multimedia (CD-ROM) data in TRACK 1 and audio in TRACKS 2 - 99. They state "This format requires the audio consumer to manually avoid TRACK 1 during playback and is therefore not considered a viable enhanced CD format." To counter this problem, Pregap or Track zero CDs "place the multimedia CD-ROM data in an expanded 'pregap' area of the disc prior to TRACK 1, with the CD-A audio placed in its normal location starting at TRACK 1... This is currently the dominant format for enhanced CDs because it requires no special hardware or software drivers" [i.e., no multisession capability]. A Stamped Multisession CD contains " two discrete 'sessions' or areas of the disc that are completely separate from each other. The audio 'session' is placed on the innermost portion of the disc and the multimedia 'session' occupies the outermost area of the same disc. CDs in this format require special 'multisession' capable CD-ROM drives and software in order to access the multimedia tracks and suffer reduced performance compared to traditional CD-ROMs due to the placement of the data near the outside edge of the disc." They also state "Enhanced CD is a generic term for compact discs which contain music and multimedia. Enhanced CDs usually contain primarily music tracks starting at track 1 and have a minority portion of the disc dedicated to multimedia." In addition "CD PLUS is a term used by SONY and PHILIPS to describe an enhanced CD that adheres to the BLUE BOOK specification". Pohlmann and other sources state a CD Plus (CD Extra) is a "stamped multisession" CD having only 2 sessions (each of the 2 sessions has a separate lead-in TOC and a lead-out area). This specific configuration is defined in the blue book. 
The first session has redbook CD-audio, allowing its audio to be played on most CD-A players, while the remaining session tracks (often in CD-ROM/XA format) are accessible by a yellow-book multi-session capable CD-ROM reader and ignored by the CD-A player.

CD-Recordable CD-R (CD-WO): This is defined in Orange Book Part II. Up to c. 74 minutes CD-A audio data can be recorded (using a 1.2 m/s CLV), and c. 650 MB CD-ROM data can be stored in the typical sized CD-R. CD-Rs recorded with CD-A ("CD-R-DA") can be played on most Red Book players. However, they differ in utilizing a manufactured pregroove track on which reflectivity changes are recorded in a dye layer. The 600 nm wide spiral pregroove has standard 1600 nm pitch and is imprinted with a timing wobble of ± 30 nm radial excursion at a frequency of 22.05 kHz. The 22.05 kHz wobble is further frequency modulated at 1 kHz, and this modulation provides the basis for an absolute clock timing signal (absolute time in pregroove ATIP) which is used to control the CLV.

The System Use Area (SUA) is not found in CD-A and is positioned at a smaller radius than the lead-in area. It includes an optimal power calibration area PCA or OPCA which is used prior to recording to calibrate laser intensity. The PCA is positioned at 22.35 mm radius (c. -36 seconds ATIP before the normal start of the lead-in at 23 mm). The program memory area PMA is just outside the PCA, and is used to store a temporary TOC until the disk is finalized. These 2 areas are at smaller radii than the usual 23 mm location of the Lead-In (where TOC is stored), so are not seen by standard CD-A players. Before each recording session, the laser uses the OPCA to calibrate intensity—this calibration can be repeated no more than 99 times and often less. The number of calibrations is recorded in the PCA, each as one ATIP frame.

The Information Area includes the lead-in, program area, and lead-out (as in CD-A). When the disk is "closed", the temporary TOC in the PMA is written to the TOC in the lead-in and the lead-out is also formed. The Orange Book defines a multisession CD-R. A session is defined as a lead-in, data area, and lead-out. In order to keep a disk "open" for further session recording, when a session is closed, another must be opened (the textbook {p. 285} shows that the TOC of a session contains a pointer to the start of the program area of the next session and that a pointer to the start time of the outermost lead-out area must be recorded in or near the first TOC). In Disk-At-Once DAO, the lead-in, data, and lead-out are all recorded at one time without any Red Book imposed mandatory 2-second gap or interruption between possibly multiple tracks, and no information can subsequently be added to the CD-R. (Not all CD recorders are capable of DAO, and not all software supports DAO even on recorders that are.) Alternatively, Track-At-Once TAO recording can be done, allowing [multisession recording with] one or more tracks in a session but with intertrack gaps which may contain undesired silence or noise. [This is all rather confusing and other developments may also apply. I have had poor luck with multisession recording on my own CD-R. I am unclear if any configuration other than packet writing allows the recording of different tracks within a session at different times. The Universal Disk Format (CD-UDF) used for example by Adaptec DirectCD allows packet writing of small amounts of data to the CD-R without closing the track or session {p. 286}... ]. A partially recorded unclosed multisession disk can be read in a CD-ROM drive but is not usually readable by a CD-audio player. In multisession recording, each successive session has its own lead-in TOC, data, and lead-out, with about 13.2 MB overhead per additional session. PhotoCD is a type of multisession CD. 
Standalone consumer CD-R recorders intended for audio require a special CD with artist royalties built into the price and incorporating SCMS copy protection data.

CD-Rs use an organic dye layer sandwiched between the polycarbonate substrate and the 70-100 nm metalized reflective layer (silver or gold); this is covered with a thin protective layer which can easily be damaged. The dye layer is tuned to absorb 780 nm (wavelength in air) light for CD-R recording. (Some first generation DVD-ROM drives using 650 nm light won't receive sufficient reflected light to read CD-R data.) The dye layer heats to 250 degrees Centigrade from the recording laser (operating at 4-8 mW compared to 0.5 mW readout), melting and degrading the dye, which reduces its reflectivity on readout. The unwritten groove has a reflectivity of c. 73% (minimum 70%), whereas an 11T "pit" has reflectivity reduced to c. 25%. The dyes used can be affected by aging, heat, and light, especially UV. The dyes are either metal-stabilized cyanine (named for its cyan color, green or blue-green, the original standard, broad light sensitivity, greater compatibility but possibly greater degradation by light exposure) or phthalocyanine (yellow-green or gold, smaller power margin for the writing laser requiring 5.5 mW ± 0.5 mW, possibly greater longevity due to lower sensitivity to ordinary light, though there is not general agreement).

CD-RWs use a phase-change recording layer composed of an alloy of silver, indium, antimony, and tellurium. The alloy exhibits crystalline and amorphous phases depending on the heating rate... CD-RWs are not wavelength dependent like CD-R. Because of their low reflectivity (15% for the amorphous phase, 25% for the crystalline), they cannot be read in CD-audio players or in a CD-ROM drive that has not been specifically designed for them...

For additional information on CD-Rs and CD-RWs, click one or more of the following

McFadden CD FAQ (including Lifespan of CD-R data)
Information on CD-R testing [URL formerly http://www.cdpage.com/dstuff/BobDana296.html]

Super Audio CD SACD "Scarlet Book" {p. 295}: This was introduced by Sony and Philips in 1999. It is the same size as a CD and supports a variety of combinations of multichannel recording of proprietary one-bit "Direct Stream Digital DSD" coding in which audio is coded in one-bit pulse density form using sigma-delta modulation and noise shaping. It does not require decimation filtering, PCM quantization, or interpolation... It competes with and is incompatible with DVD-Audio, but SACD players can play CD using dual lasers. An overall sampling rate of 64 x 44.1 kHz = 2.8224 MHz of one-bit DSD is used, allowing subdivision into 32, 48, and 96 kHz sampling rates for each of 5 channels (in fact up to 8 channels including standard stereo are possible). A lossless coding algorithm called Direct Stream Transfer DST is used to compress the data c. 2:1. It has a high-frequency response extending to 100 kHz and a dynamic range of 120 dB. Noise in the audible range up to 20 kHz appears to be < 140 dB compared to a 1000 Hz test signal, but increases rapidly at higher frequencies, so a 50 kHz lowpass filter must typically be employed. 5.1 DSD layers and conventional CD PCM can be combined in one disk. Single layer capacity is 4.7 Gbytes, dual layer is 8.5. A visible and invisible watermark process is incorporated to prevent illegal copying, including Pit Signal Processing in which the pit width is varied... SACDs can contain text, graphics, and video etc.
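
The DSD rate quoted above follows directly from the 64x oversampling factor. A small sketch of the arithmetic, comparing it with 16-bit CD PCM per channel (the 16-bit word length is the standard CD-A value; the function names are only illustrative):

```python
CD_SAMPLE_RATE_HZ = 44_100
DSD_OVERSAMPLING = 64  # DSD samples at 64 x 44.1 kHz, one bit per sample


def dsd_bit_rate(channels=1):
    """Raw (uncompressed) DSD bit rate in bits per second."""
    return DSD_OVERSAMPLING * CD_SAMPLE_RATE_HZ * channels


def cd_pcm_bit_rate(channels=1, bits_per_sample=16):
    """Raw CD-style linear PCM bit rate in bits per second."""
    return CD_SAMPLE_RATE_HZ * bits_per_sample * channels


print(dsd_bit_rate())                      # 2822400
print(cd_pcm_bit_rate())                   # 705600
print(dsd_bit_rate() / cd_pcm_bit_rate())  # 4.0
```

Per channel, raw DSD therefore carries four times the bit rate of CD PCM, before any DST compression is applied.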

For information about other CD variants including CD-i, CD-MO including the MiniDisk see p. 287 - 301.

Chapter 10: Perceptual Coding
[Theory and Review of Specific Implementations of Psychoacoustic Data Reduction]

Unlike standard PCM, perceptual coding is highly nonlinear and makes no attempt to accurately preserve the constituent frequencies in a recorded signal (thus introducing high objective "distortion"). Instead, psychoacoustic principles are employed to identify how we perceive sound components and to identify components that are inaudible or masked and therefore excludable from coding. Perceptual coding attempts to reduce the storage requirement and/or data rate over LPCM through means of data reduction by eliminating "irrelevant" information.

The author reviews beat frequencies, sum tones, and pitch versus frequency. The Robinson-Dadson equal loudness curves (highest sensitivity for human hearing is at 1 - 5 kHz) show the ear's subjective response to sound pressure levels plotted against frequency. Differing apparent loudness contours are expressed as phons extending from the minimum audible field at 0 phon to 120 phons and higher. The anatomy and function of the cochlea and apparatus of hearing are discussed. The ear canal resonates at 3 kHz. The organ of Corti in the cochlea has critical bands of width c. 24.7 x (4.37F + 1) Hz (with F the center frequency in kHz), thus about 1/3 octave in the range 300 Hz - 20 kHz... The bark (named after Heinrich Barkhausen) measures perceptual frequency, so that 1 critical band has a width of 1 bark. It has been shown that a higher intensity sound (the masker) can mask (make inaudible) a nearby frequency (maskee) of a lower amplitude, thereby effectively raising or tenting the threshold of hearing (the phon curve) in the vicinity of the masker {p. 311}. The resulting curve is a masking curve and this type of masking is termed simultaneous amplitude masking. Complex tones produce greater masking than simple sine waves. The greater the amplitude of the masker, the wider the range of frequencies masked. Another type of masking is temporal masking, in which tone A sounded close in time to another tone B can mask it: the maskee can precede the masker (premasking or backward masking) or can follow it (post masking or forward masking). The combined effect of simultaneous amplitude and temporal masking is to create a raised ridge in the 3-D time-frequency contour of the masking curve plotted against time. Data reduction is also accomplished in multichannel recording by joint stereo coding, in which interchannel redundancy and masking are identified and reduced by coding the information common to the two channels only once. 
Data reduction of 4:1 to 6:1 can be "transparent" in some settings and up to 12:1 reduction is typical in audio applications. Playback requires a decoder reversing the steps of encoding. One potential problem with lossy perceptual coding is that cascaded coders can serially degrade the signal.
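
The critical-band width expression quoted above, 24.7 x (4.37F + 1), can be evaluated directly. This is a minimal sketch that assumes F is the center frequency in kHz and the result is in Hz; the function name is only illustrative.

```python
def critical_bandwidth_hz(center_khz):
    """Approximate critical-band width in Hz for a center frequency in kHz."""
    return 24.7 * (4.37 * center_khz + 1.0)


for f_khz in (0.1, 1.0, 4.0, 10.0):
    width = critical_bandwidth_hz(f_khz)
    print(f"{f_khz:5.1f} kHz -> {width:7.1f} Hz wide")
```

The width grows roughly in proportion to frequency at higher frequencies, which is why coders need finer frequency resolution at the low end of the spectrum.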

Some data reduction techniques, such as NICAM (Near Instantaneous Companded Audio Multiplex), use time-domain coding. Most perceptual coders (or codecs, a term including coding and decoding) operate in the frequency domain, often employing the Fast Fourier Transform FFT. The FFT is a computationally fast way of performing Fourier transformations to obtain frequency domain points equal in number to 1/2 the number of time samples. E.g., if 480 samples are made over 10 msec (at 48 kHz sampling frequency), 240 frequency points result from this sample with a maximum frequency of 24 kHz and a minimum frequency of 100 Hz, plus a dc point. Frequency domain encoders are divided (somewhat artificially) into subband coders (good time resolution, lower frequency resolution) and transform coders (good frequency resolution, lower time resolution).
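
The 480-sample example above can be checked numerically. The sketch below uses a naive discrete Fourier transform rather than an actual FFT (the results are the same, only slower to compute), and the 1 kHz test tone is a hypothetical input chosen for illustration.

```python
import cmath
import math


def dft_bin_layout(num_samples, sample_rate_hz):
    """Return (bin spacing in Hz, number of non-dc bins, top bin frequency)."""
    spacing = sample_rate_hz / num_samples
    num_bins = num_samples // 2
    return spacing, num_bins, spacing * num_bins


def dft_magnitude(samples, k):
    """Magnitude of DFT bin k, computed directly from the definition."""
    n = len(samples)
    return abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                   for i, x in enumerate(samples)))


fs, n = 48_000, 480  # 10 msec of samples at 48 kHz
tone = [math.sin(2 * math.pi * 1000 * i / fs) for i in range(n)]
peak_bin = max(range(1, n // 2 + 1), key=lambda k: dft_magnitude(tone, k))

print(dft_bin_layout(n, fs))  # (100.0, 240, 24000.0)
print(peak_bin)               # 10, i.e. 10 x 100 Hz = 1 kHz
```

The bin spacing equals the reciprocal of the block duration (1 / 10 msec = 100 Hz), which is the time-versus-frequency resolution tradeoff mentioned above.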

Subband coding was developed at Bell Labs in the 1980s. A digital filter bank (e.g., of 32 bands) divides the signal into bandlimited channels to approximate the critical bands. Each subband is coded independently... {p. 317} The energy content in each band is analyzed to determine which bands contain audible information; those which do not contain audible information are not encoded, nor are tones masked by adjacent tones. Audible subbands are quantized according to a priority schedule, allocating a number of bits according to the relative audibility of each subband (i.e., its signal to mask ratio SMR). Note that if many subbands contain audible signal (as with a complex sound such as a symphony orchestra), it is possible that the number of available bits will be exhausted if the total bit rate is fixed and that the encoding will be correspondingly suboptimal.
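
The priority-based allocation described above can be illustrated with a greatly simplified sketch. It is not the allocation procedure of any particular coder: it assumes each added quantizer bit lowers the noise in a subband by about 6 dB, and it repeatedly gives one bit to the subband whose noise is furthest above its masking threshold. The names `allocate_bits` and `DB_PER_BIT`, and the sample SMR values, are hypothetical.

```python
DB_PER_BIT = 6.02  # assumed SNR improvement per quantizer bit


def allocate_bits(smr_db, bit_pool, max_bits=15):
    """Greedy allocation: feed the subband with the worst noise-to-mask ratio."""
    bits = [0] * len(smr_db)
    while bit_pool > 0:
        # Noise-to-mask ratio of each subband under the current allocation.
        nmr = [smr - DB_PER_BIT * b for smr, b in zip(smr_db, bits)]
        candidates = [i for i in range(len(bits))
                      if bits[i] < max_bits and nmr[i] > 0]
        if not candidates:
            break  # all remaining noise is already masked
        worst = max(candidates, key=lambda i: nmr[i])
        bits[worst] += 1
        bit_pool -= 1
    return bits


smr = [20.0, 5.0, -3.0, 12.0]  # per-subband SMR in dB (hypothetical values)
print(allocate_bits(smr, bit_pool=8))  # [4, 1, 0, 2]
print(allocate_bits(smr, bit_pool=3))  # [2, 0, 0, 1]
```

With the smaller pool, the second subband receives no bits although its signal is above the mask, which is the bit-exhaustion situation noted above for complex material at a fixed bit rate.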

Transform coding {p. 323} uses the discrete Fourier transform (DFT) by means of the FFT, or the modified discrete cosine transform MDCT. These take a finite (discrete) block of time-domain samples and convert it to the frequency domain. The resulting coefficients (i.e., the amplitude coefficients of the constituent frequencies) are quantized according to the psychoacoustic model, eliminating masked components, etc. Instead of using frequency analysis as with subband coding, transform coding codes frequency spectral coefficients ("bin numbers"). By reducing entropy, TC improves efficiency. Longer sampling blocks improve spectral resolution but decrease temporal resolution (this can lead to a preecho preceding a loud transient). Successive blocks are typically overlapped by 50% to improve temporal resolution. E.g., a 512-point transform can produce 256 coefficients or bins, which are grouped into c. 32 bands as in critical band analysis. In adaptive transform coders, a bit allocation algorithm is used to optimize quantization noise to achieve a desired S/N ratio...
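
The 50% block overlap mentioned above amounts to advancing the analysis window by half a block each time. A minimal sketch of that segmentation step only (no windowing or transform is applied, and the function name is illustrative):

```python
def overlapped_blocks(samples, block_size=512):
    """Split samples into blocks of block_size that overlap by 50%."""
    hop = block_size // 2  # each new block starts half a block later
    return [samples[start:start + block_size]
            for start in range(0, len(samples) - block_size + 1, hop)]


signal = list(range(2048))  # stand-in for 2048 time-domain samples
blocks = overlapped_blocks(signal)

print(len(blocks))      # 7
print(blocks[1][0])     # 256, the second block starts half a block in
print(len(blocks[0]))   # 512
```

Each 512-sample block would then be transformed into 256 coefficients, as in the example above.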

Filter banks {p. 325} are employed. MPEG-1 Layers I and II use 32-band polyphase filters. MDCT can also be used to allow critical sampling... Hybrid filter banks use a cascade of different filter types (polyphase and MDCT). E.g., MPEG-1 Layer III uses PQMF (Polyphase Quadrature Mirror Filter) and MDCT filters...

MPEG-1 Layers I, II, and III: The Moving Picture Experts Group (MPEG) was formed by the ISO and IEC (International Electrotechnical Commission), which published the ISO/IEC International Standard 11172, finalized November 1992, now commonly referred to as MPEG-1. The audio standard is 11172-3 and is widely used in Video CD, CD-ROM, ISDN, video games, and digital broadcasting as well as downloadable music files. Only the decoders are defined in the standard. Maximal audio bit rate is 1.856 Mbps, and coding of PCM data at 32, 44.1, and 48 kHz is supported, with stereo bit rates ranging from 64 to 224 kbps/channel... Layer III's precursor techniques were MUSICAM (Masking-pattern Universal Subband Integrated Coding & Multiplexing) and ASPEC (Adaptive Spectral Perceptual Entropy Coding)... There are three layers, with progressively greater degrees of sophistication, complexity, and compression. Layer III can operate at the lowest bit rate of 64 kbps/channel and for a given bit rate, Layer III generally offers the greatest audio fidelity. (Layer II operates at 96-128 kbps/channel, Layer IIa is joint stereo and operates at 128 - 192 kbps for the 2 channels combined with only a small increase in complexity.) Layers II and III MPEG data is transmitted in frames {p. 330} containing 1152 samples. All layers use 32 subbands, and Layer III transforms each subband into 18 spectral coefficients by an MDCT, yielding a maximum of 576 coefficients, each representing a bandwidth of 41.67 Hz at 48 kHz sampling rate, and a time resolution of 24 msec. The number of bits per frame in Layer III can optionally be allowed to vary (variable bit rate), allowing flexibility to encode more demanding signals, but this method cannot be used in applications expecting a stream of constant bit rate. To reduce preecho for a transient, Layer III allows temporary switching to a shorter sampling interval with 8 msec resolution but only 192 spectral lines... 
Stereo recording modes in Layer III include normal (channels are independent), MS (mid/side) stereo, intensity stereo mode, and intensity & MS mode {p. 343}. Nonuniform dynamic quantization is used with Huffman and run length entropy compression. Testing has shown that Layer III at 2 x 128 = 256 kbps or 192 kbps joint stereo can convey typical (non-worst-case) stereo audio with no audible degradation ("transparency") compared to 16-bit linear PCM (which has a bit rate of 1.4 Mbps, thus a compression ratio of about 5.5:1 or 7.3:1). Multiple cascaded code/decode cycles can degrade the signal. An MPEG-1 Layer III file is called an MP3 file. For more information about MPEG-1, refer to the text {p. 327 - 343}.
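
The Layer III frame figures and compression ratios quoted above follow from short arithmetic. A sketch, using the exact CD PCM rate (44.1 kHz x 16 bits x 2 channels = 1411.2 kbps) in place of the rounded 1.4 Mbps; the function names are illustrative only.

```python
def layer3_frame_figures(sample_rate_hz=48_000, frame_samples=1152,
                         subbands=32, mdct_lines_per_subband=18):
    """Return (coefficients, bandwidth per coefficient in Hz, frame length in msec)."""
    coefficients = subbands * mdct_lines_per_subband
    line_bandwidth_hz = (sample_rate_hz / 2) / coefficients
    frame_msec = 1000 * frame_samples / sample_rate_hz
    return coefficients, line_bandwidth_hz, frame_msec


def compression_ratio(coded_kbps, sample_rate_hz=44_100, bits=16, channels=2):
    """Ratio of the linear PCM bit rate to the coded bit rate."""
    pcm_kbps = sample_rate_hz * bits * channels / 1000
    return pcm_kbps / coded_kbps


coeffs, bandwidth, frame_msec = layer3_frame_figures()
print(coeffs, round(bandwidth, 2), frame_msec)  # 576 41.67 24.0
print(round(compression_ratio(256), 2))         # 5.51
print(round(compression_ratio(192), 2))         # 7.35
```

The small differences from the ratios quoted in the text come only from rounding the PCM rate to 1.4 Mbps.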

MPEG-2 Audio: Introduced to allow multichannel recording (at 32, 44.1, and 48 kHz) compatible with MPEG-1 as well as lower sampling frequencies (LSF, 16 kHz etc) which are not MPEG-1 compatible. MPEG-2 is widely used in computer multimedia, CD-ROM, DVD-video, LANs, studio recording, ISDN transmission, digital audio broadcasting, multichannel TV, etc.

MPEG-2 Advanced Audio Coding AAC is a non-backward compatible (NBC) component of MPEG-2, specified in ISO/IEC 13818-7 April 1997 {p. 345}. It codes typically at 64 kbps per channel and performs better than MP3. Sampling rates are 8 - 96 kHz for up to 6 channels. It allows downmixing to blend multichannels into stereo. Intensity and MS coding modes are used for stereo. In testing, MPEG-2 AAC at 128 kbps outperformed 128 kbps MPEG-2 Layer III... Fraunhofer IIS-A states "Due to its high coding efficiency, AAC is a prime candidate for any digital broadcasting system."

AC-3 ("Dolby Digital" Audio Coding 3) Coder {p. 348}: This is used in DVD-video, the audio component of ATSC DTV, DBS, cable and satellite distribution, and commercial cinema and is proprietary to Dolby. It was preceded by Dolby's AC-1 and AC-2. It codes 1 to 6 channels. There is a dialog normalization capability allowing dialog level to be adjusted relative to other sound in cinema... In cinema, it provides 5.1 channels (Dolby Digital Surround EX adds a 6th center surround channel).

apt-X100 Coder: Used for DTS cinema audio {p. 241, 357}.

Evaluation of Perceptual Coding Performance: Objective testing can include use of a multitone test signal and determination of what portion of the noise rises near or above the "masking curve" (the psychoacoustically-defined curve defining the sound levels masked by the effect of the multitones). A noise-to-mask ratio NMR is computed, and positive values indicate audible noise artifact.
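
The sign convention of the NMR described above can be shown with a short sketch. It assumes that noise and masking-threshold levels are already available per frequency band in dB; how a real test set derives the masking curve is not covered here, and the function names and sample values are hypothetical.

```python
def noise_to_mask_ratio_db(noise_db, mask_db):
    """NMR in dB: noise level minus masking threshold, both in dB."""
    return noise_db - mask_db


def audible_bands(noise_levels_db, mask_levels_db):
    """Indices of bands where the noise rises above the masking curve."""
    return [i for i, (n, m) in enumerate(zip(noise_levels_db, mask_levels_db))
            if noise_to_mask_ratio_db(n, m) > 0]


noise = [10.0, 22.0, 15.0, 30.0]  # hypothetical per-band noise levels in dB
mask = [18.0, 20.0, 15.0, 24.0]   # hypothetical per-band masking thresholds in dB
print(audible_bands(noise, mask))  # [1, 3]
```

Bands 1 and 3 have a positive NMR, so their coding noise would be expected to be audible under this simplified model.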

However, perceptual coding must ultimately be evaluated with exhaustive repeated expert listening using worst-case sound samples containing complex rich content in the 1 - 5 kHz range, such as glockenspiel, castanets, triangle, harpsichord, tambourine, speech, trumpet, and bass guitar samples. Audible artifacts can include changes in timbre, bursts of noise, granular ambient sound, shifting of stereo image, preecho, tinkling, etc. 16-bit PCM is not an adequate comparison standard for current-generation perceptual coders, since many coders outperform this standard! The 5-point scale of the International Radio Consultative Committee (CCIR) defines an impairment scale ranging from 5 (Imperceptible) to 1 (Very Annoying) relative to the uncoded sound. Another guideline is that of the International Telecommunication Union-Radiocommunication Bureau (ITU-R) Recommendation BS.1116. This again stresses using a variety of test materials. Such testing by Soulodre for stereo codecs, using worst-case samples consisting of bass clarinet arpeggio, bowed double bass, harpsichord arpeggio, pitch pipe, and muted trumpet, and compared against a CD reference, found the following order of audio quality: MPEG-2 AAC > Lucent Technologies' Perceptual Audio Coder (PAC) {p. 569} > MPEG Layer III > Dolby Digital AC-3 > MPEG Layer II > ITIS.

Chapter 11: DVD
Chapter 12: The Minidisc
Chapter 13: Interconnection
Chapter 14: PC Audio
Chapter 15: Internet Audio
Chapter 16: Digital Radio and Television Broadcasting
Chapter 17: Digital Signal Processing
Chapter 18: Sigma-Delta Conversion and Noise Shaping

These chapters have not been summarized. For extensive information on DVD, see the DVD FAQ by Jim Taylor.
