Human hearing, psychoacoustics


This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Psycho-acoustics is the study of the various sound response and interpretation effects that happen along the source-ear-brain-perception path, particularly in the ear and brain.

There are various complex topics in (human) hearing. If you mostly skip this section, the concepts you should probably know are the varying sensitivity to different frequencies, masking and related effects, and the fact that practical psycho-acoustic models (used e.g. for sound compression) are mostly a fuzzy combination of various effects.


Some physiology

Inner and outer ear: Cochlea, basilar membrane, meatus, and more

Some summarizing notes

Results of said physiology, models

Perceptual loudness of frequencies

Physical reality can be described in terms of pressure or intensity, but human perception of the same sounds cannot.

The most important detail is probably the fact that we hear the same physical amplitudes at different frequencies as different intensities. For an extreme example, a pure 40Hz tone needs to be about 45dB louder than a pure 4000Hz tone to be perceived as roughly equally loud (note that 45dB corresponds to a power factor of roughly 30000).
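To make that dB arithmetic concrete: dB differences describe power ratios, so the ~45dB gap works out to a factor of roughly 30000. A minimal sketch (the function name is just for illustration):

```python
# dB differences describe power ratios: ratio = 10**(dB/10),
# so 45dB corresponds to about 31600x the power.
def db_to_power_ratio(db):
    """Convert a decibel difference to the corresponding power ratio."""
    return 10.0 ** (db / 10.0)
```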

Equal loudness contours

This difference is caused by the way our ears work. It has been measured several times, notably by Fletcher and Munson (1933), a little more accurately by Robinson and Dadson (1956), and more accurately since then (see e.g. ISO 226:2003 for details).

These tests used near-perfect listening conditions, middle-aged people, and perhaps most importantly, pure tones. This last detail limits the value of applying them directly to complex signals, noise, temporal psychoacoustics (pulses, listener fatigue), and such.


The test results are often viewed as a graph of equal loudness contours, which indicate the perceived difference when a (simple) sound at a particular dB(SPL) level changes frequency. A different way to read the contours is as the amount of amplitude change the ear applies to a tone at a given frequency and amplitude.


There are of course other effects on how you hear each frequency, some from the environment and physics (interference, absorption), some psychoacoustic (frequency masking, temporal masking, listener fatigue, reaction to pulses), some related to quality of your reproduction hardware (headphones often offer better detail than speakers, cheap sound cards may actually alias) and others.


There are other effects that you could call psychoacoustic. For example, our judgment of how loud a sound system is depends not only on its actual level but also on how much it is distorting.


approximate equal loudness adjustments (graphed with logarithmic Hz scale)
If you digitize the equal loudness curves and flip them, gearing them towards subtracting from 0dB whatever amount of dB applies for a frequency, you'll get something like the graph on the right. If you have the data this is based on, it's fairly simple to adjust post-FFT data for loudness.
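Given digitized contour data, that adjustment is mostly interpolation. A minimal sketch, with an explicitly made-up adjustment table standing in for real digitized curve data (for real use, replace it with values digitized from ISO 226 or similar):

```python
import math

# Illustrative adjustment table (NOT measured data): dB to add per
# frequency, loosely shaped like an inverted equal-loudness contour.
TABLE = [(31.5, -40.0), (125, -16.0), (500, -3.0), (1000, 0.0),
         (4000, 3.0), (16000, -10.0)]

def loudness_adjust_db(freq_hz):
    """Linearly interpolate the adjustment table on a log-frequency axis."""
    x = math.log10(freq_hz)
    pts = [(math.log10(f), a) for f, a in TABLE]
    if x <= pts[0][0]:
        return pts[0][1]
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    return pts[-1][1]
```

Applying this per FFT bin (adding the returned dB to each bin's dB magnitude) gives a roughly loudness-corrected spectrum.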


See also:

  • Fletcher, H., and Munson, W.A. (1933) Loudness, its Definition, Measurement, and Calculation. Journal of the Acoustical Society of America, 5 (2), 82-108
  • Robinson, D.W., and Dadson, R.S. (1956) A Redetermination of the Equal-loudness Relations for Pure Tones. British Journal of Applied Physics, 7, 166-181




Frequency filters and measures

There are a number of filters/measures, most of which try to take equal loudness curves into account, and/or are biased towards specific purposes.

Note that most are designed for simple signals, even if they are regularly used for complex ones (noise, music). The main exception in this list is ITU-R 468, which is more valid on complex signals than most others.


Perhaps best known are dBA and, to a lesser degree, dBC, commonly used to mechanically measure sound levels in roughly human terms. (Notation varies: dBA is also seen as dB(A), dBa, and other mild writing variations. Note that dBa has a different meaning in some contexts.)
Adjustments according to dB(A), dB(B), dB(C) (fairly simple shapes)

dB(A) cuts low frequencies, and in general is a simple approximation of human frequency response. It was made to measure relatively quiet sounds and low-level noise pollution - things that are annoying, but not damaging. Sound level meters on mixing panels and similar are often A-weighted, presumably because it roughly resembles human weighting (and probably also because it's simple to implement).

dB(C) is meant to approximate the ear at fairly loud sound levels, and leaves in more low frequencies - but not enough to evaluate the effect of low bass. It seems to often be used in traffic loudness measurements.


dB(Z) refers to zero weighting: it's flat over most of the spectrum, but has more defined cutoff points than 'flat' ratings left up to manufacturers.(verify) Not so useful for perceived loudness, but arguably the most useful for evaluating potential hearing damage.


Old and/or specific-purpose:

dB(B) lies somewhere between A and C, both in terms of frequency response and intended loudness levels. You could say it roughly models the ear at medium sound levels. It seems to be rarely used, perhaps because for ear modelling there are better models that are not much more complex.

dB(D) was intended for loud aircraft noise. It has a peak around 6kHz that models how people sense random noise differently from tones, particularly around there (verify).

dB(G) focuses primarily around sub-bass (20Hz) and is useful when measuring large slow movements, like wind turbines.



ISO 226 has a more complex frequency response than dBA/B/C, with pre-2003 versions based on the Robinson-Dadson results, and the 2003 version based on more recent equal loudness tests.


ITU-R 468 (note: ITU-R used to be CCIR), is a better approximation for noise and complex signals (dBA and its family were designed for pure tones), and also models our reduced sensitivity to short bursts and clicks to some degree. R468 has seen a lot of use in some specific fields.


Loudness meters vary in design. They may be unfiltered or use dBA, dBC, or other filters; they may respond slowly or quickly to large level differences (e.g. a peak programme meter is slower, almost ignoring few-millisecond peaks, since human hearing is tolerant of short distortion); they may decay slowly to give a multi-tasking broadcast operator more of a chance to get an idea of recent peaks; and they have other variations.



Phons and sones are proposed perceptual loudness units; neither is in particularly regular use beyond being referenced by some definitions.


Phons: at 1kHz, the phon value is defined to equal the dB SPL value. For other frequencies it is adjusted following the equal loudness curves.

Sones: a phon-based exponential scale with base two. The definition says that at 1kHz, 1 sone is 40 phons (probably making 1 sone a practical quiet sound rather than a theoretical hearing limit). The idea behind the base two is that a perceptual doubling of loudness (~10dB) means a doubling of the sones:

  sones     phons / dB SPL @ 1kHz
   0.5             30 (verify)
   1               40
   2               50
   4               60        
   8               70
  16               80
  32               90 
  64              100    
 128              110    
 256              120    
 512              130
1024              140

(For frequencies other than 1kHz this must be adjusted according to equal loudness curves).
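The table above follows directly from the base-two definition, which can be sketched as (function names are mine):

```python
import math

def phons_to_sones(phons):
    # each 10-phon step above 40 phons doubles the sones
    return 2.0 ** ((phons - 40.0) / 10.0)

def sones_to_phons(sones):
    # inverse: 40 phons is 1 sone, each doubling adds 10 phons
    return 40.0 + 10.0 * math.log2(sones)
```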



Frequency perception


In music, tones are usually seen in a way that originates in part from scientific pitch notation, in which an octave is a doubling of frequency. This already illustrates the non-linear nature of human hearing, but is in itself not actually an accurate model of human pitch perception.

Accurate comparison of frequencies and frequency bands can be an involved subject.


Perceived equal frequency intervals/distances are not easily captured in a simple function. If you've ever heard a frequency generator slowly and linearly increase its frequency, you'll know that it sounds to us like fast changes at the start, while past ~6kHz it's all a slightly changing high beep.

In addition, or perhaps in extension, the accuracy with which we can judge similar tones as not-the-same changes with frequency, and is also not trivial to model.

Frequency warping is often applied to attempt to linearize perceived pitch, something that can help various perceptual analyses and visualizations.


The critical bandwidth increases with frequency, and the pitch resolution we hear decreases accordingly.

The non-linear nature of frequency hearing, and the existence and approximate size of critical bandwidths, is useful information when we want to model our hearing.

It is also useful for things like lossy audio compression, since it tells us that to spend equal space on equally perceivable tones, we should allocate coding space non-linearly with frequency.


For background: Mathematical frequency intervals


In the mathematical scientific pitch notation, an octave refers to a doubling of the frequency, and cents are a (log-based) ratio defined so that there are 1200 steps in an octave, each a factor of 2^(1/1200).

Note that the human Just Noticeable Difference for tones is perhaps 5 cents, although tuning should be more accurate than that.

Notes can be referenced in a note-octave, a combination of semitone letter and the octave it is in. For example, middle C is C4, the C in the fourth octave. This scale is anchored at A4, at 440Hz.
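The cents definition and the A4 = 440Hz anchor can be sketched as follows (the `note_freq` helper, counting equal-tempered semitones from A4, is my own convenience and not a standard API):

```python
import math

def cents(f1_hz, f2_hz):
    """Interval from f1 to f2 in cents (1200 cents per octave)."""
    return 1200.0 * math.log2(f2_hz / f1_hz)

# Hypothetical helper: equal-tempered frequency counted in semitones
# from the A4 = 440Hz anchor (middle C, C4, is 9 semitones below A4).
def note_freq(semitones_from_a4):
    return 440.0 * 2.0 ** (semitones_from_a4 / 12.0)
```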



Bandwidths, frequency warping, and more


Note that studies and applications in this area usually address several things at once, typically one or more of:

  • estimation of the width of a critical band at a given frequency
  • frequency warping (Hz to a perceptually linear scale). (Note that a fitted critical band function can often serve as a decent approximate warper)
  • working towards a filterbank design that is an accurate model in specific or various ways (there are a number of psychoacoustic effects that you could wish to model, or ignore to keep things simple)


Also useful to note:

  • since some terms refer to a general concept, a term may refer to different formulae, and to different publications that pull in more or fewer related details.
  • There are a good number of convenient approximations around, which tend to confuse various summaries (...such as this one. I'll warn you against assuming this is all correct.)
  • while bands are sometimes reported as given widths around given centers, particularly for approximate filterbank designs, bands really do not have fixed positions. It is more accurate to consider a function that reports a width for arbitrary frequencies.
  • There is often some rule-of-thumb knowledge, and various models and formulae that are more accurate than that - but many formulae are still noticeably specific and/or inaccurate, often mostly for lowish and for high frequencies.

It's useful to clearly know the difference between critical band rate functions (mostly a frequency warper, a function from Hz to fairly synthetic units) and critical bandwidth functions (an estimate of the bandwidth of hearing at a given frequency, a Hz→Hz function). They are easily confused because they have similar names, similarly shaped graphs, and approximate rather related concepts.

A quite understandable source of confusion is that many mentions of critical band rate note that the interval between whole Bark units corresponds to critical bandwidths. This is approximately true, but only approximately; it is often a fairly rough attempt to simplify away the need for a separate critical bandwidth function. One of the largest differences between these two groups seems to be the treatment below ~500Hz, and details such as ERB lending itself more easily to filterbank design.

Another source of confusion is naming: earlier models often have a name involving 'critical band', while later and different models often mention ERB (equivalent rectangular bandwidth), and references can quite fuzzily refer to either of those two approximate sets, the whole, or sometimes the wrong one, making these names rather finicky to use.


Critical band rate:

  • Primarily a frequency warper
  • Units: input in Hz, output (usually) in Bark
  • often given the letter z
  • approximately linear to input frequency up to ~200Hz (~0 to 2 bark)
  • approximately linear to log of input frequency for ~500Hz..10kHz (~5 to 22 bark)

Critical bandwidth:

  • a function to approximate the width of a critical band at a given frequency (Hz->Hz)
  • Units: input in Hz, output in Hz
  • useful for model design, usually along with a frequency warper
  • approximately proportional to input frequency up to ~500Hz (sized ~100Hz per band)
  • approximately proportional to log of input frequency over ~1kHz (sized ~20% of center frequency)


If you like categories, you could say that you have CB rate functions (Bark units), ERB rate functions (ERB units), and from both those areas there are bandwidth functions.


Equivalent Rectangular Bandwidth (ERB) and ERB-rate formulae (both introduced some time after most of the critical-band and bandwidth scales mentioned here) approximate the relationship between frequency and the ear's critical bandwidth. More specifically, they do so using a modelled rectangular passband filter (with the same passband center as the auditory filter it models, and a similar response to white noise).

ERB bandwidth functions estimate bandwidths noticeably smaller than critical bandwidth for low frequencies (below ~500Hz).

There are a number of different investigations, measurements, and approximations to calculate an ERB.

The formulae most commonly used seem to be from (Glasberg & Moore 1990)(verify).




On Barks:

The term Bark scale (somewhat fuzzily) refers to most critical band rate functions.

The Bark unit was named in reference to Heinrich Barkhausen (who introduced the phon). Using bark as a unit was proposed in (Zwicker 1961), which was also perhaps the earliest summarizing publication on critical band rate (and makes a bark-unit-is-approximately-a-bandwidth note), and which reports a number of band centers and edges.

Bark is related to, but somewhat less popular than, the Mel scale.

The value of the Bark scale in tables often goes from 1 to 24, though in practice it can be a function meant to be usable from 0 to 25 (the 24th band extends up to ~15.5kHz; a 25th is easily extrapolated, and useful e.g. to deal with 44.1kHz/48kHz recordings).


There is also the Mel scale, which like Bark was based on empirical data, but has slightly different aims:

  • Frequency warping scale that aims for linear perceptual pitch units (verify)
  • Defined as f(hz) = 1127.0148 * ln(1 + hz/700)
  • Which is the same as 2595 * log10(1 + hz/700)
  • and some other definitions (approximation or exactly?)
  • The inverse (from mel to Hz) is f(m) = 700 * (e^(m/1127.0148) - 1)
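The Mel definition and its inverse above translate directly to code (function names are mine):

```python
import math

def hz_to_mel(hz):
    # f(hz) = 1127.0148 * ln(1 + hz/700)
    return 1127.0148 * math.log(1.0 + hz / 700.0)

def mel_to_hz(mel):
    # inverse: f(m) = 700 * (e^(m/1127.0148) - 1)
    return 700.0 * (math.exp(mel / 1127.0148) - 1.0)
```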



Various approximating functions
(early version of a graph of various related functions)


CB rate

Bark according to Zwicker and Terhardt 1980:

13*arctan(hz*0.00076) + 3.5*arctan((hz/7500)^2)
  • error varies, up to ~0.2 Bark


Bark according to Traunmuller 1990:

f(hz) = (26.81*hz)/(1960.0+hz) - 0.53
  • seems more accurate than the previous, particularly for the 200-500Hz range
  • seems to be the more usual function in use
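Both critical band rate approximations above are one-liners in code (function names are mine; at 1kHz both land near 8.5 Bark):

```python
import math

def bark_zwicker_terhardt(hz):
    # Zwicker & Terhardt (1980); error varies, up to ~0.2 Bark
    return 13.0 * math.atan(hz * 0.00076) + 3.5 * math.atan((hz / 7500.0) ** 2)

def bark_traunmuller(hz):
    # Traunmuller (1990); generally the more accurate of the two
    return (26.81 * hz) / (1960.0 + hz) - 0.53
```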


ERB rate

11.17 * ln( (hz+312)/(hz+14675) ) + 43.0

(mentioned at least in Moore and Glasberg 1983)


Bandwidth

Very simple rule-of-thumb approximation ('100Hz below 500Hz, 20% of the center frequency above that'):

max(100,hz/5)

(Zwicker Terhardt 1980, or earlier?)

25 + 75*( 1 + 1.4*(hz/1000)^2 )^0.69

...with bark (and not Hz) as input:

52548 / (z^2 - 52.56*z + 690.39)


6.23*khz^2 + 93.39*khz + 28.52 (input in kHz)
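The bandwidth approximations in Hz can be sketched as follows (function names are mine; the polynomial takes kHz internally):

```python
def cb_rule_of_thumb(hz):
    # '100Hz below 500Hz, 20% of the center frequency above that'
    return max(100.0, hz / 5.0)

def cb_zwicker_terhardt(hz):
    # 25 + 75*(1 + 1.4*(hz/1000)^2)^0.69
    return 25.0 + 75.0 * (1.0 + 1.4 * (hz / 1000.0) ** 2) ** 0.69

def erb_polynomial(hz):
    # 6.23*khz^2 + 93.39*khz + 28.52, with the input converted to kHz
    khz = hz / 1000.0
    return 6.23 * khz ** 2 + 93.39 * khz + 28.52
```

Comparing them at a few frequencies shows the pattern the notes describe: roughly constant bandwidth at low frequencies, roughly proportional to center frequency higher up, with ERB noticeably narrower below ~500Hz.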
See also
  • Traunmüller (1990) Analytical expressions for the tonotopic sensory scale, J. Acoust. Soc. Am. 88: 97-100
  • Glasberg, Moore (1990), Derivation of auditory filter shapes from notched-noise data, Hear. Res. 47, 103-138
  • Moore, Glasberg (1987) Formulae describing frequency selectivity as a function of frequency and level, and their use in calculating excitation patterns
  • Moore, Glasberg (1983) Suggested formulae for calculating auditory-filter bandwidths and excitation patterns, J. Acoust. Soc. Am. 74: 750-753
  • Zwicker, Terhardt (1980), Analytical expressions for critical-band rate and critical bandwidth as a function of frequency, J. Acoust. Soc. Am., Volume 68, Issue 5, pp.1523-1525
  • Zwicker (1961) Subdivision of the audible frequency range into critical bands (Frequenzgruppen), J. Acoust. Soc. Am. 33: 248


Filterbanks


The idea is to get a basic imitation of the ear's response.

While the following implementation is still a relatively basic model of the human cochlea, it reveals various human biases in hearing sound, and as such is quite convenient for a number of perceptual tasks.


The common implementation is a set of passband filters, often with overlapping response, to model response to different frequencies. The excitation a sound causes on the (~34mm-long) basilar membrane is approximately 1.5mm wide, which leads to the typical choice of using 20 to 24 of these filters.

That number of centers, and the according bandwidths (which are not exactly based on the 1.5mm), vary with each model's assumed facts, the frequency linearization/warping used, and the upper frequency limit you want the model to include. (For a sense of scale: the most important bandwidths are usually on the order of 100 to 1000Hz)
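As a sketch of how such centers might be placed, here is one hypothetical Bark-spaced layout using Traunmüller's critical band rate function and its algebraic inverse (the half-Bark center offsets are just one plausible choice, not a standard):

```python
def hz_to_bark(hz):
    # Traunmuller's (1990) critical band rate function
    return (26.81 * hz) / (1960.0 + hz) - 0.53

def bark_to_hz(z):
    # algebraic inverse of the function above
    return 1960.0 * (z + 0.53) / (26.28 - z)

# one plausible layout: 24 filter centers, one per Bark, at half-Bark offsets
centers_hz = [bark_to_hz(z - 0.5) for z in range(1, 25)]
```

Each filter would then get a bandwidth from a critical bandwidth (or ERB) function evaluated at its center.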

That frequency resolution may seem low, but is quite decent considering the size of the cochlea. The Just Noticeable Differences seem to be sub-Hz at a few hundred Hz, up to ten Hz at a few kHz, and up to perhaps ~30Hz at ~5-8kHz.

Reports vary considerably, probably because pure sines are much easier to judge than sounds with timbre, vibrato, etc., so the JND can easily double over these values. (verify)

Regardless, note that this is much better than the 1.5mm excitations suggest; there is obviously more at work than ~24 coefficients, and for us, much of this happens in post-processing in the brain.

Hearing damage



Masking

Other psychoacoustic effects

Localization

Selective attention



Auditory illusions

See also


  • Brian Moore, "Introduction to the psychology of hearing"
  • H. Fastl, E. Zwicker, "Psychoacoustics: Facts and Models" (relatively mathematical)
