Descriptions used for sound and music


Physical effects and/or fairly well studied

Attenuation

A reduction in the energy (amplitude) of a signal.


Attenuation in the widest sense refers to the concept in physics where loss of energy (i.e. amplitude reduction) occurs in a medium (be it electronic equipment, a wall affecting your wifi signal, or what happens when you hear yourself chew).

Attenuation is often measured in decibels.

In some contexts it is decibels per unit length, for example to specify expected signal loss in electrical wiring, or in sound isolation.
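To make the units concrete, here is a minimal Python sketch of working with such figures; the 0.2 dB/m cable loss is a made-up example, not a real specification.

    import math

    def attenuation_db(amplitude_in, amplitude_out):
        # Attenuation in decibels between an input and output amplitude
        # (for power quantities the factor would be 10 instead of 20).
        return 20 * math.log10(amplitude_in / amplitude_out)

    print(attenuation_db(1.0, 0.5))   # halving the amplitude is ~6 dB of attenuation

    # With a per-length figure (hypothetical 0.2 dB/m), total loss just scales with length:
    print(0.2 * 30)                   # 6 dB over 30 m of cable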


In electrical signal transmission, it can refer to problems relating to analog transmission over larger distances, and can be related to the expected SNR (though there are more aspects to both signal and noise in transmission).


Physical attenuation often also varies with frequency, in which case you can make a graph, or give an average in the most relevant frequency region.

For example,

  • attenuation is the major reason we hear our own voice differently on recordings: we hear a good part of the lower frequencies through our body, while others only hear us through air (another reason is that some frequencies make it more directly to our ears)
  • microphones with stands made just of hard materials throughout are likely to pick up the vibrations of the things they stand on, which anything or anyone not in direct contact won't hear
  • materials used for sound insulation can be seen as bandstop filters (often relatively narrowband)


See also:
http://en.wikipedia.org/wiki/Attenuation
http://en.wikipedia.org/wiki/Stokes%27_law_(sound_attenuation)

Tone versus noise content

Reflection, absorption, echo, reverb

Sound hitting a hard surface will be reflected.

Larger rooms are likely to be mostly hard-surfaced (and also to have reverb).


An echo is an easily identifiable and usually quite singular copy of a sound, arriving later because it was reflected.

The delay is a significant aspect. Near walls it is minimal, and you may easily receive more energy from reflections than from the source directly. (Also note that localization is not affected very much.)


When many echoes combine to be blurred and hard to identify, this is called reverb.
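As a rough illustration of the difference (not from the article), a small numpy sketch that adds a single echo to a signal; reverb would be many such delayed copies, dense enough to blur together.

    import numpy as np

    def add_echo(signal, sample_rate, delay_s=0.3, gain=0.5):
        # Mix in one attenuated, delayed copy of the signal: a single, identifiable echo.
        delay = int(delay_s * sample_rate)
        out = np.zeros(len(signal) + delay)
        out[:len(signal)] += signal
        out[delay:] += gain * signal
        return out

    sr = 8000
    t = np.arange(sr) / sr
    clap = np.sin(2 * np.pi * 440 * t) * np.exp(-8 * t)   # a short decaying tone as a stand-in sound
    echoed = add_echo(clap, sr)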


Sound field descriptions

Note that:

  • These describe environments instead of sound qualities,
    ...yet often still relate to qualities, like how many relate to reverb somehow.
  • 'Sound field' usually refers to a specific area (or rather volume)
  • Some of these are more standardized terms (see e.g. ISO 12001) than others.


A free field refers to environments where sound is free to propagate without obstruction. (In practice the most relevant obstructions are reflective surfaces (like walls), so free field is often used to mean a lack of reverb - but also of other implied effects such as room modes.)


Direct field is the part of a sound field that has no reflections.

Reverberant field is the part of the sound field that does have reflections.

A diffuse field describes an environment with no preferred direction. A specific (and common enough) case is that there are so many reflections that the field is more or less uniform. (The term can also be used for light and other EM.)


Most rooms are reverberant / diffuse, with a world of variation. For example, empty rooms, cathedrals, and gyms have noticeably more reverb than rooms filled with randomly shaped and/or soft objects that scatter or absorb sound.

Anechoic chambers are rooms that attempt to remove all echo and reverb, to simulate a space with only the source in question, and at the same time have the environment act as a free field. It is typical to see porous, wedge-shaped sound absorbers (in part because the alternative is to have a huge space - and still some absorption).


Near field is the area around an emitter close enough that the size of the emitter still matters (since all of it emits the sound), via interference and phase effects, and where, physically, the sound pressure and particle velocity are not in phase.

This also tends to imply the level-per-distance dropoff (usually 6 dB per doubling of distance) goes a little funny close to an object.
The size of the near field varies with frequency and with the size of the sound source,
which is e.g. relevant for microphones specifically used for nearby voices.
A near-field monitor (which should really be called a direct field monitor, but studio engineers treat the two as the same thing) means placing speakers near you so that most of the sound you hear arrives without room reverb - which is important in mastering/mixing.


Far field is "far enough that the near field effect doesn't apply". Note that there will be a transition between the two, and where that is depends on frequency.

Resonance

Diffraction

Amplitude modulation (a.k.a. tremolo)

Frequency modulation (a.k.a. vibrato)

Amplitude envelope (attack, decay, sustain, release)

(also in terms of attention)

http://en.wikipedia.org/wiki/ADSR_envelope
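A minimal numpy sketch of such an envelope (the parameter values are arbitrary examples):

    import numpy as np

    def adsr(n, sample_rate, attack=0.01, decay=0.1, sustain=0.6, release=0.2):
        # Piecewise-linear ADSR amplitude envelope for n samples.
        # attack/decay/release are in seconds, sustain is a level (0..1).
        a, d, r = int(attack * sample_rate), int(decay * sample_rate), int(release * sample_rate)
        s = max(n - (a + d + r), 0)
        return np.concatenate([
            np.linspace(0, 1, a, endpoint=False),        # attack
            np.linspace(1, sustain, d, endpoint=False),  # decay
            np.full(s, sustain),                         # sustain
            np.linspace(sustain, 0, r),                  # release
        ])[:n]

    sr = 44100
    env = adsr(sr, sr)                                        # a one-second envelope
    tone = np.sin(2 * np.pi * 220 * np.arange(sr) / sr) * env # applied to a sine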


Harmonic content

Beat and tempo

The terminology around beat is often used a little fuzzily, and some of it matters more to performance or rhythmic feel, so for a basic description you care first about the pulse: the regularity of the beats, regardless of their precise rhythmic use.


For a lot of techno and other electronic music, that pulse is simply every beat. For some other music styles it is a somewhat more complex thing, with short-term and longer-term patterns - which sometimes get so involved that humans have trouble describing them, or even feeling them.


The tempo of most music lies within 40-200 beats per minute (BPM). The median varies with music style, but is often somewhere around 105 BPM.





Computing BPM

The simplest way to detect the tempo of music is to focus entirely on the punchy, bassy beat.


The simplest form of that may be to do some heavy lowpassing/bandpassing (leaving mainly 50-100 Hz) and look for onsets, where an onset is the start of a longer-lasting low-frequency thing, specifically its sudden increase in amplitude.

Onsets are an appealing approach because they sidestep a lot of complex frequency content, and also let you focus on the slower structure - after all, 60 BPM is one event per second, and 180 BPM still means events roughly 330 ms apart.
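A rough numpy-only sketch of that first step (frame/hop sizes and the band edges are just example values): per-frame energy in a low band, differenced and half-wave rectified so that only increases remain. The resulting envelope, and its frame rate, is what the tempo estimation sketch further down works on.

    import numpy as np

    def low_band_onset_envelope(x, sample_rate, frame=2048, hop=512, lo=50.0, hi=100.0):
        # Per-frame energy in roughly the 50-100 Hz band, keeping only increases
        # (a crude onset-strength envelope for the bassy beat).
        freqs = np.fft.rfftfreq(frame, 1.0 / sample_rate)
        band = (freqs >= lo) & (freqs <= hi)
        window = np.hanning(frame)
        n_frames = 1 + (len(x) - frame) // hop
        energy = np.empty(n_frames)
        for i in range(n_frames):
            spectrum = np.abs(np.fft.rfft(window * x[i * hop:i * hop + frame]))
            energy[i] = spectrum[band].sum()
        increases = np.maximum(np.diff(energy, prepend=energy[0]), 0.0)
        return increases, sample_rate / hop   # envelope, and its frame rate (frames per second)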


Now, most of that may be relatively simple on a punchy beat, but harder on more complex sound. Research into human judgment of onsets is complex and ongoing.


Onsets don't always match the perception of tempo anyway - consider e.g. blues with guitars, where fast strumming gives clear and periodic onsets, which would easily make algorithms decide on a tempo a factor higher than most humans would for that style.


Methods may implicitly assume a straight beat, and fall apart around blues shuffles, swing, use of triplets, stronger off-beats, syncopation, and basically any more interesting rhythm.

Some of that can be fixed by trying to detect the pulse, with some basic assumptions.

And if you're going to try to detect measures/bars, then you probably want to consider downbeat detection, detecting which beat is first in each measure.

All this involves more and more music theory and assumptions.


Approaches include

  • Onset detection plus post-processing
    Most onsets are easy to detect
    Not all music has clear onsets
    Not all tempo is defined by onsets
    Changing tempo makes things harder


  • Autocorrelation of energy envelope(s)
    the overall energy envelope alone is poor information; for it to work on more than techno you would probably want at least a few subbands


  • Resonators (of energy envelopes)
    similar to autocorrelation, though can be more selective (verify)
    can be made to deal with tempo changes
    based on recent evidence, so the start of a song is always a poor guess due to lack of evidence (though there are ways around that, and in some applications it does not matter)
    Related articles often cite Scheirer (1997), "Tempo and beat analysis of acoustic musical signals"
    ...which notes that people typically still find the beat when you corrupt music into six subbands of noise that keep the amplitude envelope of the musical original (but not when you reduce it to a single band, i.e. just the overall amplitude), suggesting you could work on this much-simplified signal
    roughly: six amplitude envelopes, differentiated (to see changes in amplitude), half-wave rectified (to keep only increases), and comb filters used as tuned resonators, some of which will phase-lock, then somewhat informed peak-picking
    ...the tuned-resonator idea was inspired by Large & Kolen (1994), "Resonance and the perception of musical meter"


  • Chroma changes
    to deal with beat-less music (verify)


Goto & Muraoka (1994), "A Beat Tracking System for Acoustic Signals of Music"

suggests a sort of multi-hypothesis system looking at several



  • Beatgraph
    More of a visualization than a beat analysis?
    a column is a single bar's worth of amplitude
    Used e.g. in bpmdj
    http://werner.yellowcouch.org/Papers/beatgraphs12/


  • Tempogram
    local autocorrelation of the onset strength envelope


  • Cyclic tempogram
    Grosche (2010), "Cyclic Tempogram - A Mid-Level Tempo Representation for Music Signals"
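As a minimal sketch of the autocorrelation/tempogram idea (not a reimplementation of any of the papers above): autocorrelate an onset-strength envelope, such as the one from the earlier sketch, and pick the best-scoring tempo in the 40-200 BPM range. Expect the usual octave ambiguity - half or double the 'right' tempo often scores nearly as well.

    import numpy as np

    def estimate_bpm(onset_envelope, envelope_rate, bpm_min=40, bpm_max=200):
        # envelope_rate: frames per second of the onset-strength envelope.
        env = onset_envelope - onset_envelope.mean()
        autocorr = np.correlate(env, env, mode='full')[len(env) - 1:]   # non-negative lags only
        best_bpm, best_score = None, -np.inf
        for bpm in range(bpm_min, bpm_max + 1):
            lag = int(round(envelope_rate * 60.0 / bpm))   # beat period of this tempo, in frames
            if 0 < lag < len(autocorr) and autocorr[lag] > best_score:
                best_score, best_bpm = autocorr[lag], bpm
        return best_bpm

    # env, rate = low_band_onset_envelope(x, sr)   # from the earlier sketch
    # print(estimate_bpm(env, rate))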



TODO: look at

    Goto (2001), "An Audio-based Real-time Beat Tracking System for Music With or Without Drum-sounds"
    Dixon (2001), "Automatic extraction of tempo and beat from expressive performances"
    Dixon (2006), "Onset Detection Revisited"
    Alonso et al. (2004), "Tempo and Beat Estimation of Musical Signals"
    Collins (2012), "A Comparison of Sound Onset Detection Algorithms with Emphasis on Psychoacoustically Motivated Detection Functions"


Musical key

Computing musical key
For audio, this is usually tackled in two steps: transcribing the audio to notes (MIDI or a similar internal representation), and then analysing those notes - having things quantized to notes makes the second step much easier.

In theory you can get relatively far just by seeing how well the notes fit each possible key (e.g. Krumhansl-Schmuckler style profile matching), but that only gets you so far, because songs use notes outside their key, and compositions may change key partway.

These methods still easily make a few mistakes, including:

  • reporting a parallel key (same root, but mistaking major for minor or vice versa, because they are largely compatible)
  • reporting a relative key (for every major key there is a minor key starting somewhere else that uses the same notes)
  • being off by a fifth or a fourth, basically because that is a largely harmonious key (many notes are shared) - which is why e.g. the circle of fifths is a thing

Since key detection is frequently used for DJing, where key compatibility matters more than the key name, the result is often filtered through the Camelot wheel: twelve numbers plus A/B, where adding or subtracting one is a step around the circle of fifths and A/B switches between relative minor and major, reducing the advice mostly to "shift one number at a time and/or between A and B".

Transcribing audio to notes is itself nontrivial. For harmonically simple sounds autocorrelation methods work decently; more complex sounds amount to fundamental frequency estimation, which is its own topic (see pitch detection algorithms).

See also:

https://en.wikipedia.org/wiki/Pitch_detection_algorithm
https://music.stackexchange.com/questions/70214/why-is-automatic-key-detection-hard
Gerhard (2003), "Pitch Extraction and Fundamental Frequency: History and Current Techniques"

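A minimal sketch of the profile-matching step, assuming you already have a 12-bin chroma histogram (total weight per pitch class); the profile numbers are the commonly quoted Krumhansl-Kessler values and should be treated as approximate.

    import numpy as np

    # Commonly quoted Krumhansl-Kessler key profiles (listener-experiment averages; approximate).
    MAJOR = np.array([6.35, 2.23, 3.48, 2.33, 4.38, 4.09, 2.52, 5.19, 2.39, 3.66, 2.29, 2.88])
    MINOR = np.array([6.33, 2.68, 3.52, 5.38, 2.60, 3.53, 2.54, 4.75, 3.98, 2.69, 3.34, 3.17])
    NAMES = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']

    def estimate_key(chroma):
        # chroma: 12 values, total weight (e.g. duration or energy) per pitch class C..B.
        # Correlate against the major/minor profile rotated to every possible tonic and
        # return the best match - the Krumhansl-Schmuckler idea in its simplest form.
        best = None
        for tonic in range(12):
            for name, profile in (('major', MAJOR), ('minor', MINOR)):
                score = np.corrcoef(chroma, np.roll(profile, tonic))[0, 1]
                if best is None or score > best[0]:
                    best = (score, f'{NAMES[tonic]} {name}')
        return best[1]

    # A C major scale, each note counted once: should come out as C major or its
    # relative A minor - exactly the kind of ambiguity described above.
    print(estimate_key([1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1]))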
Less studied, less well defined, and/or more perceptual qualities

Humans are quick to recognize and follow other properties, better than algorithmic approaches. They include:

(Timbre)

Timbre often appears in lists of sound qualities, but it is very subjective and has been used as a catch-all term: generally it means something like "whatever qualities allow us to distinguish these two sounds (that are similar in pitch and amplitude)".

A large factor in this is the harmonic/overtone structure, but a lot more gets shoved in.


tonal contours/tracks (ridges in the spectrogram)

(particularly when continuous and followable)


Spectral envelope; its changes

microintonation

Some different sounds / categories

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

There are various typologies of sounds, but many are very subjective in that they are not unambiguously resolvable to signal properties -- they are often somewhat forced.


Consider:

  • continuous harmonic sounds, such as sines and other simple predictable signals
  • continuous noise (unpredictable in the time domain)
  • impulses (short lived)

Pulses, noises, and tones could be seen as simpler extremes in a continuum, where various in-betweens could be described, such as:

  • tonal pulses / wavelets
  • tonal/narrow-band noise
  • pulsed noise bursts
  • chirp
  • various real-world noises, such as
    • rustle noise [1]
    • babble noise

You can argue about the perceptual usefulness of these categories, as they do not distinguish sounds the same way we do.


Some useful-to-know music theory

On fingerprinting and identification

http://labrosa.ee.columbia.edu/projects/coversongs/ (paper: http://www.ee.columbia.edu/~dpwe/pubs/EllisP07-coversongs.pdf)
http://www.foosic.org/libfooid.php
http://werner.onlinux.be/Papers/bpm04/index.html

Analysis and/or fingerprinting

See also

http://en.wikipedia.org/wiki/Acoustic_fingerprint
[1] Cano et al. (2002) "A review of algorithms for audio fingerprinting"
[2] Wood (2005), "On techniques for content-based visual annotation to aid intra-track music navigation"


Software and ideas

This list focuses on software and ideas that a project of yours may have some hope of using. There are more (see links below) that are purely licensed services.


Acoustid notes

Acoustid is the overall project.

Chromaprint is the fingerprinting part. The standalone fingerprinter is called fpcalc (which hashes the start of a file).

Licenses (https://acoustid.org/license):

The client is LGPL
the server is MIT license
the data is Creative Commons Attribution-ShareAlike (verify)


Used e.g. by MusicBrainz (based on submission, e.g. via Picard, Jaikoz, or anything else that uses the API), making it interesting for music identification and tagging.
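A rough sketch of using this from Python, assuming fpcalc (from Chromaprint) is installed and that you have registered an AcoustID application key; it shells out to fpcalc and queries the AcoustID lookup web service (mind the documented rate limits).

    import json, subprocess, urllib.parse, urllib.request

    def chromaprint_of(path):
        # Assumes a reasonably recent fpcalc on the PATH; its -json output contains
        # the duration in seconds and the compressed fingerprint string.
        out = subprocess.run(['fpcalc', '-json', path], capture_output=True, check=True, text=True)
        info = json.loads(out.stdout)
        return info['duration'], info['fingerprint']

    def acoustid_lookup(apikey, path):
        # AcoustID web service lookup (see http://acoustid.org/webservice); 'apikey' is your
        # registered application key. Returns candidate matches, here with MusicBrainz recordings.
        duration, fp = chromaprint_of(path)
        params = urllib.parse.urlencode({
            'client': apikey,
            'duration': int(duration),
            'fingerprint': fp,
            'meta': 'recordings',
        })
        with urllib.request.urlopen('https://api.acoustid.org/v2/lookup?' + params) as resp:
            return json.loads(resp.read())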



See also:

https://acoustid.org/
http://acoustid.org/chromaprint
http://oxygene.sk/lukas/2011/01/how-does-chromaprint-work/
https://bitbucket.org/acoustid/

Echoprint notes

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

tl;dr: pointless (for consumers)


Echonest is the company.

Echoprint is a fingerprint-like thing, produced by its acoustic code generator (codegen), which was open sourced in 2011.

Their metadata storage/searching server is also available.

Echonest's data is owned by them, but publicly available - the license basically says "if you use our data and add to it, you must give us your additions".

They also have a lot of metadata and fingerprints.


However, their service -- looking up songs from ~20 seconds of audio -- was closed in late 2014, basically because Spotify had bought Echonest.

You can still look at their metadata, you can still use their data, and codegen is still available (being MIT-licensed code), but you would have to build your own database/search service from their components.


The Echo Nest



See also:

pHash notes

A few algorithms, for image, video, audio. See http://www.phash.org/docs/design.html

Audioscout is based on the audio one.

See also:


Audioscout

See also:

fdmf

http://www.w140.com/audio/


last.fm's fingerprinter

Combination of fingerprinter and lookup client. Available as source.

Fingerprinter is based on [3]


Fingerprinter license: GPL3

Client lookup: "Basically you can do pretty much whatever you want as long as it's not for profit."

See also:

Fooid notes

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Fooid is a fairly simple FOSS music fingerprinting library. Its fingerprint is mostly a simplified spectrogram, allowing fuzzy comparisons between songs, and it is pretty decent at near-duplicate detection.


While still available, it seems defunct now? (website's been dead for a while)

foosic seems related?


What a signature represents

To summarize: libfooid

  • takes the first 90 seconds (skipping silence at the start)
  • resamples to 8 kHz mono (helps reduce the influence of sample-rate differences, high-frequency noise, and some encoder peculiarities)
  • does an FFT
  • sorts that into 16 Bark bands


Since these strengths are about to be packed into a few bits (namely 2 bits), they are first rescaled so that the most typical variation will be relatively distinguishing (based on a bunch of real-world music).

Per frame, the information you end up with is:

  • the strength in 16 Bark bands (2-bit),
  • which band was dominant in this frame(verify).

Fingerprint and matching

A full fingerprint is 424 bytes (printable as 848-character hex), consisting of

  • A 10-byte header, recording
    • version (little-endian 2-byte integer, should currently be zero)
    • song length in hundredths of seconds (little-endian 4-byte integer)
    • average fit (little-endian 2-byte integer)
    • average dominant line (little-endian 2-byte integer)
  • 414 bytes of data: 87 frames worth of data (each frame totals 38 bits, so the last six bits of those 414 bytes are unused). For each frame, it stores:
    • fit: a 2-bit value for each of 16 Bark bands
    • dom: a 6-bit value denoting the dominant spectral line

Fit and dom are non-physical units on a fixed scale (a different one in the averages), so that they are directly comparable between fingerprints.


The header is useful in itself for discarding likely negatives - if two things have a significantly different length, average fit, or average line, it's not going to be the same song (with some false-negative rate for different values of 'significantly').
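A small Python sketch of that quick-discard step, parsing just the header fields described above (the length threshold is a made-up example):

    import struct

    def parse_fooid_header(fp_bytes):
        # The 10-byte header described above: version, song length (hundredths of a second),
        # average fit, average dominant line, all little-endian.
        version, length_cs, avg_fit, avg_dom = struct.unpack('<HIHH', fp_bytes[:10])
        return {'version': version, 'length_s': length_cs / 100.0,
                'avg_fit': avg_fit, 'avg_dom': avg_dom}

    def obviously_different(fp_a, fp_b, max_length_diff_s=5.0):
        # Cheap negative test on headers only, before comparing the 414 bytes of frame data.
        ha, hb = parse_fooid_header(fp_a), parse_fooid_header(fp_b)
        return abs(ha['length_s'] - hb['length_s']) > max_length_diff_s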

You can:

  • do some mildly fuzzy indexing to select only those that have any hope of matching
  • quickly discard potentials based on just the header values
  • get a fairly exact comparison value by decoding the fingerprint data and comparing those values too.


With the detailed comparison, which yields a 0.0..1.0 value, it seems that(verify):

  • >0.95 means it's likely the same song
  • <0.35 means it's likely a different song
  • inbetween means it could be a remix, in a similar style, or just accidentally matches in some detail (long same-instrument intro)


See also

  • forks on github

MusicIP, MusicDNS, AmpliFIND

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Proprietary, and its latest lifecycle seems to be a licensed B2B service without public interface.


The company was first called Predixis, was later and best known as MusicIP (~2000), died in 2008, relaunched as AmpliFIND Music Services(verify) (in 2009?), and sold its intellectual property to Gracenote (2011? 2006?).


Probably most known for the MusicDNS service (which was at some point rebranded as AmpliFIND(verify)), which mostly consists of:

  • Their servers - for comparison, and which returned PUID (Portable Unique IDentifiers) on close-enough matches
  • a client library - which generates an acoustic summary and queries using it

When an acoustic query to their databases matches something closely enough, a PUID is returned, which seems to be a randomly generated identifier (not a fingerprint).

All the interesting parts are proprietary. The MusicDNS client library implements 'Open Fingerprinting Architecture', but this is only about the querying, which is sort of useless without the acoustical analysis, lookup method, or the data.

Relatable TRM

Proprietary.

Used by MusicBrainz for a while, which found it useful for finding duplicates, but its lookup had problems with collisions and scaling (meaning its server was unreliably slow), and Relatable did not seem to want to invest in it, so its use in MusicBrainz was replaced.


http://www.relatable.com/tech/trm.html


MusicURI

See also




Unsorted

Moodbar

Assigns a single color to fragments within music, to produce a color-over-time summary that gives an impression of the sort of sound.


Mostly a CLI tool that reads audio files (using gstreamer) and outputs a file that essentially contains a simplified spectrogram.


Apparently the .mood generator's implementation

  • mainly just maps energy in low, medium, and high frequency bands to blue, green, and red values.
  • always outputs 1000 fragments, which means
    • useful to tell apart parts of songs
    • visual detail can be misleading if song lengths differ significantly
    • not that useful for rhythmic detail, for similar reasons


Something else renders said .mood file into an image, e.g. Amarok, Clementine, Exaile, gjay (sometimes with some post-processing).

The file contains r,g,b uint8 for each of the (filesize/3) fragments.
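So reading one back is trivial; a minimal numpy sketch:

    import numpy as np

    def read_mood(path):
        # A .mood file is just consecutive (r, g, b) uint8 triples, one per fragment
        # (typically 1000 fragments, as described above).
        data = np.fromfile(path, dtype=np.uint8)
        return data.reshape(-1, 3)   # one row per fragment: [r, g, b]

    # colors = read_mood('song.mood')   # shape: (number_of_fragments, 3)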


See also: