Image processing notes

These are primarily notes.
It won't be complete in any sense.
It exists to contain fragments of useful information.
This page is in a collection about both human and automatic dealings with audio, video, and images, including


Audio physics and physiology

Digital sound and processing


Image

Video

Stray signals and noise


For more, see Category:Audio, video, images

A lot of this is experiment and work in progress, and very little of it has been tested to academic or even pragmatic standards. Don't trust any of it without testing it yourself.

It's also biased to Python, because I like rapid prototyping. I can always write fast code later.



Noise reduction

gaussian blur

(or other simple interpolating blurs)

Upsides:

  • Simple. Fairly fast.
  • does not introduce spurious detail

Downsides:

  • indiscriminately removes (high-)frequency content. a.k.a. "Smears everything"
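
A minimal sketch in Python (scipy), on a synthetic noisy square; a larger sigma means more noise removal but also more smearing:

```python
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(0)
noisy = np.zeros((64, 64))
noisy[16:48, 16:48] = 1.0                          # a simple square as test content
noisy += rng.normal(scale=0.2, size=noisy.shape)   # additive gaussian noise
blurred = ndimage.gaussian_filter(noisy, sigma=2)  # larger sigma: smoother, but more smeared
```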

median filtering

Upsides:

  • Simple. Not quite as fast as you'd think.
  • rejects outliers; best example is rejecting salt and pepper noise
  • will preserve edges better than e.g. linear interpolation

Downsides:

  • can remove high-frequency signal
  • the edge preservation depends on some conditions, so it doesn't always happen. The mix can look odd.
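
A quick sketch (scipy) on synthetic salt noise, which is the case where the median shines:

```python
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(0)
gray = np.zeros((64, 64))
gray[16:48, 16:48] = 1.0                         # a simple square as test content
noisy = gray.copy()
noisy[rng.random(gray.shape) < 0.05] = 1.0       # sprinkle salt noise
denoised = ndimage.median_filter(noisy, size=3)  # 3x3 median rejects the isolated outliers
```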

total variation denoising

Varies amount of blur by the amount of variation near the pixel.

Which means it mostly lessens noise in otherwise flat regions, while leaving spikes and edges mostly intact.

Upsides:

  • This tends to look more detailed than a basic mean filter, particularly on sharp images

Downsides:

  • It can't really tell which edges are real; on subtler images the result can be much like a mean filter
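
A sketch using scikit-image's Chambolle implementation; 'weight' is the knob that trades noise removal in flat regions against flattening real detail:

```python
import numpy as np
from skimage.restoration import denoise_tv_chambolle

rng = np.random.default_rng(0)
clean = np.zeros((64, 64))
clean[16:48, 16:48] = 1.0                        # a square: flat regions plus sharp edges
noisy = clean + rng.normal(scale=0.2, size=clean.shape)
# higher weight flattens noise more aggressively, at the cost of real detail
denoised = denoise_tv_chambolle(noisy, weight=0.15)
```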


See also:

bilateral denoise

Reduce noise while preserving edges.

Averages pixels based on their spatial closeness and radiometric (value) similarity, and potentially other metrics. Like total-variation denoising in that it easily preserves edges, yet it is often truer to the photographic original than total-variation denoising is.


Playing with:
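
For example, scikit-image's implementation (a sketch; sigma_spatial is the "closeness" part, sigma_color the value-similarity part):

```python
import numpy as np
from skimage.restoration import denoise_bilateral

rng = np.random.default_rng(0)
clean = np.zeros((64, 64))
clean[16:48, 16:48] = 1.0
noisy = np.clip(clean + rng.normal(scale=0.1, size=clean.shape), 0, 1)
# pixels are averaged only with neighbours that are both nearby and similar in value
denoised = denoise_bilateral(noisy, sigma_color=0.1, sigma_spatial=3)
```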

non-local means denoising

See also:


Anisotropic diffusion

See also:

Wiener filter

See also:


Halftoning, dithering

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)


Ordered dithering

Floyd Steinberg dithering

https://en.wikipedia.org/wiki/Floyd%E2%80%93Steinberg_dithering
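
A rough pure-Python sketch of the error-diffusion idea, mostly to show the 7/16, 3/16, 5/16, 1/16 weights (slow loops, not production code):

```python
import numpy as np

def floyd_steinberg(gray):
    """Floyd-Steinberg error diffusion to 1-bit, on a float grayscale image in [0, 1]."""
    img = gray.astype(float).copy()
    h, w = img.shape
    for y in range(h):
        for x in range(w):
            old = img[y, x]
            new = 1.0 if old >= 0.5 else 0.0      # quantize to black or white
            img[y, x] = new
            err = old - new
            # push the quantization error onto not-yet-visited neighbours
            if x + 1 < w:
                img[y, x + 1] += err * 7 / 16
            if y + 1 < h:
                if x > 0:
                    img[y + 1, x - 1] += err * 3 / 16
                img[y + 1, x] += err * 5 / 16
                if x + 1 < w:
                    img[y + 1, x + 1] += err * 1 / 16
    return img
```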


Jarvis, Judice, and Ninke dithering

Stucki dithering

Atkinson dithering

Burkes dithering

Sierra dithering

Color conversion

...because some spaces are much more sensible to work in: linear distance in them is closer to perceptual distance than it is in the more standard spaces.
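
For example, via scikit-image (a sketch; the file name is a placeholder):

```python
from skimage import color, io

img = io.imread("photo.jpg")        # placeholder path
lab = color.rgb2lab(img)            # CIELAB: L* is lightness, a*/b* are color opponents
hsv = color.rgb2hsv(img)            # HSV: hue, saturation, value
back = color.lab2rgb(lab)           # and back again after processing
```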


Automatic illuminant correction

The intent is usually to take out any mild tint that the illuminant has, or to correct a camera's mis-estimation of the illuminant.

In other words, mainly white balance correction, a.k.a. gray balance.


Color correction often comes down to

  • estimating what the illuminant probably was
    • photographers who care about color accuracy tend to use a gray card to have a known-absolute reference
    • without one, it's based on assumptions
    • and incorrect assumptions can introduce unnatural tinting in the process
  • applying chromatic adaptation so that that illuminant effectively becomes a given illuminant, such as D65 (mid-day sun), or just numerically equalizing channels.


There are some more specific cases you could focus on, such as when you know color filters were used; for faded photographs you could consider the dyes in use - their relative fading is usually documented.


Gray world

The idea: in a well balanced image, the average color is a neutral gray.

So we scale channels to make the average become gray, typically implemented by making the averages of the red, green, and blue channels the same, often with a basic linear gain per channel.
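
A minimal sketch of that, assuming a float RGB image in [0, 1]:

```python
import numpy as np

def gray_world(img):
    """Gray-world white balance sketch: scale each channel so its mean matches the overall mean."""
    img = img.astype(float)
    channel_means = img.reshape(-1, 3).mean(axis=0)
    gains = channel_means.mean() / np.maximum(channel_means, 1e-8)
    return np.clip(img * gains, 0.0, 1.0)
```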


The results vary a little with the color space you do this in.

Gray world makes decent sense when large patches are expected to be neutral, which is e.g. true in darker photographs.


Pro:

  • Simple
  • works well at removing the illuminant's tint on images with a lot of white, a lot of dark areas, and/or a lot of each color, or on a photograph that had such neutrality but then had a tint applied to it


Con:

  • Only holds for images that had a roughly white illuminant
    • if the image is deliberately tinted, it assumes you want to fully negate that tint - e.g. images intentionally taken under red light will effectively just ramp up green and blue, of which there was very little, so the result likely comes out quite unnatural
    • can be tempered by limiting that gain
  • effectively uses all of the image equally for measurement
    • an assumption that is flawed to varying degrees



Auto levels

The idea: The brightest color should be white, the darkest color should be black.

So: rescale each channel's histogram to span the full range, accepting that some (e.g. 1%) will lie outside and be truncated.
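
A minimal per-channel sketch, assuming a float RGB image in [0, 1]:

```python
import numpy as np

def auto_levels(img, clip_percent=1.0):
    """Per-channel contrast stretch; roughly clip_percent of pixels at each end gets truncated."""
    out = np.empty_like(img, dtype=float)
    for c in range(img.shape[2]):
        lo, hi = np.percentile(img[..., c], [clip_percent, 100 - clip_percent])
        out[..., c] = np.clip((img[..., c] - lo) / max(hi - lo, 1e-8), 0.0, 1.0)
    return out
```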


Pro:

  • brightest color becomes white, which on outside photos is often good enough
  • less sensitive to single-color use than gray world

Con:

  • suffers from similar problems to gray world
  • not sensitive to how small the brightest area is. For example, a bright window in the background means a white wall will be ignored.

Retinex

Retinex is in itself a wider theory / set of explanations dealing with various color constancy effects, also dealing with local context and some human interpretation.


In this (whole-image color correction) context, it mainly says that perceived white tends towards the strongest cone signal.

Which is sort of a gentler-defined variant of gray world, referring to the overall effects rather than a single spot.


It roughly means that the maximum within each channel should be the same. While RGB isn't eye cones, it's close enough to work well.

The correction could be implemented as just linear gain on each channel to make the maxima the same. It seems that instead of using the plain maximum, using a very-high-percentile point is good at ignoring outlier pixels.
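
A minimal sketch of that percentile variant (sometimes called white-patch), assuming a float RGB image in [0, 1]:

```python
import numpy as np

def white_patch(img, percentile=99.0):
    """White-patch sketch: gain each channel so a high percentile of each becomes equal.
    Using a percentile below 100 ignores outlier pixels."""
    img = img.astype(float)
    maxima = np.percentile(img.reshape(-1, 3), percentile, axis=0)
    gains = maxima.max() / np.maximum(maxima, 1e-8)
    return np.clip(img * gains, 0.0, 1.0)
```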


Pro:

  • simple idea, simple code
  • better behaved than gray world, in that it avoids many larger color shifts

Con:

  • sometimes too cautious, e.g. does little on overexposed images
  • still makes mistakes on some images, e.g. those with no near-illuminant color, or where the illuminant is quite colored (which breaks the underlying assumption as much as it does for gray world)
    • can be tempered by limiting the difference between the gains applied - because significantly different gains usually make no sense
    • ...though that, in combination with the percentile logic, can have some odd side effects

Gray world and retinex

"Combining Gray World and Retinex Theory for Automatic White Balance in Digital Photography" argues that combining the two makes sense.

Which requires a little trickery as linear correction alone cannot satisfy both criteria at once.

Robust Automatic White Balance

Essentially a variant of gray world that is selective about the areas it uses, primarily looking for nearly-white parts, so it e.g. isn't distracted by the average of the colored parts.


Pro:

  • Doesn't make as many mistakes as plain gray world
  • outside photos usually have such near-whites, so this makes sense for them

Con:

  • images may not have representative near-whites
  • selection of areas to use turns out to be harder than it sounds, depending on how robust you want it to be.


J Huo et al., "Robust Automatic White Balance Algorithm using Gray Color Points in Images"

More reading

http://ipg.fer.hr/ipg/resources/color_constancy

D Nikitenko et al., "Applicability Of White-Balancing Algorithms to Restoring Faded Colour Slides: An Empirical Evaluation"

A Rizzi et al., "A new algorithm for unsupervised global and local color correction"

D Cheng et al., "Illuminant Estimation for Color Constancy: Why spatial domain methods work and the role of the color distribution"

Multiple related images

Median of pixel along set of images

...emphasizing the consistent/still areas, which is typically what you would consider the background.

The common example is "in a still scene with some tourists wandering about, make your camera take a couple dozen photos over a minute, and median them", because most of the pixels will be very stable, and the people moving about will be outliers. (Note that anyone who was sitting in one spot will probably become blurry, because they'll be a composite from multiple photos.)
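
The median itself is a one-liner once the frames are aligned and collected as same-sized arrays:

```python
import numpy as np

def median_stack(frames):
    """Per-pixel median over a list of aligned, same-size image arrays (H, W[, C])."""
    return np.median(np.stack(frames, axis=0), axis=0)
```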



Differential image

Often refers to keeping track of a longer-term average, and subtracting individual frames from it.

This takes out everything that's been there consistently (...lately), and highlights detail in areas with movement.

One example is stationary traffic video, focusing mainly on the cars, because it easily removes entirely-static things like the roads, signs, and lane detail, and also things that are static on the scale of minutes, such as lighting gradients.
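
A sketch of the running-average version; alpha controls how quickly the background model adapts:

```python
import numpy as np

class RunningBackground:
    """Exponential running-average background model; a sketch of frame differencing."""

    def __init__(self, alpha=0.02):
        self.alpha = alpha
        self.background = None

    def foreground(self, frame):
        frame = frame.astype(float)
        if self.background is None:
            self.background = frame.copy()
        diff = np.abs(frame - self.background)            # what differs from the longer-term average
        # update the background after differencing, so slow changes get absorbed over time
        self.background = (1 - self.alpha) * self.background + self.alpha * frame
        return diff
```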

Superresolution

There are various distinct things called superresolution.


From the information-theoretical view, you can split these into:

image-processing/geometrical superresolution

Basically "what you can do after it's an image", including assumptions/knowledge of what the image sensor and optics do.


Optical/diffractive superresolution

Plays with the diffraction limit of the optics of a system

See e.g. https://en.wikipedia.org/wiki/Super-resolution_microscopy



See also

HDR and exposure fusion

The eyes are good at adapting locally to the amount of light, e.g. seeing details in a dark room even when there's also a bright window in our view, in part because our eyes have different areas and have a logarithmic response - and also because we're used to exploiting these specifics, intuitively.

Film and digital sensors aren't good at this, both because it's so much easier to build a single linear response to the overall lighting, and because that makes sense for fast response, wide applicability, and capturing what's there accurately. But yeah, they suck at the window scene - they would probably adjust to the bright window, which washes out the dark bits, and there's no obvious way to cheat and imitate our eyes. (Or they adjust to the dark detail, and end up with one mightily overexposed window.)


High Dynamic Range imaging roughly imitates our eyes, by cheating a bit.

You take images with different exposures (e.g. window-nice-and-dark-rest-washed-out, details-in-dark-and-window-way-overexposed), and synthesize an image that has detail in both areas, roughly by locally weighing whichever image seems to give more detail.

Exposure fusion has only the 'piece together more detail' goal.
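
OpenCV ships a Mertens-style exposure fusion; a sketch, with placeholder file names and assuming the frames are already aligned:

```python
import cv2
import numpy as np

# Exposure fusion: per-pixel weights based on contrast, saturation and well-exposedness.
frames = [cv2.imread(p) for p in ("under.jpg", "normal.jpg", "over.jpg")]   # placeholder paths
fused = cv2.createMergeMertens().process(frames)                            # float output, roughly 0..1
cv2.imwrite("fused.jpg", np.clip(fused * 255, 0, 255).astype(np.uint8))
```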

HDR has more steps, producing an immediate result that has more dynamic range than monitors can show.

When the purpose is human viewing, this is often done by being nonlinear the same way our eyes are - and often a little more aggressively than our own eyes would be, which tends to have side effects that look like halos/blooming, and some areas having unnatural contrast.

Another purpose is reproduction, e.g. in 3D rendering, which preserves HDR throughout enough of the pipeline to make an informed decision about use of the range - which tends to mean you don't wash away details in the darkest or lightest areas. Some of these techniques are now common because they help things look good for relatively little extra processing. Some are fancy and GPU-intense. It's a spectrum. See e.g. HDR rendering


Motion descriptors

Object tracking

Whole-image descriptors

Describes all of the image, as opposed to describing specific features you found.

...there is natural overlap with dense descriptors, which analyse a whole image in independent patches.

Color descriptors

MPEG-7

A color histogram was used in development, but for many uses this is too high-dimensional.

The color space is fixed in each descriptor, because not doing so would hurt interoperability.


Scalable Color Descriptor (SCD)

Defined in HSV


Color Structure Descriptor (CSD)

Defined in HMMD


See also:

Color Layout Descriptor (CLD)

YCbCr space

Dominant colors

Color quantization
Clustering
Merged histogram
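
A rough sketch of the clustering approach, using k-means (assumes scikit-learn is available; k is a free parameter):

```python
import numpy as np
from sklearn.cluster import KMeans

def dominant_colors(img, k=5):
    """K-means on RGB pixels; returns the cluster centers and their relative shares."""
    pixels = img.reshape(-1, 3).astype(float)
    km = KMeans(n_clusters=k, n_init=10).fit(pixels)
    counts = np.bincount(km.labels_, minlength=k)
    return km.cluster_centers_, counts / counts.sum()
```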

Texture descriptors

Texture is harder to quantify than you may expect.


Some methods are much easier to apply to constrained single purposes, e.g. medical imaging, than to arbitrary images.

Some things work better in lab conditions (e.g. recognizing known textures), and some work well enough to e.g. recognize differences between areas in a picture, but robustly labeling textures (e.g. scale-, rotation-, and lighting-invariantly) is hard.



MPEG-7

  • HTD (Homogeneous Texture Descriptor)
    • idea: use Fourier analysis to get basic frequency and direction information
    • in frequency space (2D FFT amplitudes), divide into 5 octave-style rings and 6 directions
    • making for 30 bins
    • plus two: overall mean and stdev
  • EHD (Edge Histogram Descriptor)
    • idea: detect which direction detected edges go, and make a histogram of that; useful for overall comparisons
  • TBD (Texture Browsing Descriptor)
    • idea: indicators of perceptual directionality, regularity, and coarseness of a texture
    • most dominant texture orientation (0 meaning none, 1..6 meaning 0..150 degrees in steps of 30)
    • second most dominant texture orientation (0 meaning none, 1..6 meaning 0..150 degrees in steps of 30) (optional)
    • regularity in first orientation
    • regularity in second orientation (optional)
    • coarseness
    • Implementation based on a bank of orientation- and scale-tuned Gabor filters.
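
A very rough sketch of that Gabor-bank idea (closest in spirit to HTD's per-band mean/stdev; not the MPEG-7 reference implementation), using scikit-image:

```python
import numpy as np
from skimage.filters import gabor

def gabor_texture_features(gray, frequencies=(0.05, 0.1, 0.2, 0.4), n_orientations=6):
    """Mean and stdev of Gabor response magnitude over a scale/orientation filter bank.
    'gray' is a 2D float grayscale image; the frequency list is just a ballpark choice."""
    feats = []
    for f in frequencies:
        for i in range(n_orientations):
            theta = i * np.pi / n_orientations
            real, imag = gabor(gray, frequency=f, theta=theta)
            magnitude = np.hypot(real, imag)
            feats.extend([magnitude.mean(), magnitude.std()])
    return np.array(feats)
```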

Grey-level Co-occurrence

Frequency-based

Unsorted

Unsorted

Image entropy

Can mean

  • the overall entropy of a whole block
  • for each pixel, the entropy of the values around it

Roughly an estimate of local contrast / texture.
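
The per-pixel variant is easy to play with in scikit-image (a sketch on a synthetic half-flat, half-noisy image):

```python
import numpy as np
from skimage.filters.rank import entropy
from skimage.morphology import disk

rng = np.random.default_rng(0)
gray_u8 = (rng.random((64, 64)) * 255).astype(np.uint8)
gray_u8[:, :32] = 128                        # left half flat, right half noisy
local_entropy = entropy(gray_u8, disk(5))    # high in the textured half, near 0 in the flat half
```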


Image moments

Gist

Feature detection and description

Related tasks

Classical features

The classical set of features are (a subset of) things that happen at few-pixel scale:

  • Points -
  • Blobs - smooth areas that won't (necessarily) be detected by point detection. Their approximate centers may also be considered interest points
  • Edges -
    • a relatively one-dimensional feature, though with a direction
  • Corners - things like intersections and the ends of sharp lines
    • a relatively two-dimensional kind of feature
  • Ridges -
  • Interest point - could be said to be any of the above, and anything else you can describe clearly enough
    • preferably has a clear definition
    • has a well-defined position
    • preferably quite reproducible, that is, stable under relatively minor image alterations such as scale, rotation, translation, brightness.
    • useful in their direct image context - corners, endpoints, intersections
  • Region of interest
    • any subrange (1D), area (2D), volume (3D), etc. identified for a purpose.
    • Also often in an annotative sense, not necessarily a machine-proffered one


See also:


Edge detection

  • Canny [1]
  • Differential [2]
  • Canny-Deriche [3]
  • Prewitt [4]
  • Roberts Cross operator [5]
  • Sobel [6]
  • Scharr operator - variation on Sobel that tries to deal better with rotation [https://en.wikipedia.org/wiki/Sobel_operator#Alternative_operators]


Old:

  • Marr-Hildreth [7]

Playing with (mostly python)

  • PIL has ImageFilter.FIND_EDGES (convolution-based)
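
A sketch of both that PIL filter and scikit-image's Canny (the file name is a placeholder):

```python
from PIL import Image, ImageFilter
from skimage import color, feature, io

# PIL's built-in convolution edge filter
edges_pil = Image.open("photo.jpg").convert("L").filter(ImageFilter.FIND_EDGES)

# Canny via scikit-image: gaussian smoothing, gradient, non-maximum suppression, hysteresis
gray = color.rgb2gray(io.imread("photo.jpg"))
edges_canny = feature.canny(gray, sigma=2.0)   # boolean edge map
```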


Interest point / corner detection

Blob detection

Laplacian of Gaussian (LoG)
Difference of Gaussians (DoG)
Determinant of Hessian (DoH)
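
All three of the above are available in scikit-image; a quick sketch on one of its bundled example images:

```python
from skimage import color, data, feature

gray = color.rgb2gray(data.hubble_deep_field())     # bundled example image
blobs_log = feature.blob_log(gray, max_sigma=30, threshold=0.1)
blobs_dog = feature.blob_dog(gray, max_sigma=30, threshold=0.1)
blobs_doh = feature.blob_doh(gray, max_sigma=30, threshold=0.01)
# each row is (row, col, sigma); sigma relates to the blob radius
```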
MSER (Maximally Stable Extremal Regions)

Detects covariant regions: connected areas whose gray levels stay stable over a wide range of thresholds.

Primarily a region/blob detector. Sensitive to blur.

Decent performance.

See also:


PCBR

Principal curvature-based region detector

https://en.wikipedia.org/wiki/Principal_curvature-based_region_detector

Harris affine

https://en.wikipedia.org/wiki/Harris_affine_region_detector

Hessian affine

https://en.wikipedia.org/wiki/Hessian_affine_region_detector

Dense descriptors

Dense meaning it describes the whole image a patch at a time. As opposed to sparse, meaning for selective areas (often features).

The distinction can be subtle - dense may just mean we don't assume that we can reliably select good features/areas to study.


Any overall descriptor used locally

...color, texture, or such.

Lets you

  • describe the variation of said descriptors within an image
  • focus on areas where things are happening


Image gradient

At each point in an image, you can calculate which way the local gradient points - essentially a vector per pixel.

In theory this is based on the local derivative; in practice it's a discrete differentiation operator, such as Sobel or Prewitt (or other kernel-style things - actually quite akin to edge detection that isn't particularly tuned to a single direction, as some are).

Kernel-based methods tend to work on at least 3x3 pixel areas, though may be larger depending on application.
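
A Sobel-based sketch returning per-pixel magnitude and direction:

```python
import numpy as np
from scipy import ndimage

def gradient_field(gray):
    """Image gradient via Sobel kernels: per-pixel magnitude and direction (radians)."""
    gray = gray.astype(float)
    gy = ndimage.sobel(gray, axis=0)   # change along rows (vertical)
    gx = ndimage.sobel(gray, axis=1)   # change along columns (horizontal)
    return np.hypot(gx, gy), np.arctan2(gy, gx)
```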


https://en.wikipedia.org/wiki/Image_gradient


Histogram of Oriented Gradients (HOG)

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Refers to the general idea of locally detecting gradients, which is a concept used by a whole family of algorithms.

And to a fairly specific use, doing this for the entire image, on fixed-size, small cells (e.g. 8x8 pixel).


For each cell, we build a histogram of how much (magnitude-wise) its parts point in each direction (e.g. the 8 basic compass directions) -- with some footnotes, like bleeding into adjacent bins to be more resistant to aliasing.


This may well be the first step in something else, e.g. detection of certain objects by training on results.


Due to being based on differences (plus some normalization), it is fairly resistant to illumination differences.

It is somewhat sensitive to orientation. Due to its nature it's not too hard to make it less sensitive, though by that time you may find SIFT more interesting.
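
scikit-image has an implementation; a sketch on a bundled example image (the cell and block sizes are just common choices):

```python
from skimage import color, data
from skimage.feature import hog

gray = color.rgb2gray(data.astronaut())      # bundled example image
features, hog_image = hog(
    gray,
    orientations=8,                          # 8 direction bins per cell
    pixels_per_cell=(8, 8),
    cells_per_block=(2, 2),                  # block normalization helps illumination resistance
    visualize=True,
)
```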


Variations:

  • R-HOG: rectangular (typically square)
  • C-HOG: circular
  • Fourier HOG
Rotation invariant


See also:

Gist

See also:

Sparse/local descriptors

Sparse meaning it describes local areas, and is selective about which parts, as opposed to doing so for the whole image.

Feature description for things like image comparison is based on the idea that considering all points in an image for description is infeasible, so informative points are chosen instead. The challenge then becomes choosing highly informative and stable points.


SIFT (Scale-Invariant Feature Transform)

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

(patented)

Read up on local gradients, particularly HOG.

SIFT continues that idea by analyzing the area around an already chosen point of interest -- often after deciding the rotation and scale of the patch it will be analysing based on local content.(verify)

SIFT is often a first step in something else, such as object recognition (often bag-of-words style), or aligning similar images in cooperation with RANSAC.
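
Recent OpenCV builds include a SIFT implementation; a sketch, with a placeholder file name:

```python
import cv2

gray = cv2.imread("photo.jpg", cv2.IMREAD_GRAYSCALE)   # placeholder path
sift = cv2.SIFT_create()                                # needs a reasonably recent opencv-python
keypoints, descriptors = sift.detectAndCompute(gray, None)
# each keypoint has a position, scale and orientation; descriptors is (n, 128) float32
```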


See also:

  • CSIFT uses color information, giving more stable features around color contrast
  • PCA-SIFT uses PCA instead of the gradient histogram, and its output is more compact
  • GSIFT adds global context to each keypoint (verify)
  • ASIFT's features are robust to more affine transforms(verify)
  • See also SURF - has a similar goal but uses different methods for most steps
  • See also SPIN, RIFT (but SIFT usually performs better(verify))
  • See also FIND, MIFT



GLOH (Gradient Location and Orientation Histogram)

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

(patented)

See also:


SURF (Speeded Up Robust Features)

faster than SIFT, performs similarly


See also


LESH (Local Energy based Shape Histogram)

http://en.wikipedia.org/wiki/LESH


FAST (Features from Accelerated Segment Test)

Mainly a feature detector

E Rosten, T Drummond (2006) "Machine learning for high-speed corner detection"

BRIEF (Binary Robust Independent Elementary Features)

M Calonder et al. (2010) "BRIEF: Binary robust independent elementary features"


ORB (Oriented FAST and Rotated BRIEF)

Offered as an efficient alternative to SIFT (and SURF), and also not patented.
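
A sketch with OpenCV (placeholder file name); note the binary descriptors want Hamming-distance matching:

```python
import cv2

gray = cv2.imread("photo.jpg", cv2.IMREAD_GRAYSCALE)     # placeholder path
orb = cv2.ORB_create(nfeatures=500)
keypoints, descriptors = orb.detectAndCompute(gray, None)
# binary descriptors: match with a Hamming-distance BFMatcher rather than L2
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
```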

See also:

Unsorted

K Mikolajczyk, C Schmid (2005) "A Performance Evaluation of Local Descriptors"

Combining descriptors

Indexing descriptors and/or making descriptors more compact, for retrieval systems and/or fingerprint-style descriptors - often meaning a useful lower-dimensional representation.

Bag-of-features

Fisher vector

Vector of Locally Aggregated Descriptors (VLAD)

Object detection, image segmentation

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

'Object detection' tends to refer to detecting anything more complex than a point, edge, blob, or corner.

Recent study has been into the compositional nature of objects.


Image segmentation splits an image into regions. Depending on the task this can be or help object detection, be or help texture detection, ignore the background, separate objects/textures to help process each individually, etc.

Unsorted

  • Structure tensor

Scale space

Scale space is a concept that makes detection of things work at multiple/varied scales.

Roughly speaking, it's a series of images lowpassed to different degrees, also in part because that keeps detected coordinates valid on each image.


In practice the images may also be scaled down (which implies a lowpass), if the algorithm it's supporting deals with that more easily (e.g. always looks at few-pixel scale, can't tweak how many). Note that scaledown and lowpass are not identical: a gaussian filter is fairly ideal in terms of frequency information (which is why scale space is often specifically gaussian scale space), while scaledowns can introduce some spurious, jagged-looking information (varying with the scaledown method). So in some cases the scaledown happens after filtering.
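
A sketch of the non-downsampled (gaussian) version: one lowpassed copy per sigma, so coordinates stay comparable across levels:

```python
from scipy.ndimage import gaussian_filter

def gaussian_scale_space(gray, sigmas=(1, 2, 4, 8, 16)):
    """One Gaussian-lowpassed copy of a 2D grayscale image per sigma; sigmas are a ballpark choice."""
    gray = gray.astype(float)
    return [gaussian_filter(gray, sigma=s) for s in sigmas]
```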


Motivations include:

  • Most current feature recognition works on a small scale (and often in terms of pixels). We'd like to also detect larger objects, without doing complex compositional things.
  • When we look at a scene, the fact that we recognize objects means we look at it at different scales.
e.g. from a distance we might identify the house, close-up we'll look at the door.
  • "when you squint", or see from a distance, or zoom out, that's essentially a lowpass


It turns out that anything that you can do via differentials (such as common feature detectors (edge, ridge, corner, etc.)) can be done without a rescale.


See also:

Stroke Width Transform (SWT)

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

For each pixel, finds the likeliest stroke width containing that pixel. Somewhat aware of direction, and often part of letter detection.

Uses edge and gradient map.


Pro:

  • not tied to detecting text of a specific size, can deal with rotation and skew
  • not overly sensitive to background gradients

Con:

  • slow (because of the intermediate maps)
  • Tends to assume hard contrast (and may assume text is much darker)


Hough transform

Finds imperfect versions of regular features like lines (the first version did only lines), circles, and ellipses. Essentially votes in a feature space.

http://en.wikipedia.org/wiki/Hough_transform
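
A sketch with scikit-image's straight-line Hough transform, on a synthetic diagonal line:

```python
import numpy as np
from skimage.transform import hough_line, hough_line_peaks

edges = np.zeros((100, 100), dtype=bool)
edges[np.arange(100), np.arange(100)] = True      # a diagonal line in a boolean edge map
h, angles, dists = hough_line(edges)
for _, angle, dist in zip(*hough_line_peaks(h, angles, dists)):
    # each vote-space peak is one detected line, in (angle, distance from origin) form
    print(f"line at {np.degrees(angle):.1f} degrees, {dist:.1f} px from origin")
```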


Kernel-based Hough transform (KHT)

On OCR

Transforms mostly used to support others

Morphological image processing

See also:

Whole-image transforms

Gamma compression as a perceptive estimator

bandpass, blur, median

For color analysis we often want to focus on the larger blobs and ignore small details (though in some cases those fall away in statistics anyway).


Variance image

Of a single image: each pixel is defined by the variance in a nearby block of pixels. Good at finding sharp details and ignoring gradients.
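
A sketch via box filters, using variance = E[x^2] - E[x]^2:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def variance_image(gray, size=5):
    """Per-pixel variance over a size x size neighbourhood of a 2D grayscale image."""
    gray = gray.astype(float)
    mean = uniform_filter(gray, size=size)
    mean_sq = uniform_filter(gray * gray, size=size)
    return np.maximum(mean_sq - mean * mean, 0.0)   # clamp tiny negative values from rounding
```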


(Sometimes refers to variance of a pixel within a stack of related images)


http://siddhantahuja.wordpress.com/2009/06/08/compute-variance-map-of-an-image/

Convolution

Fourier transform

Gabor

https://en.wikipedia.org/wiki/Gabor_filter

Difference of Gaussians (DoG)

DoG (Difference of Gaussians) takes an image, makes two gaussian-lowpass-filtered results with different sigmas, and subtracts them from each other.

This is much like bandpass, in that it preserves the spatial information in a range relating to the sigmas/radii.


Often mentioned in the context of edge detection. In that case, there may be further tweaking in the sigmas, and further steps in cleaning and counting zero crossings.
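
A minimal sketch (scipy); recent scikit-image versions also ship skimage.filters.difference_of_gaussians:

```python
from scipy.ndimage import gaussian_filter

def difference_of_gaussians(gray, sigma_small=1.0, sigma_large=2.0):
    """Subtract two Gaussian-blurred copies, keeping detail between the two scales."""
    gray = gray.astype(float)
    return gaussian_filter(gray, sigma_small) - gaussian_filter(gray, sigma_large)
```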


Compare Laplacian of Gaussian.

See also:

Laplacian of Gaussian (LoG)

The Laplacian reacts to local rapid change. Since this makes it very sensitive to noise, it is often seen with some smoothing first, e.g. in the form of the LoG (Laplacian of Gaussian)

Determinant of Hessian (DoH)

http://scikit-image.org/docs/dev/auto_examples/features_detection/plot_blob.html?highlight=difference%20gaussians

Radon transform

Not-specifically-image processing

...that find use here, often generic signal processing.


RANSAC

Kalman filter

Nontrivial goals

Edge-aware transforms

Image registration

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Image registration[13] is the fancy name for "aligning two nearly identical images"


In some cases, e.g. in various astronomy work, things can be well constrained, so you can get a lot of use out of assuming that only a moderate amount of translation (with implied edge cropping) happens, and no scaling and no (or almost no) rotation.

Which is relatively simple and controlled. This is often done with something like cross-correlation, often specifically phase correlation.
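
A translation-only sketch with scikit-image's phase correlation, on a synthetically shifted image:

```python
import numpy as np
from scipy.ndimage import shift as nd_shift
from skimage.registration import phase_cross_correlation

rng = np.random.default_rng(0)
ref = rng.random((128, 128))
moving = nd_shift(ref, shift=(4.0, -2.0))        # simulate a known translation
offset, error, diffphase = phase_cross_correlation(ref, moving, upsample_factor=10)
aligned = nd_shift(moving, shift=offset)         # offset is roughly (-4, 2), undoing the shift
```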


It gets more complex if you want to solve for cropping, rotation, and uniform and/or non-uniform scaling - typically on top of translation. The combination often means you need an iterative approach, and note that this is not a convex problem -- there are potentially local minima that may not be the optimal point, or that are plain nonsense, so simplex-type solutions will not always work without some guiding assumptions / context-informed filtering.


For example,

  • a bunch of photos from a tripod will usually see no worse than a few-pixel shift, and possibly a tiny rotation.
handheld more of both
  • internet-reposts see a bunch of rescaling and cropping (though rarely rotation)
  • a series of successive frames from an electron microscope may see a shift in the stage
and sometimes of parts of the sample (e.g. in cryo-EM in reaction to the beam)
...yet usually a shift-only, constrained-within-a-few-pixels solution already goes a long way



See also:

Translation-only:


Near-duplicate detection, image similarity, image fingerprinting, etc.

Image segmentation

Quick shift

SLIC

Felzenszwalb
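
All three of the above are in scikit-image; a sketch on a bundled example image (parameters are just ballpark values):

```python
from skimage import data
from skimage.segmentation import felzenszwalb, quickshift, slic

img = data.astronaut()                           # bundled example image
segments_slic = slic(img, n_segments=250, compactness=10)
segments_quick = quickshift(img, kernel_size=3, max_dist=6, ratio=0.5)
segments_fz = felzenszwalb(img, scale=100, sigma=0.5, min_size=50)
# each result is a label image: an integer region id per pixel
```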

Retrieval systems

Semi-sorted

Image summaries and analysis

Histograms

Co-occurrence matrices

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Studies pixel-level texture. Often of grayscale images, then called GLCM.

With some data massaging it is useful for various other things.


Studying only pairs of adjacent pixels may sound fragile (e.g. noise-sensitive), but works better than you may think, in part because most images are large enough to have a lot of pairs.


There are some limitations and tweaks you should understand before you can apply this robustly.

There is value in pre-processing, but note that e.g. sizing down and blurring will mostly just move things towards the main diagonal -- which relates to consistent areas and gradients, already pronounced to start with in most images.


The output is a symmetric n-by-n matrix, where

  • each value in the matrix has counted how often the two values (indicated by the row and column of its location) co-occur.
  • n is the quantization level - usually 256 since you would typically run this on 8-bit images. (image quantization beforehand can be useful in some applications)


There are some overall image properties you can calculate from that matrix, including:

  • Contrast (in nearby pixels)
The sum-of-squares variance
  • Homogeneity (of nearby pixels)
Which is basically the same as 'to what degree are the matrix values in the diagonal'
also a good indication of the amount of pixel-level noise
  • Entropy (within nearby pixels)
  • Energy, a.k.a. uniformity
sum of squares
Value range: 0..1, 1 for constant image
  • Correlation
...of nearby pixels, meaning (verify)
  • The amount of bins that are filled can also give you a decent indication of how clean an image is.
...e.g. giving you some indication of whether it has a single solid background color, and where it is on the spectrum of two-level, paletted, clip-art with little-blur, badly JPEG-compressed diagram, photo.
  • Directionality of an image, if you calculate values for different directions and see variance of such values.


Further notes:

  • Direction of the pixel pairs can matter for images with significant noise, high-frequency content, and/or extreme regularity.
Usually the implementation allows you to give either a x,y pixel offset (for e.g. 'take a pixel, and the pixel one to the right and one down') or an angle.
  • Some implementations also allow calculation for various distances. This is quite similar to running the calculation over resized images, but may be somewhat faster.
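
A sketch with scikit-image, computing the matrix and a few of the summary properties mentioned above (recent versions spell these graycomatrix/graycoprops; older ones use 'grey'):

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops

def glcm_features(gray_u8, distances=(1,), angles=(0, np.pi / 4, np.pi / 2, 3 * np.pi / 4)):
    """Symmetric, normalized co-occurrence matrix of an 8-bit grayscale image,
    then a few summary properties averaged over the distances/angles."""
    glcm = graycomatrix(gray_u8, distances=distances, angles=angles,
                        levels=256, symmetric=True, normed=True)
    return {prop: graycoprops(glcm, prop).mean()
            for prop in ("contrast", "homogeneity", "energy", "correlation")}
```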


See also:

Unsorted

Detecting manipulation

Error level analysis

Principal Component Analysis

Demosaic analysis

See also