Haar-like features

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.



A Haar-like feature considers adjacent rectangles in a detection window, sums the pixel intensities in each region, and uses the difference between these sums to categorize subsections of that window.
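As a toy illustration, the sketch below computes one two-rectangle feature in NumPy. The function and the example window are made up for illustration; a real detector evaluates many such features, at many positions and sizes:

```python
import numpy as np

def two_rect_feature(window):
    """Toy two-rectangle Haar-like feature: sum of the left half of a
    grayscale window minus sum of the right half."""
    w = window.shape[1]
    left  = window[:, : w // 2].sum()   # pixel intensities, left rectangle
    right = window[:, w // 2 :].sum()   # pixel intensities, right rectangle
    return int(left) - int(right)       # large magnitude = strong intensity edge

# A window that is dark on the left and bright on the right:
window = np.hstack([np.full((4, 4), 10), np.full((4, 4), 200)])
print(two_rect_feature(window))  # strongly negative: a vertical edge
```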

It's called Haar-like because this blocky-rectangle setup resembles Haar basis functions. There is no direct link to wavelets, Haar or otherwise.


Haar-like detectors in isolation are fast but very weak detectors, so they are typically combined with similar detectors at increasing levels of detail, in what is called a Haar cascade.

Once such a setup is a little over-complete, wrapped in the Viola-Jones object detection framework, and given a lot of training, you can get pretty decent object detection.

Because a window is run through a set of classifiers from broad to fine detail, each level is quick to test, and at each level we can say "actually, this doesn't look enough like our feature, stop now". It is set up to quit early, so we spend CPU roughly in proportion to how much something looks like the target object, and very little where no such object is visible.
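Schematically, that early-exit control flow is something like the sketch below. This illustrates the idea only, not any particular library's internals; the stage structure is an assumption:

```python
def cascade_accepts(window, stages):
    """stages: a list of (features, threshold) pairs, assumed ordered
    from the crudest, cheapest stage to the most detailed one."""
    for features, threshold in stages:
        score = sum(f(window) for f in features)  # a few cheap weak classifiers
        if score < threshold:
            return False  # quit early; most windows are rejected in the first stages
    return True           # survived every stage: report a detection
```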

When properly trained and tweaked, it has a surprisingly good speed/performance balance for how simple it seems.


Another detail that helps speed is that it calculates an integral image[1], which lets us do quick sums over arbitrary rectangular subregions, and thereby quickly get the average intensity of image subregions. This helps the above because the feature calculations need exactly those sums, and looking them up is much faster than re-adding pixels for every possible window.
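A minimal NumPy sketch of that (the function names are ours): building the integral image is two cumulative sums, after which the sum over any rectangle takes four lookups, whatever its size:

```python
import numpy as np

def integral_image(img):
    """ii[y, x] = sum of all pixels above and to the left of (y, x), inclusive."""
    return img.cumsum(axis=0).cumsum(axis=1)

def region_sum(ii, y0, x0, y1, x1):
    """Sum of img[y0:y1, x0:x1] via inclusion-exclusion on four corners."""
    total = ii[y1 - 1, x1 - 1]
    if y0 > 0:
        total -= ii[y0 - 1, x1 - 1]
    if x0 > 0:
        total -= ii[y1 - 1, x0 - 1]
    if y0 > 0 and x0 > 0:
        total += ii[y0 - 1, x0 - 1]
    return int(total)

img = np.arange(16).reshape(4, 4)
ii = integral_image(img)
assert region_sum(ii, 1, 1, 3, 3) == img[1:3, 1:3].sum()
```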


How well Haar cascades work, and how large the training set needs to be, depends on the objects you want to detect. Light-colored faces, for example, have some consistency in intensity properties - eyes are darker than their surroundings, cheeks are lighter.

However, given the very real variation in face shape and positioning, you need a large training set, and other skin tones make things more interesting too.


On object scale, and parameters

The detector works at the pixel size of your training data.

The same thing at different scales is detected by downsampling the image and running the detector again, usually guided by a given maximum and minimum detection size.

  • These sizes relate to CPU use and goals
  • The rescale factor to use relates to CPU use and accuracy


Note that during processing, a real object will typically yield many detections within a few pixels of each other, which you can use as a condition for a single robust detection.

Parameter-wise, this also depends a little on the choice of scale factor (and minimum size).
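For example, in OpenCV these knobs surface as parameters to CascadeClassifier.detectMultiScale; the values below are illustrative, not recommendations:

```python
import cv2

# One of the pre-trained cascade files that ships with OpenCV.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

img = cv2.imread("photo.jpg")                 # example input
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)  # the detector works on grayscale

faces = cascade.detectMultiScale(
    gray,
    scaleFactor=1.1,     # rescale step between levels; closer to 1 is finer but slower
    minNeighbors=5,      # how many nearby raw detections make one robust detection
    minSize=(40, 40),    # don't look for anything smaller than this...
    maxSize=(300, 300),  # ...or larger than this
)
for (x, y, w, h) in faces:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
```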


Improving detection speed

The easiest two solutions:

  • scale down the image
...as much as is sensible for the job and the accuracy you care about.
  • restrict the search area
For example, when detecting the most central face in video, you can often assume the face is in the middle 50% of the frame.
If/once you know you're tracking one face, you can typically assume it does not move much from its last position (see the sketch below).
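A rough sketch of the restrict-the-search-area idea, using OpenCV. The middle-50% crop and the parameter values here are assumptions for illustration:

```python
import cv2

# One of the cascade files that ships with OpenCV.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_central_face(frame):
    """Search only the middle 50% of a BGR video frame."""
    h, w = frame.shape[:2]
    x0, y0 = w // 4, h // 4
    roi = cv2.cvtColor(frame[y0 : y0 + h // 2, x0 : x0 + w // 2],
                       cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(roi, scaleFactor=1.2, minNeighbors=4)
    # Shift the detections back into full-frame coordinates.
    return [(x + x0, y + y0, fw, fh) for (x, y, fw, fh) in faces]
```

Since the detector's cost scales roughly with the number of windows tested, searching a quarter of the pixels is roughly a 4x saving before any other tuning.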




Haar training

See: