Out-Of-Band Management notes

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

Out-of-band management can be seen as a tiny computer (as part of the motherboard, or as a plug-in module) that is observing the larger computer you really use.


It allows things like:

  • monitoring sensors
      sometimes regardless of whether the host is powered on or off
  • logging hardware errors
  • the ability to remotely control power, e.g. reboot
  • the ability to remotely observe (and interact with) the boot process
      ...and control of the BIOS/EFI, option-ROM firmware, etc.

...and, in particular, the option to do most of that remotely, not requiring physical access.



It's centered around a BMC (Baseboard Management Controller), though implementations vary in details, features, and tools.


And names.

IPMI is the relatively generic, vendor-neutral variation.

Certain brands have their own solutions, often in part built on IPMI (verify), including:

Dell's DRAC / iDRAC
HP's iLO (Integrated Lights-Out)
Sun's LOM port (Lights Out Management port)
IBM's RSA (Remote Supervisor Adapter)


And recently

Intel's AMT, aimed at regular computers rather than servers
seems a subset of server BMC features (verify)



IPMI (and extensions)

Enabling it

Servers usually have it enabled by default, workstations with server motherboards less often.


Depending on the setup, parts of it may be disabled and therefore not OS-accessible (or, if ethernet is also not set up, not accessible at all).

For example, on some boards you won't be able to talk to your own BMC until you enable the COM ports and set Remote Console to one of them (verify)


Connecting to it

Readout


The interesting things are mostly in the SEL (System Event Log)

sudo ipmitool sel elist

...and the SDR (Sensor Data Record repository, more of a realtime thing).

sudo ipmitool sdr

status column:

ok is okay
ns means no sensor
nc is non-critical error
cr is critical error
nr is non-recoverable error
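
To pick out only the sensors that need attention, the status column can be filtered. The following is a minimal sketch, assuming the usual three-column `name | reading | status` layout of `ipmitool sdr`; the function name `non_ok_sensors` is made up for this example.

```shell
# List sensors whose status column is anything other than ok or ns.
# Reads `ipmitool sdr`-style lines ("name | reading | status") on stdin.
non_ok_sensors() {
    awk -F'|' '
        NF >= 3 {
            status = $3
            gsub(/^[ \t]+|[ \t]+$/, "", status)   # trim padding around the status
            if (status != "ok" && status != "ns") {
                name = $1
                gsub(/^[ \t]+|[ \t]+$/, "", name) # trim padding around the name
                print name ": " status
            }
        }'
}

# Typical use on a live system (needs root and the ipmi kernel modules):
#   sudo ipmitool sdr | non_ok_sensors
```

No output means nothing is reporting nc, cr, or nr.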


Fan speeds



tl;dr:

  • fans on server motherboards are typically BMC-controlled (specifically the SDR? (verify)), which makes sense since the BMC is also in charge of the sensors.
  • IPMI does not expose much control
  • The only thing you can change is the fan thresholds
      essentially a per-sensor "is this cause for more cooling?" thing.
  • this is a little frustrating when you want more control, or when it integrates less than it could (e.g. it only measures the memory temperature and ignores the CPU and disks, so if the fans cool less than they should, the OS still has to throttle the CPU)


Note:

  • not all boards have useful sensors - some just have a baseboard/environment sensor and do not read out the CPU or anything else
  • Fan RPM is typically also sensed
      A low-RPM fan can be read as "oh, that probably failed", which triggers emergency cooling, i.e. all fans go to full speed. For a low-speed fan, that in itself resolves the sensor reading, so the host flip-flops between loud critical cooling and a presumably much quieter mode.
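
Since thresholds are the one knob available, the usual way out of that flip-flopping is to lower the fan's lower thresholds to below its normal idle RPM. The sketch below only prints the `ipmitool sensor thresh` command rather than running it. The helper name `fan_lower_thresh_cmd`, the sensor name FAN1, and the RPM values are all placeholders, not values from any particular board.

```shell
# Build (but do not run) the ipmitool command that sets a sensor's three
# lower thresholds. Arguments: sensor name, then the lower non-recoverable,
# lower critical, and lower non-critical values, in that order.
fan_lower_thresh_cmd() {
    if [ "$#" -ne 4 ]; then
        echo "usage: fan_lower_thresh_cmd SENSOR LNR LCR LNC" >&2
        return 1
    fi
    printf 'ipmitool sensor thresh "%s" lower %s %s %s\n' "$1" "$2" "$3" "$4"
}

# FAN1 and the RPM values are placeholders; check `sudo ipmitool sensor`
# for the real sensor names and current thresholds first.
fan_lower_thresh_cmd FAN1 100 200 300
```

Printing the command first makes it easy to review before running it as root.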



Problems

Could not open device at /dev/ipmi0, ...

Two common and simple cases:

  • I forgot to load the kernel modules
  • I forgot to run as root (sudo)

In short:

modprobe ipmi_msghandler
modprobe ipmi_devintf
modprobe ipmi_si

...should fix most cases.
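
To see which of those modules are actually missing, the `lsmod` output can be checked. This is a rough sketch; the function name `missing_ipmi_modules` is made up for this example. Note that a driver compiled into the kernel does not show up in `lsmod`, so a module reported as missing here is a hint, not proof of a problem.

```shell
# Print which of the three IPMI modules do not appear in lsmod-style input.
# Reads `lsmod` output on stdin; the first column is the module name.
missing_ipmi_modules() {
    loaded=$(awk '{ print $1 }')
    for mod in ipmi_msghandler ipmi_devintf ipmi_si; do
        # -x matches the whole line, -F treats the name as a fixed string
        printf '%s\n' "$loaded" | grep -qxF "$mod" || echo "$mod"
    done
}

# Typical use:
#   lsmod | missing_ipmi_modules
```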


System Event Log full

http://www.thomas-krenn.com/de/wiki/BMC_System_Event_Log_(SEL)_Full_bei_Intel_Server
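
As a rough sketch of checking how full the log is: `ipmitool sel info` reports a "Percent Used" line on the BMCs I'd expect, though the exact output may differ per BMC, so treat the parsing below as an assumption. The function name `sel_percent_used` is made up for this example.

```shell
# Extract the "Percent Used" number from `ipmitool sel info`-style input.
sel_percent_used() {
    awk -F':' '/^Percent Used/ { gsub(/[ \t%]/, "", $2); print $2 }'
}

# Typical use on a live system:
#   sudo ipmitool sel info | sel_percent_used
# To empty a full log, save a copy first and then clear it:
#   sudo ipmitool sel elist > sel-backup.txt
#   sudo ipmitool sel clear
```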


See also




AMT

See also: