Electron Microscopy file format notes

From Helpful
Revision as of 10:01, 4 July 2024 by Helpful (talk | contribs) (→‎STAR and CIF files)

MRC files

MRC as in the image format used in crystallography, EM, and such.


General structure:

  • a 1024-byte header
    • 224 bytes in variables
    • 800 bytes in ten 80-byte labels (how many are used is set by one of the variables, though in most cases unused labels are zeroed out, so you can also just look at the contents)
  • optional additional header section(s). Their size and presence are implied by the 1024-byte fixed header.
    • the format of this section varies with the exact variant
  • data for image(s)
    • an uncompressed array
    • endianness, pixel format, size, and row/column ordering are determined from the header


Note:

  • The bare minimum is basically the first four fields: dimensions in X, Y, Z, and the MODE (pixel data type).
  • There is a (recentish) convention that a file with the extension mrcs holds a stack rather than a volume (when NZ>1)
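
As a minimal sketch of that bare minimum: the following reads only the first four fields. It assumes they are little-endian int32 values in the first 16 bytes of the header; the function name is made up for illustration.

```python
import struct

def read_basic_fields(header: bytes) -> dict:
    """Return NX, NY, NZ and MODE from the start of an MRC main header.

    Assumes little-endian int32 values; a big-endian file needs ">4i" instead.
    """
    if len(header) < 16:
        raise ValueError("need at least the first 16 header bytes")
    nx, ny, nz, mode = struct.unpack_from("<4i", header, 0)
    return {"NX": nx, "NY": ny, "NZ": nz, "MODE": mode}
```

In use you would pass it the first 1024 bytes read from the file, e.g. `read_basic_fields(open(path, "rb").read(1024))`.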



It's hard to make a reader that takes even all the well-specified variants without some external knowledge. It's impossible to include everyone's conventions over the years.

So it's impossible to get out exactly what the original writer intended to put in there. Within the same software package, sure. Within the same field, sort of. In general, good luck. This is why I have some choice words for this whole family of formats: its members haven't really talked to each other for over thirty years.


On the other hand, most programs don't use most of the main header. If you use it as a dump-in-a-file format (Dimensions, pixel mode, data, implied endianness), that works pretty well.

And while the extended headers are varied, they are also your best hope for storing more interesting data in a structured way.


MRC main header variants

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.


MRC came from the MRC-LMB

The CCP4 used the format in a different way, so had its own conventions.

The two eventually merged into MRC2000 (around 1982(verify)) when agreements for compatibility were settled (verify)

(machine stamp settled somewhat later?)(verify).


MRC2000 seems to be the variant that most unnamed MRC variants tend to adhere to, unless you can detect it as a more specific format.

There's also naming confusion. For example, people also say CCP4 when they may be conflating it with MRC2000, in uses where the differences between their definitions (origin, skew transformation) do not conflict.


Extended headers are even more interesting -- more on that later.


Pre-MRC2000

Trying to deal with all possible historical variants will not make you overly happy.

Since MRC2000 and CCP4 cover most ground these days, you generally don't have to care about this anymore.


Programmers may still care to write code that will do something sensible enough for all old cases, without too much manual prompting, so these are some details to get you started


'old'

Mostly like the later Image2000 (which is now common), but instead of

xorg, yorg, zorg, cmap, stamp, and rms,

it has

nwave, wave1, wave2, wave3, wave4, wave5, wave6, zorg, xorg, yorg


imsubs

Derived from old-style MRC?(verify)



Two subtypes:

priism

Optical. (see [1])

Uses what later became MAP (and a few other things), so can be considered a separate thing.

http://msg.ucsf.edu/IVE/IVE-4-6-0-HTML/priism_mrc_header.html

CCP4

Basically refers to use of CCP4's maplib code, which works out as one particular pre-MRC2000 format.


CCP4 is sometimes listed alongside the MRC-LMB variant, when referring to the origins of these two formats.

MRC2000

The common-denominator format for a long while, particularly for raster image data.

At the time, it was a merger of the CCP4 and LMB variants.

More recently succeeded by MRC2014 (which mostly adds NVERSION, EXTTYPE).


Possibly not the same as image2000(verify), apparently named for its mention in imsubs2000(verify)



IMOD MRC

IMOD generally follows

  • an older style before IMOD 2.6.20(verify)
  • the MRC2000 header layout since IMOD 2.6.20


From MRC2000's view it also uses some EXTRA space. From IMOD's view it reduces the EXTRA space (further) to 5 values from byte 132 to 151.

Specifically, it uses:

  • characters in byte position 104..107: EXTTYPE
    • signals how to interpret the extended header data, one of at least SERI, FEI1, AGAR
    • you will often also need NINT and NREAL to interpret them
  • uint4 in byte position 152..155: IMODSTAMP
    • value 1146047817 (0x444F4D49) indicates an IMOD-type file, or other software that sets IMODFLAGS
  • uint4 in byte position 156..159: IMODFLAGS






UCSF MRC

Mostly MRC2000 style, adds some of its own (some from PRIISM?)

http://msg.ucsf.edu/IVE/IVE-4-6-0-HTML/em_mrc_header.html


Has the Agard-style extended header

FEI MRC

From what I've read so far, 'FEI' in this context can refer to two distinct adaptations:

  • a variant of MRC2000, similar to but distinct from the UCSF alterations
    • plus the UCSF-style FEI extended header (adds a few fields not in the UCSF extended header, doesn't change how it's parsed)


MRC2014

In practice a clarification of a number of MRC2000 details, plus some mild extensions.

More a compliance marker that varied files can declare they adhere to than a complete standard of its own.

For example, IMOD now follows MRC2014 (so in the sense of "meaning in main header", you can consider that the third version of IMOD's format: old, 2000, 2014).


MRC2014 addresses some extended uses by adopting IMOD's EXTTYPE (apparently adding its own types to it, MRCO and CCP4 - TODO: check their code to see what exactly each means). It does not seem to define the structure of any extended header, though the code the paper refers to does parse most of them.


Formalized in A Cheng (2015) "MRC2014: Extensions to the MRC format header for electron cryo-microscopy and tomography", or rather, http://www.ccpem.ac.uk/mrc_format/mrc2014.php

MRC extended header variants

Agard-style extended header
This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Used at least by

UCSFtomo/UCSFImage
FEI EPU, Xplore3D, Amira, Avizo

(Agard refers to David Agard)

The extended header bytes will contain NINT integers, then NREAL float32s, for each record.


FEI's specs and files in the wild seem to imply that for known-to-be-FEI files, you should always assume NINT=0 and NREAL=32 (even if not set), making for 128-byte records (32 float32 values). They also fix the number of these records at 1024 (records beyond NZ will be zeroed out), so the FEI extended header will basically always be 131072 bytes large.

Note that there is also a newer-style extended FEI header - see #FEI1 extended header below.


UCSF's records are (4*NINT + 4*NREAL) large and it seems there are always exactly NZ of them.

(I've seen(verify) much larger NREAL than seems useful, not sure what that is)

If the extended header size (according to NSYMBT) is exactly NX*NY*4 bytes larger than that, that's a float32 (Mode 5) gain reference image present within the extended header immediately after the just-mentioned records.


The meaning of a field seems to be settled purely by position: it is the n-th int or the n-th real in its respective set.


TODO: figure out the ints.


UCSF defines 13 for the reals:

  • alpha tilt
  • beta tilt
  • stage X position
  • stage Y position
  • stage Z position
  • image X shift
  • image Y shift
  • nominal defocus as read from microscope
  • exposure time
  • mean intensity value in image
  • orientation of the tilt axis
  • pixel size
  • magnification
  • ...and leaves 19 undefined


FEI adds three more:

  • value of high tension (volts)
  • binning used in acquisition
  • intended application defocus
  • ...leaving 16 undefined
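
A rough sketch of splitting such an extended header into records, assuming each record is NINT int32 values followed by NREAL float32 values (as described above) and that the reals come in the order listed. The key names and the function name are made up; only the order comes from the list.

```python
import struct

# Illustrative names for the first 13 reals, in the order listed above.
UCSF_REALS = [
    "alpha_tilt", "beta_tilt", "stage_x", "stage_y", "stage_z",
    "shift_x", "shift_y", "defocus", "exposure_time", "mean_intensity",
    "tilt_axis", "pixel_size", "magnification",
]

def parse_agard_records(ext: bytes, nint: int, nreal: int, count: int,
                        endian: str = "<") -> list:
    """Split an Agard-style extended header into per-image records."""
    record_size = 4 * nint + 4 * nreal
    fmt = f"{endian}{nint}i{nreal}f"
    records = []
    for i in range(count):
        values = struct.unpack_from(fmt, ext, i * record_size)
        ints, reals = values[:nint], values[nint:]
        # zip() stops at the shorter input, so small NREAL values are fine.
        named = dict(zip(UCSF_REALS, reals))
        records.append({"ints": ints, "reals": reals, "named": named})
    return records
```

For a known-to-be-FEI file you would call this with nint=0 and nreal=32, which matches the 1024 records of 128 bytes (131072 bytes) mentioned above.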



FEI1 extended header

e.g. MRC2014's testset has a file named fei-extended.mrc with extended type FEI1 that has NSYMBT of 786432 (exactly 768K).

That code (mrcfile) also has a data definition for it. Each record is currently 768 bytes (this number is also the first int32), specifically:

  <i4  Metadata size
  <i4  Metadata version
  <u4  Bitmask 1
  <f8  Timestamp
  S16  Microscope type
  S16  D-Number
  S16  Application
  S16  Application version
  <f8  HT
  <f8  Dose
  <f8  Alpha tilt
  <f8  Beta tilt
  <f8  X-Stage
  <f8  Y-Stage
  <f8  Z-Stage
  <f8  Tilt axis angle
  <f8  Dual axis rotation
  <f8  Pixel size X
  <f8  Pixel size Y
  S48  Unused Range
  <f8  Defocus
  <f8  STEM Defocus
  <f8  Applied defocus
  <i4  Instrument mode
  <i4  Projection mode
  S16  Objective lens mode
  S16  High magnification mode
  <i4  Probe mode
  <b1  EFTEM On
  <f8  Magnification
  <u4  Bitmask 2
  <f8  Camera length
  <i4  Spot index
  <f8  Illuminated area
  <f8  Intensity
  <f8  Convergence angle
  S16  Illumination mode
  <b1  Wide convergence angle range
  <b1  Slit inserted
  <f8  Slit width
  <f8  Acceleration voltage offset
  <f8  Drift tube voltage
  <f8  Energy shift
  <f8  Shift offset X
  <f8  Shift offset Y
  <f8  Shift X
  <f8  Shift Y
  <f8  Integration time
  <i4  Binning Width
  <i4  Binning Height
  S16  Camera name
  <i4  Readout area left
  <i4  Readout area top
  <i4  Readout area right
  <i4  Readout area bottom
  <b1  Ceta noise reduction
  <i4  Ceta frames summed
  <b1  Direct detector electron counting
  <b1  Direct detector align frames
  <i4  Camera param reserved 0
  <i4  Camera param reserved 1
  <i4  Camera param reserved 2
  <i4  Camera param reserved 3
  <u4  Bitmask 3
  <i4  Camera param reserved 4
  <i4  Camera param reserved 5
  <i4  Camera param reserved 6
  <i4  Camera param reserved 7
  <i4  Camera param reserved 8
  <i4  Camera param reserved 9
  <b1  Phase Plate
  S16  STEM Detector name
  <f8  Gain
  <f8  Offset
  <i4  STEM param reserved 0
  <i4  STEM param reserved 1
  <i4  STEM param reserved 2
  <i4  STEM param reserved 3
  <i4  STEM param reserved 4
  <f8  Dwell time
  <f8  Frame time
  <i4  Scan size left
  <i4  Scan size top
  <i4  Scan size right
  <i4  Scan size bottom
  <f8  Full scan FOV X
  <f8  Full scan FOV Y
 <c16  Element
  <f8  Energy interval lower
  <f8  Energy interval higher
  <i4  Method
  <b1  Is dose fraction
  <i4  Fraction number
  <i4  Start frame
  <i4  End frame
  S80  Input stack filename
  <u4  Bitmask 4
  <f8  Alpha tilt min
  <f8  Alpha tilt max
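
As a small sketch of reading such a record with only the standard library, the following unpacks just the first eight fields of the listing (two int32, one uint32, one float64, four 16-byte strings). The dictionary keys and function name are made up, and NUL padding of the strings is an assumption.

```python
import struct

# First eight fields of the listing above: <i4 <i4 <u4 <f8 and four S16 strings.
# "<" means little-endian without alignment padding, so this is 84 bytes.
FEI1_PREFIX = struct.Struct("<iiId16s16s16s16s")

def _text(raw: bytes) -> str:
    """Decode a fixed-width string field, assuming NUL padding."""
    return raw.split(b"\x00", 1)[0].decode("ascii", errors="replace")

def parse_fei1_prefix(record: bytes) -> dict:
    """Unpack the leading fields of one FEI1 extended header record."""
    (size, version, bitmask1, timestamp,
     scope, dnumber, app, app_version) = FEI1_PREFIX.unpack_from(record, 0)
    return {
        "metadata_size": size,
        "metadata_version": version,
        "bitmask_1": bitmask1,
        "timestamp": timestamp,
        "microscope_type": _text(scope),
        "d_number": _text(dnumber),
        "application": _text(app),
        "application_version": _text(app_version),
    }
```

Since the first int32 is the record size, it is probably wiser to step through the extended header by that value than by a hardcoded 768.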
SerialEM extended header
This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

"a series of short integers. The short integers are signed, except for piece coordinates."

"SerialEM stores a float as two shorts, s1 and s2, by: value = (sign of s1)*(|s1|*256 + (|s2| modulo 256)) * 2**((sign of s2) * (|s2|/256))"

The closest I've seen to a reference of what's in there is LoadExtraFromValues in KStoreADOC.cpp
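
The float encoding quoted above can be sketched as follows. This assumes that |s2|/256 means integer division and that a zero counts as positive; neither is spelled out in the quote.

```python
def serialem_float(s1: int, s2: int) -> float:
    """Decode a float stored as two signed shorts, per the formula quoted above."""
    sign1 = -1 if s1 < 0 else 1
    sign2 = -1 if s2 < 0 else 1
    # The low byte of |s2| extends the mantissa, the high byte is the exponent.
    mantissa = abs(s1) * 256 + (abs(s2) % 256)
    exponent = sign2 * (abs(s2) // 256)
    return sign1 * mantissa * 2.0 ** exponent
```

For example, s1=1 and s2=-2048 gives a mantissa of 256 and an exponent of -8, so the value is 1.0.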

On detecting variants

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.


Since IMOD and MRC2014 allow others' extended headers, you probably want to separate parsing of the main header and the extended header.


From what I can gather, the following is decent for modern variants:

if IMODSTAMP == 0x444F4D49

  mrctype = IMOD
  exttype = (take from EXTTYPE)

else if NSYMBT == (NZ*(4*NINT+4*NREAL)) + NX*NY*4

  mrctype = UCSF
  exttype = agard
  gainref = y

else if NSYMBT == (NZ*(4*NINT+4*NREAL))

  mrctype = UCSF
  exttype = agard
  gainref = n

else if NSYMBT == 131072

  mrctype = FEI
  exttype = agard

else

  mrctype = MRC2000
  exttype = none or unknown, depending on NSYMBT value


Also, if NVERSION == 20140

  MRC2014_compliance = y


Notes:

  • Not always reliable:
    • MAP
    • IMOD's EXTTYPE (verify)
    • assuming FEI to act exactly like agard (verify)
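
A rough Python version of that detection logic, as a sketch rather than a reference. It treats the extra NX*NY*4 bytes as meaning a gain reference is present, following the Agard section above. The following are assumptions not stated in these notes: NINT and NREAL are int16 at byte positions 128 and 130, NVERSION is an int32 at 108, and EXTTYPE is the four bytes at 104..107. The `records > 0` guard is my own addition, so that a file with no extended header is not reported as UCSF.

```python
import struct

IMOD_STAMP = 0x444F4D49  # 1146047817

def detect_variant(header: bytes, endian: str = "<") -> dict:
    """Guess main-header and extended-header flavour from a 1024-byte header."""
    nx, ny, nz = struct.unpack_from(endian + "3i", header, 0)
    (nsymbt,) = struct.unpack_from(endian + "I", header, 92)
    exttype = header[104:108].decode("ascii", "replace").rstrip("\x00 ")
    (nversion,) = struct.unpack_from(endian + "i", header, 108)   # assumed offset
    nint, nreal = struct.unpack_from(endian + "2h", header, 128)  # assumed offsets
    (imodstamp,) = struct.unpack_from(endian + "I", header, 152)

    records = nz * (4 * nint + 4 * nreal)
    out = {"mrctype": "MRC2000", "exttype": "unknown" if nsymbt else "none",
           "gainref": None, "mrc2014": nversion == 20140}
    if imodstamp == IMOD_STAMP:
        out.update(mrctype="IMOD", exttype=exttype or "unknown")
    elif records > 0 and nsymbt == records + nx * ny * 4:
        out.update(mrctype="UCSF", exttype="agard", gainref=True)
    elif records > 0 and nsymbt == records:
        out.update(mrctype="UCSF", exttype="agard", gainref=False)
    elif nsymbt == 131072:
        out.update(mrctype="FEI", exttype="agard")
    return out
```

Note that an FEI file that does set NINT=0, NREAL=32 and NZ=1024 would match the UCSF branch first; that ambiguity is inherent in the rules, not something this sketch resolves.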

On MODE

As far as I can tell:

Standard:

  • 0: signed int8 (see notes on signed/unsigned mode 0 below)
  • 1: signed int16
  • 2: float32 (IEEE-style)
  • 3: pairs of signed_int16 (complex, for FFTs)
  • 4: pairs of float32 (complex, for FFTs)

Additions:

  • IMOD has the basic five mentioned above, plus
    • has its own flag that lets 0 possibly mean unsigned int8 instead
    • 6: unsigned_int16
    • 16: RGB in 3*unsigned_int8
    • 101: 4-bit data
  • imsubs has the basic five mentioned above, plus
    • 5: signed_int16 (so identical to 1?(verify))
    • 6: unsigned_int16
    • 7: signed_int32
    • 101? Seems to be my misreading
  • image2000 has the basic five mentioned above, plus
    • 6 is unsigned_int16
  • FEI has the standard 0..4
    • Does not seem to mention changes (from what, though? presumably image2000?)
    • also seems to mention it's always unsigned_int16, and the mode is 6 in all EPU files I have seen so far
  • MRC2014
    • 6 is unsigned_int16


Notes:

  • Mode 6: useful and common enough to be considered standardish long before it was standardized
    • so supporting reading of mode 6 is a good idea these days
    • (you could, at write time, check whether your integer data is all in 0..32K and so could fit in the more standard mode 1, but there's not much point these days)
  • Mode 0:
    • was historically signed int8, and e.g. CCP4 stands by this
    • some programs might write unsigned int8 to it
      • the argument is roughly that detectors don't count negative events, so why limit yourself to half the range?
    • even if you get the other type, you could detect and often fix based on the distribution
    • IMOD has a flag that signals which one it's using
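
The mode list above can be turned into a small lookup, here used to compute how many data bytes to expect. This is a sketch: the sizes simply follow from the types listed, mode 101 (4-bit) is left out because its row packing is not covered in these notes, and the names are made up.

```python
# mode -> (description, bytes per pixel)
MODE_INFO = {
    0: ("int8 (uint8 for some writers)", 1),
    1: ("int16", 2),
    2: ("float32", 4),
    3: ("complex, 2 x int16", 4),
    4: ("complex, 2 x float32", 8),
    6: ("uint16", 2),
    16: ("RGB, 3 x uint8", 3),
}

def expected_data_bytes(nx: int, ny: int, nz: int, mode: int):
    """Return the size of the image data in bytes, or None for an unknown mode."""
    info = MODE_INFO.get(mode)
    if info is None:
        return None
    return nx * ny * nz * info[1]
```

Comparing this number against the file size minus the header sizes is a cheap sanity check on both the mode and the dimensions.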

On images versus stacks versus volumes versus volume stacks

There are some conventions, though far from everyone uses them.

Some of them are header-based (see below), others are file-extension based or a combination - e.g. "when NZ>1 and file extension is mrcs / st / vol".


MRC2014 proposes to make the header-based ones a little more formal.


From CCP4's description of MRC2000 (see e.g. [2]) and some programs, it seems that:

if ISPG=0 and MZ=1 and NZ=1:
  image
else if ISPG=0 and MZ=1 and NZ>1:
  image_stack
else if ISPG=401:
  volume_stack
else if ISPG=1 and NZ=MZ and MZ>0:
  volume
else
  dont_know
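
The same rules as a Python function, as a direct transcription of the pseudocode above (the function name and return strings are just the labels used there):

```python
def classify_contents(ispg: int, nz: int, mz: int) -> str:
    """Classify file contents from ISPG, NZ and MZ, per the rules above."""
    if ispg == 0 and mz == 1 and nz == 1:
        return "image"
    if ispg == 0 and mz == 1 and nz > 1:
        return "image_stack"
    if ispg == 401:
        return "volume_stack"
    if ispg == 1 and nz == mz and mz > 0:
        return "volume"
    return "dont_know"
```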


This does not consider movie-mode data, which in itself is okay since you can just call them stacks -- except in the case of movie-mode tilt series, e.g. as output by modern UCSFTomo. Here the details of the file are still implied by main+extended header values. You can e.g. tell

tilt series by changing tilt angle, and
movie-mode tilt series by having both changing tilt angle overall and having a bunch of adjacent ones have the same angle.


I really don't know about how much the real world follows this, and/or other conventions. e.g. one of MRC2014's example files has ISPG=4 NZ=25 MZ=72 and I have no idea what that means.

On NSYMBT and NEXT

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

tl;dr: A uint32 at byte position 92..95 (in zero-based counting) marks the amount of extra bytes between the basic 1024-byte header, and the start of image bytes

So if you read this value position-based, you're good for all formats(verify).
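
Position-based, that comes down to something like the following sketch (little-endian assumed by default; the function name is made up):

```python
import struct

def data_offset(header: bytes, endian: str = "<") -> int:
    """Return the file offset at which the image bytes start."""
    # uint32 at zero-based byte position 92..95: size of the extra section(s)
    (extra,) = struct.unpack_from(endian + "I", header, 92)
    return 1024 + extra
```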


If you do it name-based, there are some more notes.

For one, be careful of confusing NSYMBT and NEXT.

Its classical name, NSYMBT, refers to group symmetry records. The value would often be 0 (no extra data) or 80.

  • formats like MRC2000 and MRC2014 call it NSYMBT


Other formats named it NEXT, probably just to indicate it can hold other things

  • such as FEI, UCSF, IMOD
  • UCSF/FEI style seems to have the renamed NEXT at the usual position (uint4 at byte position 92..95), but also an NSYMBT (uint2 at byte position 90..91). TODO: figure out what's up with that. FEI suggests it will always be zero, so it can probably be safely ignored.


Some formats also signal what the extended header contains

  • such as IMOD and MRC2014 (sort of)

On EXTRA

EXTRA refers to the MRC variants saying some parts of the main header are undefined and can be used for user-defined use.

From the view of the widest definition of EXTRA fields (that in MRC2000), most other variants (CCP4, UCSF, IMOD, FEI, MRC2014) redefine some of these for themselves. Sometimes based on others, often not so much.


On MAP

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Helps signal what sort of contents the voxel data has

Values include

  • MAP various.
  •      (i.e. four spaces) various
  • MRC IMG/MRC(verify)
  • IMG IMG-MAP(verify)
  • MTZ contains structure factors, see [3] [4]


In the early days it was more indicative than it is now. e.g. CCP4 would often use MAP, while pre-2000 MRC and older IMOD(verify) would typically leave it empty (it was not defined there, falling in the EXTRA range). Modern IMOD does use it, presumably to comply with MRC2014.

On MACHST

In olden days, there were more standards in active use for the coding/serialization of integers and floating point numbers. MRC's machine stamp describes both endianness and this coding (presumably to allow architectures to save/load in their native way for speed, while still allowing interchange in a verifiably correct way).

Well, used to, maybe, before my time. I'm not sure I've ever seen a valid MACHST in any real-world file that I've dealt with in the last half decade.


If you care about compatibility with other endianness, you probably want your own detection

  • but since most things are x86, you can often get away with just loading what got dumped

Also, you can typically get away with assuming that

  • floats are IEEE style (correct for most or all modern MRC data)
  • image integers have the same endianness as they do in the header
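
One way to do your own detection is to ignore MACHST and check which byte order makes the first four fields plausible. This is a sketch under assumptions: the set of accepted modes is taken from the MODE section, and the 65535 limit on dimensions is an arbitrary choice of mine (a dimension that small, read in the wrong byte order, always comes out as 65536 or more, which is what makes the test unambiguous).

```python
import struct

KNOWN_MODES = {0, 1, 2, 3, 4, 6, 16, 101}

def guess_endianness(header: bytes, max_dim: int = 65535):
    """Return "<" or ">" for the byte order that makes the header plausible.

    Returns None if neither byte order gives sane dimensions and mode.
    """
    for endian in ("<", ">"):
        nx, ny, nz, mode = struct.unpack_from(endian + "4i", header, 0)
        dims_ok = all(0 < n <= max_dim for n in (nx, ny, nz))
        if dims_ok and mode in KNOWN_MODES:
            return endian
    return None
```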



On MAPC, MAPR, MAPS

tl;dr:

  • Files that use unusual ordering will often have these filled, and valid (actually a permutation of 1,2,3)
  • Not everyone uses them, or uses them correctly.
  • If not filled / not valid (0,0,0, 1,1,1, or such), good luck assuming / detecting.
  • If the sensor is square, the most likely mistake (swapping rows and columns) only amounts to a transpose (meaning a handedness change), so most people ignore this because they can get away with it
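
Checking whether the three values are usable is simple enough. A sketch, with a made-up function name:

```python
def axis_order(mapc: int, mapr: int, maps: int):
    """Return (MAPC, MAPR, MAPS) if it is a permutation of 1,2,3, else None."""
    order = (mapc, mapr, maps)
    return order if sorted(order) == [1, 2, 3] else None
```

A None result is the "good luck assuming / detecting" case; most readers then fall back to 1,2,3.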







STAR and CIF files

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.


STAR is Self-defining Text Archive and Retrieval


CIF is Crystallographic Information File, a restricted form of STAR that is (thereby) a little easier to use and parse.


Both are structured, but look like text files, and while you can sometimes get away with hackish parsing (e.g. when it stores only a flat table), proper interpretation needs a real parser (there is an EBNF spec).




SerialEM mdoc

These are IMOD autodoc files, which are "named sections and key-value pairs within sections"[5] (INI-like)

...with some further conventions like that

TODO: find decent summary/reference
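
A minimal parsing sketch for such files. It assumes section headers of the form `[Name = value]` and `Key = value` lines, with `#` starting a comment line; that layout is an assumption on my part and is not spelled out above, and the key names in the usage note are only illustrative. Values are kept as strings.

```python
def parse_autodoc(text: str):
    """Parse autodoc/mdoc-like text into (global_values, sections).

    sections is a list of (section_type, section_name, values) tuples.
    """
    global_values, sections = {}, []
    current = global_values
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("[") and line.endswith("]"):
            kind, _, name = line[1:-1].partition("=")
            current = {}
            sections.append((kind.strip(), name.strip(), current))
        elif "=" in line:
            key, _, value = line.partition("=")
            current[key.strip()] = value.strip()
    return global_values, sections
```

For a tilt series you would then typically walk the sections in order and convert the values you care about (e.g. a tilt angle) to float yourself.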


IMOD model files

IMOD model files are binary files that can contain contours, meshes, and surfaces, and a bunch of detail on how to draw them.

http://bio3d.colorado.edu/imod/betaDoc/binspec.html


There is also an ASCII variant, which allows only a subset but can be easier to show/transfer data.

http://bio3d.colorado.edu/imod/betaDoc/asciispec.html


Notes:

  • It seems that points (e.g. fiducials) are stored as length=1 contours
  • There are a bunch of model-file-related IMOD commands
see e.g. https://bio3d.colorado.edu/imod/doc/program_listing.html


EMI and SER

At acquire time, you usually get:

  • EMI, which has file metadata like microscope settings and usually also contains the image data
    • proprietary format: the file structure is not published, and only TIA can read it (TIA is based on ES Vision, acquired by FEI)
  • SER (a.k.a. "ES Series Data, ESD"(verify)), which contains the same image data (without that metadata)
    • an open format, so approximately a dozen software packages read it


What exactly TIA writes depends on what you were using it for.

Often it's emi+ser with all of the useful image data in the ser. In this case, you can convert the SER on any PC (and ignore the EMI).


In some cases you only get emi, and you want to convert it to something (e.g. tiff) while still within TIA (or later on a PC with TIA, but there are few of those that aren't microscope computers).

Note that TIA has a batch converter that you can run on your directory with data.


Software may make its own assumptions, and therefore not read all cases

  • in particular, some software does not understand the case of ser containing more than one image (stacks)
    • some may read only the first frame
    • or may read all but not have functionality to write them separately (some versions of TIA(verify))
    • or may make it hard (e.g. TIAReader, which seems to open SER(verify), opens a window for each frame, with no clue as to which frame is which)


Software you can consider includes:

EMI and SER:

  • TIA. It should be installed on your FEI support PCs.
You can do batch conversion of a bunch of files
ser stacks: can read, cannot export (verify)


And for SER:

ser stacks: works
There is also a macro you may care about called BatchConvertSer.ijm
ser stacks: opens many windows without a clue about ordering. Batch converter takes only first image(verify)
  • xmipp-image-convert (Xmipp)
ser stack: seems to work (though failed for my case due to a code bug)
  • em2em (IMAGIC)
basically, things like e2proc2d.py can read it
ser stacks: single image only? (verify)



Unsorted

mrcz

mrcz is compressed mrc.

It is a specific file structure based on the blosc library, allowing for clever adaptive choice of compression per block of data, and easier use of fast multicore compressors like lz4 and zstd.

While possibly a bit overdesigned, one of the neat ideas is that it allows fast decompression speeds at moderate compression rates, so that you can get read speed close to uncompressed speeds while saving considerable space.


Personal take:

For short-term processing, it's relevant that nothing much seems to read mrcz yet -- and that ZFS can give me lz4 fully transparent to programs (but I see that that's not an option to everyone, and meaningless for transfers)
For transfer, I use something (currently) easier to explain to everyone (e.g. gzip)
For long-term archiving (where CPU cost is one-time, so the cost comes from storage), you can use something that compresses better than lz4 (like bzip2 - though maximum-compression zstd seems to compress slightly better than it(verify))



Situs format

Apparently plain MRCs with extensions suggesting what they are and sometimes how they should be treated

vol

Spider volumes.


spe

Princeton Instruments CCD


stk

stack

st

stack


EMSA/MAS

http://www.iso.org/iso/catalogue_detail.htm?csnumber=56211

https://www.iso.org/obp/ui/#iso:std:iso:22029:ed-2:v1:en

http://journals.cambridge.org/action/displayFulltext?type=1&fid=128304&jid=MAM&volumeId=8&issueId=S02&aid=128303

Gatan

Some notes of its MRCs being IMOD-style(verify)


DM2

Written by Digital Micrograph versions 2.something. DM 2.5 changed its version tag from 200 to 250

DM3 and DM4

Are fairly similar in setup, though vary in sizes of some elements.


See also (DM)

EMX

Electron Microscopy Exchange, basically seems to be a standardization using XML to describe the most important acquisition-time parameters of CCP4-style MRC.

See also: