Electron Microscopy file format notes
MRC files
MRC as in the image format used in crystallography, EM, and such.
General structure:
- a 1024-byte header
- 224 bytes in variables
- 800 bytes in ten 80-byte labels (how many are used is set by one of the variables, though in most cases unused labels are zeroed out, so you can also just look at the contents)
- optional additional header section(s). Size and presence is implied in the 1024-byte fixed header.
- The format of this header varies with the exact variant
- data for image(s)
- uncompressed array
- endianness, pixel format, size, and row/column ordering are determined from the header
Note:
- The bare minimum is basically the first four fields: dimensions in X, Y, Z, and the MODE (pixel data type).
- There is a (recentish) convention that if the file extension is mrcs, the file is a stack rather than a volume (when NZ>1)
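As a rough sketch of how little is needed for that bare minimum, the following reads just those four fields. It assumes they are four consecutive int32 values at the very start of the file and that you already know the byte order; read_minimal_header is a name made up for illustration.

```python
import struct

def read_minimal_header(header_bytes, byteorder="<"):
    """Return (nx, ny, nz, mode) from the first 16 bytes of an MRC header.

    Assumes four consecutive int32 values; byteorder is '<' or '>'.
    """
    if len(header_bytes) < 16:
        raise ValueError("need at least 16 bytes of header")
    return struct.unpack_from(byteorder + "4i", header_bytes, 0)
```

Typical use would be to read the first 1024 bytes of a file opened in binary mode and pass them in.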
It's hard to make a reader that takes even all the well-specified variants without some external knowledge. It's impossible to include everyone's conventions over the years.
So it's impossible to take out exactly what the original writer intended to put in there. Within the same software package, sure. Within the same field, sort of. In general, good luck. This is why I have some choice words for this whole family of formats: as a family, it hasn't really talked to each other for over thirty years.
On the other hand, most programs don't use most of the main header.
If you use it as a dump-in-a-file format (Dimensions, pixel mode, data, implied endianness),
that works pretty well.
And while the extended headers are varied, they are also your best hope for storing more interesting data in a structured way.
MRC main header variants
MRC came from the MRC-LMB
The CCP4 used the format in a different way, so had its own conventions.
The two eventually merged into MRC2000 (around 1982(verify)) when agreements for compatibility were settled (verify)
- (machine stamp settled somewhat later?)(verify).
MRC2000 seems to be the variant that most unnamed MRC variants tend to adhere to,
unless you can detect it as a more specific format.
There's also naming confusion. For example, people also say CCP4, though they may be conflating it with MRC2000 in uses where the differences in definition (origin, skew transformation) do not conflict.
Extended headers are even more interesting -- more on that later.
Pre-MRC2000
Trying to deal with all possible historical variants will not make you overly happy.
Since MRC2000 and CCP4 cover most ground these days, you generally don't have to care about this anymore.
Programmers may still care to write code that will do something sensible enough for all old cases,
without too much manual prompting, so these are some details to get you started
'old'
Mostly like the later Image2000 (which is now common), but instead of
xorg, yorg, zorg, cmap, stamp, and rms,
it has
nwave, wave1, wave2, wave3, wave4, wave5, wave6, zorg, xorg, yorg
imsubs
Derived from old-style MRC?(verify)
Two subtypes:
- optical, seems closer to old-style MRC (verify)
- EM, seems close to mrc2000 (verify)
priism
Optical. (see [1])
Uses what later became MAP (and a few other things), so can be considered a separate thing.
http://msg.ucsf.edu/IVE/IVE-4-6-0-HTML/priism_mrc_header.html
CCP4
Basically refers to use of CCP4's maplib code, which works out as one particular pre-MRC2000 format.
CCP4 is sometimes listed alongside the MRC-LMB variant, when referring to the origins of these two formats.
MRC2000
The common-denominator format for a long while, particularly for raster image data.
At the time, it was a merger of the CCP4 and LMB variants.
More recently succeeded by MRC2014 (which mostly adds NVERSION, EXTTYPE).
Possibly not the same as image2000(verify), apparently named for its mention in imsubs2000(verify)
See also:
- http://www2.mrc-lmb.cam.ac.uk/image2000.html
- http://blake.bcm.edu/eman2/doxygen_html/structEMAN_1_1MrcIO_1_1MrcHeader.html
- http://www.2dx.unibas.ch/documentation/mrc-software/mrc-documentation/announement-image2000
- http://www.2dx.unibas.ch/documentation/mrc-software/mrc-documentation/image.doc (verify)
IMOD MRC
IMOD generally follows
- older style before IMOD 2.6.20(verify)
- MRC2000 header layout since IMOD 2.6.20,
From MRC2000's view it also uses some EXTRA space.
From IMOD's view it reduces the EXTRA space (further) to 5 values from byte 132 to 151.
Specifically, it uses:
- four characters in byte position 104..107: EXTTYPE
- signals how to interpret the extended header data, one of at least SERI, FEI1, AGAR.
- You will often also need NINT and NREAL to interpret them
- uint4 in byte position 152..155: IMODSTAMP
- value 1146047817 (0x444F4D49) indicates IMOD type file, or other software that sets IMODFLAGS
- uint4 in byte position 156..159: IMODFLAGS
UCSF MRC
Mostly MRC2000 style, adds some of its own (some from PRIISM?)
http://msg.ucsf.edu/IVE/IVE-4-6-0-HTML/em_mrc_header.html
Has the Agard-style extended header
FEI MRC
From what I've read so far, 'FEI' in this context can refer to two distinct adaptations:
- variant of MRC2000 MRC, similar to but distinct from the UCSF alterations
- plus the UCSF-style FEI extended header (adds a few fields not in the UCSF extended header, doesn't change how it's parsed)
- IMOD MRC file with the Agard/FEI-style extended header(verify)
MRC2014
In practice a clarification of a number of MRC2000 details, plus some mild extensions.
More of a compliance thing that varied files can mark they adhere to than its own complete standard.
For example, IMOD now follows MRC2014 (so in the sense of "meaning in main header", you can consider that the third version of IMOD's format: old, 2000, 2014).
MRC2014 addresses some extended uses by adopting IMOD's EXTTYPE (apparently adding its own types to it, MRCO and CCP4 - TODO: check their code to see what exactly each means).
It does not seem to define the structure of any extended header, though the code the paper refers to does parse most of them.
Formalized in A Cheng (2015) "MRC2014: Extensions to the MRC format header for electron cryo-microscopy and tomography", or rather, http://www.ccpem.ac.uk/mrc_format/mrc2014.php
MRC extended header variants
Agard-style extended header
Used at least by
- UCSFtomo/UCSFImage
- FEI EPU, Xplore3D, Amira, Avizo
(Agard refers to David Agard)
Each record in the extended header contains NINT integers, then NREAL float32 values.
FEI's specs and files in the wild seem to imply that for known-to-be-FEI files, you should always assume NINT=0 and NREAL=32 (even if not set), making for 128-byte records (32 float32 values),
and also fixes the amount of these records at 1024 (records >NZ will be zeroed out),
so the FEI extended header will basically always be 131072 bytes large.
Note that there is also a newer-style extended FEI header - see #FEI1 extended header below.
UCSF's records are (4*NINT + 4*NREAL) large and it seems there are always exactly NZ of them.
- (I've seen(verify) much larger NREAL than seems useful, not sure what that is)
If the extended header size (according to NSYMBT) is exactly NX*NY*4 bytes larger than that, there is a float32 gain reference image (same pixel type as MODE 2) within the extended header, immediately after the just-mentioned records.
The meaning of a field seems to be settled purely by position: it is the n-th int or the n-th real in its respective set.
TODO: figure out the ints.
UCSF defines 13 for the reals:
- alpha tilt
- beta tilt
- stage X position
- stage Y position
- stage Z position
- image X shift
- image Y shift
- nominal defocus as read from microscope
- exposure time
- mean intensity value in image
- orientation of the tilt axis
- pixel size
- magnification
- ...and leaves 19 undefined
FEI adds three more:
- value of high tension (volts)
- binning used in acquisition
- intended application defocus
- ...leaving 16 undefined
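A minimal sketch of splitting such an extended header into records, assuming 4-byte ints and float32s as described above, and assuming the reals appear in the order of the two lists above. The function name and the short field labels are made up for illustration; NINT, NREAL and the record count are passed in rather than read from the main header.

```python
import struct

# Labels follow the lists above: 13 reals from UCSF, plus 3 added by FEI.
REAL_NAMES = [
    "alpha_tilt", "beta_tilt", "stage_x", "stage_y", "stage_z",
    "image_shift_x", "image_shift_y", "defocus", "exposure_time",
    "mean_intensity", "tilt_axis", "pixel_size", "magnification",
    "high_tension", "binning", "applied_defocus",
]

def parse_agard_records(ext, count, nint, nreal, byteorder="<"):
    """Split extended-header bytes into per-image records."""
    fmt = byteorder + "i" * nint + "f" * nreal
    size = struct.calcsize(fmt)  # 4*nint + 4*nreal, no padding
    records = []
    for i in range(count):
        values = struct.unpack_from(fmt, ext, i * size)
        ints, reals = values[:nint], values[nint:]
        # Only the first len(REAL_NAMES) reals get a label; the rest stay unnamed.
        named = dict(zip(REAL_NAMES, reals))
        records.append({"ints": ints, "reals": reals, "named": named})
    return records
```

For a known-to-be-FEI file you would call this with nint=0, nreal=32 and count set to NZ (the remaining records up to 1024 should be zeroed).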
See also:
- http://msg.ucsf.edu/IVE/IVE-4-6-0-HTML/em_mrc_header.html
- http://msg.ucsf.edu/IVE/IVE4_HTML/UCSFTomographyExtendedHeader.html
- http://www.2dx.unibas.ch/documentation/mrc-software/fei-extended-mrc-format-not-used-by-2dx
- http://blake.bcm.edu/eman2/doxygen_html/structEMAN_1_1MrcIO_1_1FeiMrcHeader.html
- http://blake.bcm.edu/eman2/doxygen_html/structEMAN_1_1MrcIO_1_1FeiMrcExtHeader.html
- http://blake.bcm.edu/eman2/doxygen_html/mrcio_8h_source.html
FEI1 extended header
e.g. MRC2014's testset has a file named fei-extended.mrc with extended type FEI1 that has NSYMBT of 786432 (exactly 768K).
That code (mrcfile) also has a data definition for it. Each record is currently 768 bytes (this number is also the first int32), specifically:
- <i4 Metadata size
- <i4 Metadata version
- <u4 Bitmask 1
- <f8 Timestamp
- S16 Microscope type
- S16 D-Number
- S16 Application
- S16 Application version
- <f8 HT
- <f8 Dose
- <f8 Alpha tilt
- <f8 Beta tilt
- <f8 X-Stage
- <f8 Y-Stage
- <f8 Z-Stage
- <f8 Tilt axis angle
- <f8 Dual axis rotation
- <f8 Pixel size X
- <f8 Pixel size Y
- S48 Unused Range
- <f8 Defocus
- <f8 STEM Defocus
- <f8 Applied defocus
- <i4 Instrument mode
- <i4 Projection mode
- S16 Objective lens mode
- S16 High magnification mode
- <i4 Probe mode
- <b1 EFTEM On
- <f8 Magnification
- <u4 Bitmask 2
- <f8 Camera length
- <i4 Spot index
- <f8 Illuminated area
- <f8 Intensity
- <f8 Convergence angle
- S16 Illumination mode
- <b1 Wide convergence angle range
- <b1 Slit inserted
- <f8 Slit width
- <f8 Acceleration voltage offset
- <f8 Drift tube voltage
- <f8 Energy shift
- <f8 Shift offset X
- <f8 Shift offset Y
- <f8 Shift X
- <f8 Shift Y
- <f8 Integration time
- <i4 Binning Width
- <i4 Binning Height
- S16 Camera name
- <i4 Readout area left
- <i4 Readout area top
- <i4 Readout area right
- <i4 Readout area bottom
- <b1 Ceta noise reduction
- <i4 Ceta frames summed
- <b1 Direct detector electron counting
- <b1 Direct detector align frames
- <i4 Camera param reserved 0
- <i4 Camera param reserved 1
- <i4 Camera param reserved 2
- <i4 Camera param reserved 3
- <u4 Bitmask 3
- <i4 Camera param reserved 4
- <i4 Camera param reserved 5
- <i4 Camera param reserved 6
- <i4 Camera param reserved 7
- <i4 Camera param reserved 8
- <i4 Camera param reserved 9
- <b1 Phase Plate
- S16 STEM Detector name
- <f8 Gain
- <f8 Offset
- <i4 STEM param reserved 0
- <i4 STEM param reserved 1
- <i4 STEM param reserved 2
- <i4 STEM param reserved 3
- <i4 STEM param reserved 4
- <f8 Dwell time
- <f8 Frame time
- <i4 Scan size left
- <i4 Scan size top
- <i4 Scan size right
- <i4 Scan size bottom
- <f8 Full scan FOV X
- <f8 Full scan FOV Y
- <c16 Element
- <f8 Energy interval lower
- <f8 Energy interval higher
- <i4 Method
- <b1 Is dose fraction
- <i4 Fraction number
- <i4 Start frame
- <i4 End frame
- S80 Input stack filename
- <u4 Bitmask 4
- <f8 Alpha tilt min
- <f8 Alpha tilt max
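(With 768-byte records, an NSYMBT of 786432 works out to 1024 records.) As a hedged sketch, the following unpacks only the first four fields of one record (metadata size, metadata version, bitmask 1, timestamp). It assumes the fields are tightly packed little-endian values, as the < prefixes in the listing suggest, and takes the record size from the first int32 as mentioned above. The function name is made up for illustration.

```python
import struct

def read_fei1_record_start(ext, record_index=0):
    """Unpack the leading fields of one FEI1 extended-header record.

    Field order is taken from the listing above: <i4, <i4, <u4, <f8.
    The record size is read from the first int32 of the extended header.
    """
    (record_size,) = struct.unpack_from("<i", ext, 0)
    offset = record_index * record_size
    size, version, bitmask1, timestamp = struct.unpack_from("<iiId", ext, offset)
    return {
        "metadata_size": size,
        "metadata_version": version,
        "bitmask_1": bitmask1,
        "timestamp": timestamp,
    }
```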
SerialEM extended header
"a series of short integers. The short integers are signed, except for piece coordinates."
"SerialEM stores a float as two shorts, s1 and s2, by: value = (sign of s1)*(|s1|*256 + (|s2| modulo 256)) * 2**((sign of s2) * (|s2|/256))"
The closest I've seen to a reference of what's in there is LoadExtraFromValues in KStoreADOC.cpp
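The quoted two-shorts formula can be sketched as below. Two things are my assumptions rather than part of the quote: that |s2|/256 means integer division, and that a value of zero counts as positive sign.

```python
def serialem_float(s1, s2):
    """Decode a SerialEM float stored as two signed shorts (quoted formula)."""
    sign1 = -1 if s1 < 0 else 1   # assumption: zero counts as positive
    sign2 = -1 if s2 < 0 else 1
    mantissa = abs(s1) * 256 + (abs(s2) % 256)
    exponent = sign2 * (abs(s2) // 256)  # assumption: integer division
    return sign1 * mantissa * 2.0 ** exponent
```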
On detecting variants
Since IMOD and MRC2014 allow others' extended headers, you probably want to separate parsing of the main header and the extended header.
From what I can gather, the following is decent for modern variants:
if IMODSTAMP == 0x444F4D49
- mrctype = IMOD
- exttype = (take from EXTTYPE)
else if NSYMBT == (NZ*(4*NINT+4*NREAL)) + NX*NY*4
- mrctype = UCSF
- exttype = agard
- gainref = y
else if NSYMBT == (NZ*(4*NINT+4*NREAL))
- mrctype = UCSF
- exttype = agard
- gainref = n
else if NSYMBT == 131072
- mrctype = FEI
- exttype = agard
else
- mrctype = MRC2000
- exttype = none or unknown, depending on NSYMBT value
Also, if NVERSION == 20140
- MRC2014_compliance = y
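A minimal Python sketch of that decision list. It takes already-parsed header values as arguments, and follows the gain-reference rule from the Agard section (an extra NX*NY*4 bytes after the records means a gain reference is present). The guard that skips the UCSF checks when the record size works out to zero is my own addition, so that a file with NSYMBT=0 is not mistaken for UCSF; detect_variant is a made-up name.

```python
IMOD_STAMP = 0x444F4D49  # 1146047817

def detect_variant(nx, ny, nz, nint, nreal, nsymbt,
                   imodstamp=0, exttype="", nversion=0):
    """Guess (mrctype, exttype, gainref, mrc2014) from main-header values."""
    records = nz * (4 * nint + 4 * nreal)
    gainref = None  # None means "not applicable / unknown"
    if imodstamp == IMOD_STAMP:
        mrctype, ext = "IMOD", exttype
    elif records > 0 and nsymbt == records + nx * ny * 4:
        mrctype, ext, gainref = "UCSF", "agard", True
    elif records > 0 and nsymbt == records:
        mrctype, ext, gainref = "UCSF", "agard", False
    elif nsymbt == 131072:
        mrctype, ext = "FEI", "agard"
    else:
        mrctype, ext = "MRC2000", ("none" if nsymbt == 0 else "unknown")
    return mrctype, ext, gainref, nversion == 20140
```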
Notes:
- Not always reliable:
On MODE
As far as I can tell:
Standard:
- 0: signed int8 (see notes on signed/unsigned mode 0 below)
- 1: signed int16
- 2: float32 (IEEE-style)
- 3: pairs of signed_int16 (complex, for FFTs)
- 4: pairs of float32 (complex, for FFTs)
Additions:
- IMOD has the basic five mentioned above, plus
- has its own flag that lets 0 possibly mean unsigned int8 instead
- 6: unsigned_int16
- 16: RGB in 3*unsigned_int8
- 101: 4-bit data
- imsubs has the basic five mentioned above, plus
- 5: signed_int16 (so identical to 1?(verify))
- 6: unsigned_int16
- 7: signed_int32
- 101? Seems to be my misreading
- image2000 has the basic five mentioned above, plus
- 6 is unsigned_int16
- FEI has the standard 0..4
- Does not seem to mention changes (from what, though? presumably image2000?)
- also seems to mention it's always unsigned_int16, and the mode is 6 in all EPU files I have seen so far
- MRC2014
- 6 is unsigned_int16
Notes:
- Mode 6: useful and common enough to consider standardish for long before it was standardized
- so supporting reading of mode 6 is a good idea these days.
- (you could, at write time, check whether your integer data is all in 0..32767 and so could fit in the more standard mode 1, but there's not much point these days).
- Mode 0:
- was historically signed int8, and e.g. CCP4 stands by this.
- some programs might write unsigned int8 to it
- the argument is roughly that detectors don't count negative events, so why limit yourself to half the range?
- even if you get the other type, you could detect and often fix based on the distribution
- IMOD has a flag that signals which one it's using
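As a sketch, the commonly supported modes (the standard five plus mode 6) can be kept in a small table that also gives the expected size of the image data. The table layout and names are made up for illustration; the variant-specific modes (5, 7, 16, 101) are left out on purpose.

```python
# mode -> (struct format for one sample, bytes per sample)
MODE_FORMATS = {
    0: ("b", 1),    # signed int8 (some writers store unsigned here)
    1: ("h", 2),    # signed int16
    2: ("f", 4),    # float32
    3: ("hh", 4),   # complex, pair of signed int16
    4: ("ff", 8),   # complex, pair of float32
    6: ("H", 2),    # unsigned int16
}

def expected_data_size(nx, ny, nz, mode):
    """Bytes of image data implied by the dimensions and MODE."""
    try:
        _, sample_bytes = MODE_FORMATS[mode]
    except KeyError:
        raise ValueError(f"unsupported MODE {mode}") from None
    return nx * ny * nz * sample_bytes
```

Comparing this number against the file size minus 1024 minus the extended header size is a cheap sanity check on a header you don't fully trust.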
On images versus stacks versus volumes versus volume stacks
There are some conventions, which far from everyone uses.
Some of them are header-based (see below), others are file-extension based or a combination - e.g. "when NZ>1 and file extension is mrcs / st / vol".
MRC2014 proposes to make the header-based ones a little more formal.
From CCP4's description of MRC2000 (see e.g. [2]) and some programs, it seems that:
- if ISPG=0 and MZ=1 and NZ=1:
- image
- else if ISPG=0 and MZ=1 and NZ>1:
- image_stack
- else if ISPG=401:
- volume_stack
- else if ISPG=1 and NZ=MZ and MZ>0:
- volume
- else
- dont_know
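The rules above translate directly into a small function. This is only a sketch of that list, taking already-parsed ISPG, NZ and MZ values; the function name is made up.

```python
def classify_contents(ispg, nz, mz):
    """Classify file contents from ISPG, NZ and MZ, following the list above."""
    if ispg == 0 and mz == 1 and nz == 1:
        return "image"
    if ispg == 0 and mz == 1 and nz > 1:
        return "image_stack"
    if ispg == 401:
        return "volume_stack"
    if ispg == 1 and nz == mz and mz > 0:
        return "volume"
    return "dont_know"
```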
This does not consider movie-mode data, which in itself is okay since you can just call them stacks -- except in the case of movie-mode tilt series, e.g. as output by modern UCSFTomo. Here the details of the file are still implied by main+extended header values.
You can e.g. tell
- tilt series by changing tilt angle, and
- movie-mode tilt series by having both changing tilt angle overall and having a bunch of adjacent ones have the same angle.
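A rough sketch of that heuristic, assuming you already have one tilt angle per image (for example the alpha tilt from each extended-header record). The tolerance and the function name are my own choices, not part of any convention.

```python
from itertools import groupby

def classify_tilt_angles(angles, tolerance=0.01):
    """Classify a per-image list of tilt angles (degrees)."""
    # Quantize to the tolerance so near-identical angles group together.
    keys = [round(a / tolerance) for a in angles]
    run_lengths = [len(list(group)) for _, group in groupby(keys)]
    if len(run_lengths) <= 1:
        return "stack"                    # angle never changes
    if any(n > 1 for n in run_lengths):
        return "movie_mode_tilt_series"   # adjacent images share an angle
    return "tilt_series"
```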
I really don't know about how much the real world follows this, and/or other conventions.
e.g. one of MRC2014's example files has ISPG=4 NZ=25 MZ=72 and I have no idea what that means.
On NSYMBT and NEXT
tl;dr: A uint32 at byte position 92..95 (in zero-based counting) marks the amount of extra bytes between the basic 1024-byte header, and the start of image bytes
So if you read this value position-based, you're good for all formats(verify).
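Position-based reading can be sketched like this, assuming you already know the byte order. data_offset is a made-up name.

```python
import struct

def data_offset(header_bytes, byteorder="<"):
    """Byte offset at which image data starts: 1024 + extended header size."""
    (ext_size,) = struct.unpack_from(byteorder + "I", header_bytes, 92)
    return 1024 + ext_size
```

For a file with the 131072-byte FEI extended header this gives 132096.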
If you do it name-based, there are some more notes.
For one, be careful of confusing NSYMBT and NEXT.
Its classical name, NSYMBT, refers to group symmetry records. The value would often be 0 (no extra data) or 80.
- formats like MRC2000, and MRC2014 call it NSYMBT
Other formats named it NEXT, probably just to indicate it can be other things
- such as FEI, UCSF, IMOD
- UCSF/FEI style seem to have the renamed-NEXT at the usual position (uint4 at byte position 92..95), but also a NSYMBT (uint2 at byte position 90..91). TODO: figure out what's up with that. FEI suggests it will always be zero, suggesting it can be safely ignored.
Some formats also signal what the extended header contains.
- such as IMOD and MRC2014 (sort of)
On EXTRA
EXTRA refers to the MRC variants saying some parts of the main header are undefined and can be used for user-defined use.
From the view of the widest definition of EXTRA fields (that in MRC2000), most other variants (CCP4, UCSF, IMOD, FEI, MRC2014) redefine some of these for themselves. Sometimes based on others, often not so much.
On MAP
Helps signal what sort of contents the voxel data has
Values include
- MAP various.
- (i.e. four spaces) various
- MRC IMG/MRC(verify)
- IMG IMG-MAP(verify)
- MTZ contains structure factors, see [3] [4]
In the early days it was more indicative than it is now.
e.g. CCP4 would often use MAP , while pre-2000 MRC typically would leave it empty (not being defined there, falling in its EXTRA range), as would older IMOD(verify) (though modern IMOD uses it, presumably to comply with MRC2014)
On MACHST
In olden days, there were more actively used standards around for the coding/serialization of integers and floating point numbers. MRC mentions both endianness and this coding (presumably to allow architectures to save/load their native way for speed, while also allowing interchange in a verifiably correct way).
Well, used to, maybe, before my time. I'm not sure I've ever seen a valid MACHST in any real-world file that I've dealt with in the last half decade.
If you care about compatibility with other endianness, you probably want your own detection
- but since most things are x86, you can often get away with just loading what got dumped
Also, you can typically get away with
- assuming floats are IEEE style (correct for most or all modern MRC data)
- image integers have the same endianness as they do in the header
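One way to do your own detection, as a sketch: try both byte orders on the first four fields and keep the one that gives plausible values. The plausibility limits (dimensions between 1 and 65535, MODE in a small known set) are my assumptions, not anything the format promises.

```python
import struct

KNOWN_MODES = {0, 1, 2, 3, 4, 6}

def guess_byteorder(header_bytes, max_dim=65535):
    """Return '<' or '>' depending on which gives plausible NX, NY, NZ, MODE.

    Returns None if neither byte order looks sane.
    """
    for order in ("<", ">"):
        nx, ny, nz, mode = struct.unpack_from(order + "4i", header_bytes, 0)
        if all(0 < n <= max_dim for n in (nx, ny, nz)) and mode in KNOWN_MODES:
            return order
    return None
```

This works because a small positive int32 read with the wrong byte order comes out as 65536 or larger.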
On MAPC, MAPR, MAPS
tl;dr:
- Files that use unusual ordering will often have these filled, and valid (actually a permutation of 1,2,3)
- Not everyone uses it, or uses it correctly.
- if not filled / not valid (0,0,0, 1,1,1, or such), good luck assuming / detecting.
- If sensor is square, getting it wrong only amounts to a transpose (meaning a handedness change) so most people ignore this because they can get away with it
You can hope your detector is square, so that the most likely mistake (swapping rows and columns) happens to only be a transpose.
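Checking whether the three values are usable can be sketched as follows. It assumes MAPC, MAPR and MAPS are three consecutive int32 values at byte offsets 64..75, as in the MRC2000/MRC2014 main header layout; the function name is made up.

```python
import struct

def read_axis_order(header_bytes, byteorder="<"):
    """Return (mapc, mapr, maps), or None if not a permutation of 1, 2, 3."""
    order = struct.unpack_from(byteorder + "3i", header_bytes, 64)
    if sorted(order) != [1, 2, 3]:
        return None  # not filled or not valid: caller has to assume or detect
    return order
```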
See also
What you can find. e.g.
STAR and CIF files
STAR is Self Defining Text Archival and Retrieval
CIF is Crystallographic Information Format, a restricted form of STAR that is (thereby) a little easier to use and parse.
Both are structured, but look like text files, and while you can sometimes get away with hackish parsing (e.g. when it stores only a flat table), proper interpretation needs a real parser (there is an EBNF spec).
See also:
- Hall, Sydney R. (1991), The STAR file: a new format for electronic data transfer and archiving, J. Chem. Inf. Comput. Sci., 31, 326-333
- Hall, Sydney R. & Spadaccini, N. (1994), The STAR File: detailed specifications, J. Chem. Inf. Comput. Sci., 34, 505-508
SerialEM mdoc
These are IMOD autodoc files, which are "named sections and key-value pairs within sections"[5] (INI-like)
...with some further conventions like that
TODO: find decent summary/reference
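In the meantime, a minimal sketch of a parser. It assumes section headers look like [Type = name], that everything else is key = value lines, and that lines starting with # are comments. All values are kept as strings, and parse_autodoc is a made-up name.

```python
def parse_autodoc(text):
    """Parse autodoc-style text into (global_values, sections).

    sections is a list of (section_type, section_name, values) tuples,
    in file order. Key-value pairs before the first section are global.
    """
    global_values, sections = {}, []
    current = global_values
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("[") and line.endswith("]"):
            kind, _, name = line[1:-1].partition("=")
            current = {}
            sections.append((kind.strip(), name.strip(), current))
        elif "=" in line:
            key, _, value = line.partition("=")
            current[key.strip()] = value.strip()
    return global_values, sections
```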
IMOD model files
IMOD model files are binary files that can contain contours, meshes, and surfaces, and a bunch of detail on how to draw them.
http://bio3d.colorado.edu/imod/betaDoc/binspec.html
There is also an ASCII variant, which allows only a subset but can be easier to show/transfer data.
http://bio3d.colorado.edu/imod/betaDoc/asciispec.html
Notes:
- It seems that points (e.g. fiducials) are stored as length=1 contours
- There are a bunch of model-file-related IMOD commands
EMI and SER
At acquire time, you usually get:
- EMI, which has file metadata like microscope settings and usually also contains the image data
- proprietary format: file structure is not published, only TIA can read it (based on ES Vision, acquired by FEI)
- SER (a.k.a. "ES Series Data, ESD"(verify)), contains the same image data (without that metadata)
- and is an open format so approx a dozen software packages read it
What exactly TIA writes depends on what you were using it for.
Often it's emi+ser with all of the useful image data in the ser. In this case, you can convert the SER on any PC (and ignore the EMI).
In some cases you only get emi, and you want to convert it to something (e.g. tiff) while still within TIA (or later on a PC with TIA, but there are few of those that aren't microscope computers).
Note that TIA has a batch converter that you can run on your directory with data.
Software may make its own assumptions, and therefore not read all cases
- in particular, some software does not understand the case of ser containing more than one image (stacks)
- some may read only the first frame
- or may read all but not have functionality to write them separately (some versions of TIA(verify))
- or may make it hard (e.g. TIAReader, which seems to open SER(verify), opens a window for each, with no clue as to which is which frame)
Software you can consider includes:
EMI and SER:
- TIA. It should be installed on your FEI support PCs.
- You can do batch conversion of a bunch of files
- ser stacks: can read, cannot export (verify)
And for SER:
- ser stacks: works
- ImageJ / Fiji[7] with TIA Reader plugin
- There is also a macro you may care about called BatchConvertSer.ijm
- ser stacks: opens many windows without a clue about ordering. Batch converter takes only first image(verify)
- xmipp-image-convert (Xmipp)
- ser stack: seems to work (though failed for my case due to a code bug)
- em2em (IMAGIC)
- basically, things like e2proc2d.py can read it
- ser stacks: single image only? (verify)
- EMAN2 (verify)
- IMOD (verify)
Unsorted
mrcz
mrcz is compressed mrc.
It is a specific file structure based on the blosc library, allowing for clever adaptive choice of compression per block of data, and easier use of fast multicore compressors like lz4 and zstd.
While possibly a bit overdesigned, one of the neat ideas is that it allows fast decompression speeds at moderate compression rates, so that you can get read speed close to uncompressed speeds while saving considerable space.
Personal take:
- For short-term processing, it's relevant that nothing much seems to read mrcz yet -- and that ZFS can give me lz4 fully transparently to programs (but I see that that's not an option for everyone, and meaningless for transfers)
- For transfer, I use something (currently) easier to explain to everyone (e.g. gzip)
- For long-term archiving (where CPU cost is one-time, so the cost comes from storage), you can use something that compresses better than lz4 (like bzip2 - though maximum-compression zstd seems to compress slightly better than it(verify))
See also:
- R A McLeod et al. (2018) "MRCZ – A file format for cryo-TEM data with fast compression"
Situs format
Apparently plain MRCs with extensions suggesting what they are and sometimes how they should be treated
vol
Spider volumes.
spe
Princeton Instruments CCD
stk
stack
st
stack
EMSA/MAS
http://www.iso.org/iso/catalogue_detail.htm?csnumber=56211
https://www.iso.org/obp/ui/#iso:std:iso:22029:ed-2:v1:en
Gatan
Some notes of its MRCs being IMOD-style(verify)
DM2
Written by Digital Micrograph versions 2.something. DM 2.5 changed its version tag from 200 to 250
DM3 and DM4
Are fairly similar in setup, though vary in sizes of some elements.
See also (DM)
EMX
Electron Microscopy Exchange, basically seems to be a standardization using XML to describe the most important acquisition-time parameters of CCP4-style MRC.