EM file format notes

Notes related to Electron Microscopy


MRC files

MRC here means the image format used in crystallography, EM, and such.


General structure:

  • a 1024-byte header
    • 224 bytes in variables
    • 800 bytes in ten 80-byte labels (how many are used is set by one of the variables, though in most cases unused labels are zeroed out, so you can also just look at the contents)
  • optional additional header section(s). Size and presence are implied in the 1024-byte fixed header.
    • The format of this header varies with the exact variant
  • data for image(s)
    • uncompressed array
    • endianness, pixel format, size, and row/column ordering are determined from the header


Note:

  • The bare minimum is basically the first four fields: dimensions in X, Y, Z, and the MODE (pixel data type).
  • There is a (recentish) convention that if the file extension is mrcs, this is a stack rather than a volume (when NZ>1)
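
As a minimal sketch of the "bare minimum" reading described above: the function below pulls the first four int32 fields and the text labels out of a 1024-byte header. The function name and the returned dictionary keys are made up for illustration. It assumes you already know the endianness (little-endian by default), and that the label count (NLABL) sits in the last int32 of the 224 bytes of variables, i.e. at byte 220, as in MRC2000/MRC2014.

```python
import struct

HEADER_SIZE = 1024

def read_minimal_header(header, endian="<"):
    """Parse dimensions, MODE and the text labels from a 1024-byte MRC header."""
    if len(header) < HEADER_SIZE:
        raise ValueError("need the full 1024-byte header")
    # The first four int32 values are NX, NY, NZ and MODE.
    nx, ny, nz, mode = struct.unpack_from(endian + "4i", header, 0)
    # Assumption: NLABL is the int32 at byte 220, just before the labels.
    (nlabl,) = struct.unpack_from(endian + "i", header, 220)
    labels = []
    for i in range(10):
        raw = header[224 + 80 * i : 224 + 80 * (i + 1)]
        # Unused labels are usually zeroed, so just look at the contents.
        text = raw.rstrip(b"\x00 ").decode("ascii", errors="replace")
        if text:
            labels.append(text)
    return {"nx": nx, "ny": ny, "nz": nz, "mode": mode,
            "nlabl": nlabl, "labels": labels}
```

Typical use would be something like `read_minimal_header(open(path, "rb").read(1024))`. Everything else in the header is ignored here on purpose.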



It's hard to make a reader that takes even all the well-specified variants without some external knowledge. It's impossible to include everyone's conventions over the years.

So it's impossible to extract exactly what the original writer intended to put in there. Within the same software package, sure. Within the same field, sort of. In general, good luck. This is why I have some choice words for this whole family of formats: as a family, its members haven't really talked to each other for over thirty years.


On the other hand, most programs don't use most of the main header. If you use it as a dump-in-a-file format (Dimensions, pixel mode, data, implied endianness), that works pretty well.

And while the extended headers are varied, they are also your best hope for storing more interesting data in a structured way.


MRC main header variants

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)


MRC came from the MRC-LMB

The CCP4 used the format in a different way, so had its own conventions.

The two eventually merged into MRC2000 (approx 1982(verify)) when agreements for compatibility were settled (verify)

(machine stamp settled somewhat later?)(verify).


MRC2000 seems to be the variant that most unnamed MRC variants tend to adhere to, unless you can detect it as a more specific format.

There's also naming confusion. For example, people also say CCP4 when they may be conflating it with MRC2000, in uses where the differences in their definitions (origin, skew transformation) do not matter.


Extended headers are even more interesting -- more on that later.


Pre-MRC2000

Generally, you don't have to care about these versions: MRC2000 and CCP4 cover most ground these days. Digging into them will also not make you overly happy.


'old'

Mostly like the later Image2000 (which is now common), but instead of

xorg, yorg, zorg, cmap, stamp, and rms,

it has

nwave, wave1, wave2, wave3, wave4, wave5, wave6, zorg, xorg, yorg


imsubs

Derived from old-style MRC?(verify)



Two subtypes:

priism

Optical. (see [1])

Uses what later became MAP (and a few other things), so can be considered a separate thing.

http://msg.ucsf.edu/IVE/IVE-4-6-0-HTML/priism_mrc_header.html

CCP4

Basically refers to use of CCP4's maplib code


CCP4 is sometimes listed alongside the MRC-LMB variant, which refers to the origins of these two formats (and more certainly refers to the pre-MRC2000 format).

MRC2000

These days, this is the common-denominator format, at the time merging the CCP4 and LMB definitions.

More recently succeeded by MRC2014 (which mostly adds NVERSION, EXTTYPE).


Possibly not the same as image2000(verify), apparently named for its mention in imsubs2000(verify)



IMOD MRC

IMOD generally follows:

  • the older style before 2.6.20(verify)
  • the Image2000 header layout since 2.6.20


From MRC2000's view it uses some EXTRA space. From IMOD's view it reduces the EXTRA space (further) to 5 values from byte 132 to 151.

Specifically, it uses:

  • characters in byte position 104..107: EXTTYPE
    • signals how to interpret the extended header data, one of at least SERI, FEI1, AGAR.
    • You will often also need NINT and NREAL to interpret them
  • uint4 in byte position 152..155: IMODSTAMP
    • value 1146047817 (0x444F4D49) indicates IMOD type file, or other software that sets IMODFLAGS
  • uint4 in byte position 156..159: IMODFLAGS
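
A small sketch of checking for the IMOD stamp, using the byte positions listed above. The function name is made up for illustration, and the endianness is assumed to be known already. The flags are returned raw, since what each bit means is not covered in these notes.

```python
import struct

IMOD_STAMP = 1146047817  # 0x444F4D49, reads as "IMOD" in a little-endian file

def imod_flags(header, endian="<"):
    """Return the raw IMODFLAGS value if the IMOD stamp is present, else None."""
    # IMODSTAMP is the uint32 at bytes 152..155, IMODFLAGS the one at 156..159.
    stamp, flags = struct.unpack_from(endian + "2I", header, 152)
    return flags if stamp == IMOD_STAMP else None
```

A `None` result only says the stamp is absent. The file may still be readable as plain MRC2000.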






UCSF MRC

Mostly MRC2000 style, adds some of its own (some from PRIISM?)

http://msg.ucsf.edu/IVE/IVE-4-6-0-HTML/em_mrc_header.html


Has the Agard-style extended header

FEI MRC

From what I've read so far, 'FEI' in this context can refer to two distinct adaptations:

  • a variant of the MRC2000 main header, similar to but distinct from the UCSF alterations
  • the UCSF-style FEI extended header (adds a few fields not in the UCSF extended header, doesn't change how it's parsed)


MRC2014

In practice a clarification of a number of MRC2000 details, plus some mild extensions.

More a compliance marker that varied files can declare they adhere to than a complete standard of its own.

For example, IMOD now follows MRC2014 (so in the sense of "meaning in main header", you can consider that the third version of IMOD's format: old, 2000, 2014).


MRC2014 addresses some extended uses by adopting IMOD's EXTTYPE (apparently adding its own types to it, MRCO and CCP4 - TODO: check their code to see what exactly each means). It does not seem to define the structure of any extended header, though the code the paper refers to does parse most of them.


Formalized in A Cheng et al. (2015) "MRC2014: Extensions to the MRC format header for electron cryo-microscopy and tomography", or rather, http://www.ccpem.ac.uk/mrc_format/mrc2014.php

MRC extended header variants

Agard-style extended header
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Used at least by:

  • UCSFtomo/UCSFImage
  • FEI EPU, Xplore3D, Amira, Avizo

(Agard refers to David Agard)

The extended header consists of records, each containing NINT integers, then NREAL float32s.


FEI's specs and files in the wild seem to imply that for known-to-be-FEI files, you should always assume NINT=0 and NREAL=32 (even if not set), making for 128-byte records (32 float32 values), and also fixes the amount of these records at 1024 (records >NZ will be zeroed out), so the FEI extended header will basically always be 131072 bytes large.

Note that there is also a newer-style extended FEI header - see #FEI1 extended header below.


UCSF's records are (4*NINT + 4*NREAL) large and it seems there are always exactly NZ of them.

(I've seen(verify) much larger NREAL than seems useful, not sure what that is)

If the extended header size (according to NSYMBT) is exactly NX*NY*4 bytes larger than that, that's a float32 (Mode 5) gain reference image present within the extended header immediately after the just-mentioned records.
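
The size arithmetic above can be sketched as follows. This is only my reading of the notes, not something from either spec. The function and key names are made up. `ext_size` is the extended header size from the main header, and `fei=True` applies the "always NINT=0, NREAL=32, 1024 records" assumption.

```python
def agard_layout(nint, nreal, nz, nx, ny, ext_size, fei=False):
    """Work out record size/count for an Agard-style extended header."""
    if fei:
        # Assumption from FEI files in the wild: fixed 32 float32 values
        # per record and always 1024 records, whatever the header says.
        nint, nreal, count = 0, 32, 1024
    else:
        # UCSF: it seems there are always exactly NZ records.
        count = nz
    record_size = 4 * nint + 4 * nreal
    records_size = record_size * count
    gain_size = nx * ny * 4  # one float32 per pixel
    return {
        "record_size": record_size,
        "record_count": count,
        "records_size": records_size,
        # Exactly one image's worth of extra float32s after the records.
        "has_gain_reference": ext_size == records_size + gain_size,
        "leftover": ext_size - records_size,
    }
```

For a known-FEI file this gives 128-byte records and 131072 bytes in total. A nonzero `leftover` that is not a gain reference is a hint that the assumptions do not hold for that file.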


The meaning of each field seems to be settled purely by position: being the n-th int or the n-th real in its respective set.


TODO: figure out the ints.


UCSF defines 13 for the reals:

  • alpha tilt
  • beta tilt
  • stage X position
  • stage Y position
  • stage Z position
  • image X shift
  • image Y shift
  • nominal defocus as read from microscope
  • exposure time
  • mean intensity value in image
  • orientation of the tilt axis
  • pixel size
  • magnification
  • ...and leaves 19 undefined


FEI adds three more:

  • value of high tension (volts)
  • binning used in acquisition
  • intended application defocus
  • ...leaving 16 undefined
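
Since the reals are identified purely by position, naming them is just a matter of zipping a list of names against the unpacked values. A sketch, where the key names are my own shorthand for the lists above (not names from either spec) and the endianness is assumed known:

```python
import struct

# Positional names, in the order listed above (hypothetical key names).
UCSF_REALS = [
    "alpha_tilt", "beta_tilt", "stage_x", "stage_y", "stage_z",
    "image_shift_x", "image_shift_y", "defocus", "exposure_time",
    "mean_intensity", "tilt_axis", "pixel_size", "magnification",
]
FEI_EXTRA_REALS = ["high_tension", "binning", "applied_defocus"]

def read_reals(ext_header, index, nint, nreal, endian="<", fei=False):
    """Return the named reals of record number `index` (zero-based)."""
    names = UCSF_REALS + (FEI_EXTRA_REALS if fei else [])
    record_size = 4 * nint + 4 * nreal
    # Within each record the NINT ints come first, then the NREAL reals.
    offset = index * record_size + 4 * nint
    values = struct.unpack_from(f"{endian}{nreal}f", ext_header, offset)
    # zip() stops at the shorter list, so undefined trailing reals are dropped.
    return dict(zip(names, values))
```

For FEI files you would call this with `nint=0, nreal=32, fei=True`, per the assumption discussed earlier.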



FEI1 extended header

e.g. MRC2014's testset has a file named fei-extended.mrc with extended type FEI1 that has NSYMBT of 786432 (exactly 768K).

That code (mrcfile) also has a data definition for it. Each record is currently 768 bytes (this number is also the first int32), specifically:

  <i4  Metadata size
  <i4  Metadata version
  <u4  Bitmask 1
  <f8  Timestamp
  S16  Microscope type
  S16  D-Number
  S16  Application
  S16  Application version
  <f8  HT
  <f8  Dose
  <f8  Alpha tilt
  <f8  Beta tilt
  <f8  X-Stage
  <f8  Y-Stage
  <f8  Z-Stage
  <f8  Tilt axis angle
  <f8  Dual axis rotation
  <f8  Pixel size X
  <f8  Pixel size Y
  S48  Unused Range
  <f8  Defocus
  <f8  STEM Defocus
  <f8  Applied defocus
  <i4  Instrument mode
  <i4  Projection mode
  S16  Objective lens mode
  S16  High magnification mode
  <i4  Probe mode
  <b1  EFTEM On
  <f8  Magnification
  <u4  Bitmask 2
  <f8  Camera length
  <i4  Spot index
  <f8  Illuminated area
  <f8  Intensity
  <f8  Convergence angle
  S16  Illumination mode
  <b1  Wide convergence angle range
  <b1  Slit inserted
  <f8  Slit width
  <f8  Acceleration voltage offset
  <f8  Drift tube voltage
  <f8  Energy shift
  <f8  Shift offset X
  <f8  Shift offset Y
  <f8  Shift X
  <f8  Shift Y
  <f8  Integration time
  <i4  Binning Width
  <i4  Binning Height
  S16  Camera name
  <i4  Readout area left
  <i4  Readout area top
  <i4  Readout area right
  <i4  Readout area bottom
  <b1  Ceta noise reduction
  <i4  Ceta frames summed
  <b1  Direct detector electron counting
  <b1  Direct detector align frames
  <i4  Camera param reserved 0
  <i4  Camera param reserved 1
  <i4  Camera param reserved 2
  <i4  Camera param reserved 3
  <u4  Bitmask 3
  <i4  Camera param reserved 4
  <i4  Camera param reserved 5
  <i4  Camera param reserved 6
  <i4  Camera param reserved 7
  <i4  Camera param reserved 8
  <i4  Camera param reserved 9
  <b1  Phase Plate
  S16  STEM Detector name
  <f8  Gain
  <f8  Offset
  <i4  STEM param reserved 0
  <i4  STEM param reserved 1
  <i4  STEM param reserved 2
  <i4  STEM param reserved 3
  <i4  STEM param reserved 4
  <f8  Dwell time
  <f8  Frame time
  <i4  Scan size left
  <i4  Scan size top
  <i4  Scan size right
  <i4  Scan size bottom
  <f8  Full scan FOV X
  <f8  Full scan FOV Y
 <c16  Element
  <f8  Energy interval lower
  <f8  Energy interval higher
  <i4  Method
  <b1  Is dose fraction
  <i4  Fraction number
  <i4  Start frame
  <i4  End frame
  S80  Input stack filename
  <u4  Bitmask 4
  <f8  Alpha tilt min
  <f8  Alpha tilt max
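
The field sizes in this listing add up to exactly 768 bytes when packed without padding, so a record can be decoded with a packed little-endian struct. As a sketch that does not depend on mrcfile or numpy, the following reads only the leading fields (up to and including alpha tilt). The function and key names are made up for illustration.

```python
import struct

# Leading fields of one FEI1 record, packed little-endian ("<" = no padding):
# i4, i4, u4, f8, four S16 strings, then three f8 values.
FEI1_PREFIX = struct.Struct("<iiId16s16s16s16sddd")
FEI1_PREFIX_NAMES = (
    "metadata_size", "metadata_version", "bitmask_1", "timestamp",
    "microscope_type", "d_number", "application", "application_version",
    "ht", "dose", "alpha_tilt",
)

def read_fei1_prefix(ext_header, index=0):
    """Decode the first few fields of FEI1 record number `index`."""
    # The record size is stored as the first int32 of the first record.
    (record_size,) = struct.unpack_from("<i", ext_header, 0)
    if record_size < FEI1_PREFIX.size:
        raise ValueError("implausible FEI1 record size: %d" % record_size)
    values = FEI1_PREFIX.unpack_from(ext_header, index * record_size)
    record = dict(zip(FEI1_PREFIX_NAMES, values))
    for key, value in record.items():
        if isinstance(value, bytes):
            # Fixed-width strings are padded with NUL bytes.
            record[key] = value.rstrip(b"\x00").decode("ascii", errors="replace")
    return record
```

Taking the record size from the data rather than hardcoding 768 is deliberate, given that the notes say the size is only "currently" 768.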

SerialEM extended header

On MODE

As far as I can tell:

Standard:

  • 0: signed int8 (see notes on signed/unsigned mode 0 below)
  • 1: signed int16
  • 2: float32 (IEEE-style)
  • 3: pairs of signed_int16 (complex, for FFTs)
  • 4: pairs of float32 (complex, for FFTs)
and 6 is a typical convention (see below)


Additions:

  • IMOD has the basic five mentioned above, plus
    • has its own flag that lets 0 mean unsigned int8 instead
    • 6: unsigned_int16
    • 16: RGB in 3*unsigned_int8
    • 101: 4-bit data
  • imsubs has the basic five mentioned above, plus
    • 5: signed_int16 (so identical to 1?(verify))
    • 6: unsigned_int16
    • 7: signed_int32
    • 101? Seems to be my misreading
  • image2000 has the basic five mentioned above, plus
    • 6 is unsigned_int16
  • FEI has the standard 0..4
    • Does not seem to mention changes (from what, though? presumably image2000?)
    • also seems to mention it's always unsigned_int16, and the mode is 6 in all EPU files I have seen so far
  • MRC2014
    • 6 is unsigned_int16


Notes:

  • Mode 6: useful and common enough to consider standardish for a while now
    • so supporting reading of mode 6 is a good idea these days.
    • When writing 16-bit data, you could choose to check whether your data suits mode 1 (sint16) without value ambiguity: basically, whether all uint16 values are in 0..32767.
  • Mode 0:
    • was historically signed int8, and e.g. CCP4 stands by this.
    • other variants would write unsigned int8 to it, which you often can detect and fix
    • IMOD has a flag that signals which one it's using
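
A sketch of the conservative reading of the mode list: the five standard modes plus mode 6, and the "does this uint16 data fit in mode 1" check mentioned in the notes. The names are made up for illustration, and the variant-specific modes (5, 7, 16, 101) are left out on purpose.

```python
# mode -> (description, bytes per pixel)
MODE_INFO = {
    0: ("int8 (signedness varies by writer)", 1),
    1: ("int16", 2),
    2: ("float32", 4),
    3: ("complex, pair of int16", 4),
    4: ("complex, pair of float32", 8),
    6: ("uint16", 2),
}

def bytes_per_pixel(mode):
    """Bytes per pixel for the commonly agreed modes, else a ValueError."""
    if mode not in MODE_INFO:
        raise ValueError("unsupported or variant-specific MODE: %r" % (mode,))
    return MODE_INFO[mode][1]

def fits_mode_1(values):
    """True if unsigned 16-bit values can be written as mode 1 unambiguously."""
    return all(0 <= v <= 32767 for v in values)
```

If `fits_mode_1` is true you can write mode 1 and be readable by software that predates mode 6. Otherwise mode 6 is the honest choice.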

On images versus stacks versus volumes versus volume stacks

There are some conventions around.

MRC2014 proposes to make this a little more formal.


From CCP4's description of MRC2000 (see e.g. [2]) and some programs, it seems that:

if ISPG=0 and MZ=1 and NZ=1:
    image
else if ISPG=0 and MZ=1 and NZ>1:
    image_stack
else if ISPG=401:
    volume_stack
else if ISPG=1 and NZ=MZ and MZ>0:
    volume
else
    dont_know
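
The same decision list as runnable Python, as a direct transcription of the rules above rather than anything authoritative. The function name is made up.

```python
def classify_contents(ispg, nz, mz):
    """Guess what an MRC file holds from ISPG, NZ and MZ."""
    if ispg == 0 and mz == 1 and nz == 1:
        return "image"
    if ispg == 0 and mz == 1 and nz > 1:
        return "image_stack"
    if ispg == 401:
        return "volume_stack"
    if ispg == 1 and nz == mz and mz > 0:
        return "volume"
    return "dont_know"
```

Expect `dont_know` fairly often with real-world files, which is the point of keeping that branch.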


This does not consider movie-mode data, which in itself is okay since you can just call them stacks -- except in the case of movie-mode tilt series, e.g. as output by modern UCSFTomo. Here the details of the file are still implied by main+extended header values. You can e.g. tell

  • tilt series by changing tilt angle, and
  • movie-mode tilt series by having both changing tilt angle overall and having a bunch of adjacent ones have the same angle.
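
That heuristic can be sketched like this, given the per-frame tilt angles pulled from the extended header. The function name and the tolerance value are assumptions of mine, so pick a tolerance that suits how your microscope reports angles.

```python
def tilt_series_kind(tilt_angles, tolerance=0.01):
    """Classify a frame sequence by how the tilt angle changes."""
    # Group adjacent frames whose angle stays close to the group's first frame.
    groups = []
    for angle in tilt_angles:
        if groups and abs(angle - groups[-1][0]) <= tolerance:
            groups[-1].append(angle)
        else:
            groups.append([angle])
    if len(groups) <= 1:
        return "stack"  # the tilt angle never changes
    if any(len(group) > 1 for group in groups):
        return "movie_mode_tilt_series"
    return "tilt_series"
```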

I really don't know about how much the real world follows this, and/or other conventions. e.g. one of MRC2014's example files has ISPG=4 NZ=25 MZ=72 which means what?


There are also some file extensions that suggest one or the other (at least, when NZ>1), e.g. mrcs, st / vol

On NSYMBT and NEXT

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

tl;dr: A uint32 at byte position 92..95 (in zero-based counting) gives the number of extra bytes that follow the basic 1024-byte header, before the image data starts.

So if you read this value position-based, you're good for all formats(verify). (If you read the header into named fields, be careful of confusing NSYMBT and NEXT)
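
Read position-based, that looks something like the sketch below. The function names are made up, the endianness is assumed known, and `bytes_per_pixel` is whatever you derived from MODE.

```python
import struct

def data_offset(header, endian="<"):
    """Byte offset at which the image data starts."""
    # uint32 at bytes 92..95: size of everything between the main header
    # and the data, whether a variant calls it NSYMBT or NEXT.
    (ext_size,) = struct.unpack_from(endian + "I", header, 92)
    return 1024 + ext_size

def expected_file_size(header, bytes_per_pixel, endian="<"):
    """File size implied by the header, for comparison with the real size."""
    nx, ny, nz = struct.unpack_from(endian + "3i", header, 0)
    return data_offset(header, endian) + nx * ny * nz * bytes_per_pixel
```

Comparing `expected_file_size` against the actual file size is a cheap sanity check that you picked the right endianness and mode.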


Its classical name, NSYMBT, refers to group symmetry records. The value would often be 0 (no extra data) or 80.

  • Older formats like MRC2000, and also MRC2014, call it NSYMBT


Other formats named it NEXT, probably just to indicate it can be other things

  • such as FEI, UCSF, IMOD
  • UCSF/FEI style seems to have the renamed NEXT at the usual position (uint4 at byte position 92..95), but also an NSYMBT (uint2 at byte position 90..91). TODO: figure out what's up with that. FEI suggests it will always be zero, suggesting it can be safely ignored.


Some formats also signal what the extended header contains.

such as IMOD and MRC2014 (sort of)

On EXTRA

EXTRA refers to the MRC variants saying some parts of the main header are undefined and can be used for user-defined use.

From the view of the widest definition of EXTRA fields (that in MRC2000), most other variants (CCP4, UCSF, IMOD, FEI, MRC2014) redefine some of these for themselves. Sometimes based on others, often not so much.


On MAP

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Helps signal what sort of contents the voxel data has

Values include

  • MAP
    various.
  •     
    (i.e. four spaces) various
  • MRC
    IMG/MRC(verify)
  • IMG
    IMG-MAP(verify)
  • MTZ
    contains structure factors, see [3] [4]


In the early days it was more indicative than it is now.

e.g. CCP4 would often use MAP, while pre-2000 MRC typically would leave it empty (not being defined there, falling in its EXTRA range), as would IMOD(verify) (though modern IMOD uses it, presumably to comply with MRC2014)

On MACHST

tl;dr:

  • Since there are enough files without a valid MACHST, you probably want endianness autodetection regardless.
  • When it is present (not everything uses it) and seems valid (not everything uses it correctly), you could choose to use it.
  • but frankly detecting endianness, and assuming floats are IEEE style, is correct for most or all modern MRC data


In earlier days, there were more standards around for the coding of integers and floating point numbers. MRC records those, and the endianness, of the data it contains (presumably to allow architectures to save/load their native way for speed, while also allowing interchange in a verifiably correct way).

Since most computers these days are little endian and use IEEE style floats, most programs just assume this (or rather their native format, which is this), which usually works.
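
A sketch of endianness autodetection with MACHST as a fallback. The plausibility bounds (dimensions below 65536, a fixed set of modes) are my own assumptions, not from any spec. The stamp bytes are the values I understand MRC2014 and CCP4 to use (0x44 0x44 or 0x44 0x41 for little-endian, 0x11 0x11 for big-endian, at bytes 212..213).

```python
import struct

KNOWN_MODES = {0, 1, 2, 3, 4, 6}

def machst_endianness(header):
    """Endianness according to the machine stamp, or None if unrecognised."""
    stamp = bytes(header[212:214])
    if stamp in (b"\x44\x44", b"\x44\x41"):
        return "<"
    if stamp == b"\x11\x11":
        return ">"
    return None

def guess_endianness(header):
    """Try both byte orders and keep the one giving plausible fields."""
    for endian in ("<", ">"):
        nx, ny, nz, mode = struct.unpack_from(endian + "4i", header, 0)
        # Assumption: real dimensions are positive and well below 65536.
        # A byte-swapped small dimension always reads as >= 65536.
        if mode in KNOWN_MODES and all(0 < n < 65536 for n in (nx, ny, nz)):
            return endian
    return None

def detect_endianness(header):
    """Prefer the heuristic, fall back to MACHST; None if both fail."""
    return guess_endianness(header) or machst_endianness(header)
```

The return values are `struct` byte-order prefixes, so they can be passed straight into format strings.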


On MAPC, MAPR, MAPS

tl;dr:

  • Files that use unusual ordering will often have these filled, and valid (actually a permutation of 1,2,3)
  • Not everyone uses them, or uses them correctly.
  • If not filled / not valid (0,0,0, 1,1,1, or such), good luck assuming / detecting.

You can hope your detector is square, so that the most likely mistake (swapping rows and columns) happens to only be a transpose.
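
The validity check is tiny. A sketch, assuming the three int32 values sit at bytes 64..75 as in MRC2000/MRC2014 and that the endianness is already known (the function name is made up):

```python
import struct

def axis_order(header, endian="<"):
    """Return (MAPC, MAPR, MAPS) if they form a valid permutation, else None."""
    order = struct.unpack_from(endian + "3i", header, 64)
    # Valid means: each of 1, 2, 3 appears exactly once.
    if sorted(order) != [1, 2, 3]:
        return None
    return order
```

On `None` you are back to assuming the usual (1, 2, 3) ordering, or guessing from context.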







STAR and CIF files

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)


STAR is Self-defining Text Archive and Retrieval

CIF is Crystallographic Information File, a restricted form of STAR that is (thereby) a little easier to use and parse.

Both are structured, but look like text files, and while you can sometimes get away with hackish parsing (e.g. when it's just storing a flat table), proper interpretation needs a real parser (there is an EBNF spec).
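
To illustrate the hackish end of that spectrum: the sketch below reads only the first `loop_` table in a file and assumes every value is a bare token, with no quoted strings containing spaces and no multi-line text fields. It is not a STAR parser. The function name is made up for illustration.

```python
def parse_first_loop(text):
    """Read the first loop_ table as a list of {column: string} dicts."""
    columns, rows = [], []
    state = "before"  # before -> header -> rows
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if state == "before":
            if line == "loop_":
                state = "header"
            continue
        if state == "header" and line.startswith("_"):
            # Column label, possibly followed by a "#1"-style index comment.
            columns.append(line.split()[0].lstrip("_"))
            continue
        if line.startswith(("_", "loop_", "data_")):
            break  # next item or block: the table has ended
        state = "rows"
        values = line.split()
        if len(values) != len(columns):
            raise ValueError("row does not match column count: %r" % line)
        rows.append(dict(zip(columns, values)))
    return rows
```

All values come back as strings. Converting them is left to the caller, who knows what the columns mean.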




SerialEM mdoc

These are IMOD autodoc files, which are "named sections and key-value pairs within sections"[5] (INI-like)

...with some further conventions like that

TODO: find decent summary/reference
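
In the absence of a proper reference, here is a sketch of reading the INI-like structure. It assumes section headers look like `[Type = name]`, items look like `key = value`, items before the first section are global, and `#` starts a comment line. The function name is made up and none of the further conventions are handled.

```python
def parse_autodoc(text):
    """Return (global_items, sections), sections as (type, name, items) tuples."""
    global_items, sections = {}, []
    current = global_items
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("[") and line.endswith("]"):
            # Section header, e.g. "[ZValue = 0]".
            kind, _, name = line[1:-1].partition("=")
            current = {}
            sections.append((kind.strip(), name.strip(), current))
            continue
        key, sep, value = line.partition("=")
        if sep:
            current[key.strip()] = value.strip()
    return global_items, sections
```

Values are kept as strings. Keys repeated within a section overwrite earlier ones here, which may or may not be what the format intends.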


IMOD model files

IMOD model files are binary files that can contain contours, meshes, and surfaces, and a bunch of detail on how to draw them.

http://bio3d.colorado.edu/imod/betaDoc/binspec.html


There is also an ASCII variant, which allows only a subset but can be easier to show/transfer data.

http://bio3d.colorado.edu/imod/betaDoc/asciispec.html


Notes:

  • It seems that points (e.g. fiducials) are stored as length=1 contours
  • There are a bunch of model-file-related IMOD commands
    • see e.g. https://bio3d.colorado.edu/imod/doc/program_listing.html


EMI and SER

TIA (Tecnai Imaging and Analysis) is FEI software for the Tecnai and Titan microscopes (based on ES Vision from emispec, acquired by FEI). It writes:

  • .emi
    • contains mostly metadata, can also contain images
    • proprietary: file structure is not published
  • .ser (a.k.a. "ES Series Data, ESD"(verify))
    • contains one or more images / spectra
    • published details, see TIAseriesformat.pdf


What exactly TIA writes depends on what you were using it for.

  • Often it's emi+ser with all of the useful image data in ser
  • In some cases you only get emi (and you can only convert to other formats from within TIA)


Software that reads SERs includes:

  • TIA, obviously. It should be installed on your FEI support PCs.
You can do batch conversion of a bunch of files
ser stacks: can read, cannot export (verify)
ser stacks: works
There is also a macro you may care about called BatchConvertSer.ijm
ser stacks: opens many windows without a clue about ordering. Batch converter takes only first image(verify)
  • xmipp-image-convert (Xmipp)
    • ser stacks: seems to work (though failed for my case due to a code bug)
  • em2em (IMAGIC)
    • ser stacks: single image only? (verify)


Caveats:

  • most software does not understand emi at all
  • some software does not understand the case of ser containing more than one image (stacks)
  • software may make its own assumptions, and therefore not read all cases
  • movies/stacks in ser seem unsupported in some versions of TIA itself (and even when you can read it you apparently cannot convert it to other things from within TIA)
    • Other software can read it in theory, but may react differently (e.g. TIAReader opens a window for each, with no clue as to which is which frame, many other things read only the first frame)



Unsorted

mrcz

mrcz is compressed mrc.

It is a specific file structure based on the blosc library, allowing for clever adaptive choice of compression per block of data, and easier use of fast multicore compressors like lz4 and zstd.

While possibly a bit overdesigned, one of the neat ideas is that it allows fast decompression at moderate compression, so that you can get read speed close to uncompressed speeds while saving space.


Personal take:

  • For short-term processing, it's relevant that nothing much seems to read mrcz yet -- and that ZFS can give me lz4 fully transparent to programs (but I see that that's not an option for everyone, and meaningless for transfers)
  • For transfer, I use something (currently) easier to explain to everyone (e.g. gzip)
  • For long-term archiving (where CPU cost is one-time, so the cost comes from storage), you can use something that compresses better than lz4 (like bzip2 - though maximum-compression zstd seems to compress slightly better than it(verify))



Situs format

Apparently plain MRCs with extensions suggesting what they are and sometimes how they should be treated

vol

Spider volumes.


spe

Princeton Instruments CCD


stk

stack

st

stack


EMSA/MAS

http://www.iso.org/iso/catalogue_detail.htm?csnumber=56211

https://www.iso.org/obp/ui/#iso:std:iso:22029:ed-2:v1:en

http://journals.cambridge.org/action/displayFulltext?type=1&fid=128304&jid=MAM&volumeId=8&issueId=S02&aid=128303

Gatan

Some notes of its MRCs being IMOD-style(verify)


DM2

Written by Digital Micrograph versions 2.something. DM 2.5 changed its version tag from 200 to 250

DM3 and DM4

Are fairly similar in setup, though vary in sizes of some elements.



EMX

Electron Microscopy Exchange, basically seems to be a standardization using XML to describe the most important acquisition-time parameters of CCP4-style MRC.
