Binary files, text files
Latest revision as of 14:13, 16 January 2024
What do these terms even mean?
Pragmatically,
- text file = "All data is useful as text"
  - characters in a sequence that you could edit at will in the simplest kind of "one character after another" editor
  - human-interpretable, human-editable
- binary file = "not just text". It's a catch-all.
  - a binary file is one you probably can't edit without severely breaking its structure
    - and where it probably wouldn't occur to you to try, e.g. because the most useful data isn't text to start with.
  - probably not human-readable, probably not human-editable
Even that needs footnotes, and we haven't even gotten technical yet.
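As a rough illustration of how pragmatic this distinction is, here is a minimal Python sketch of the kind of heuristic a tool might use to guess "text or binary". The function name `looks_like_text` and the two rules it applies (no NUL bytes, decodes as UTF-8) are assumptions made for this example, not a standard definition.

```python
def looks_like_text(data: bytes) -> bool:
    """Guess whether a chunk of bytes is text. A heuristic, not a definition."""
    # NUL bytes are rare in text stored in ASCII-compatible encodings,
    # and common in most structured binary formats.
    if b"\x00" in data:
        return False
    # Assumption for this sketch: "text" means "decodes cleanly as UTF-8".
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(looks_like_text(b"hello, world\n"))                       # plain ASCII
print(looks_like_text(b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR"))  # start of a PNG file
```

Note that this guess misfires in both directions: text stored as UTF-16 is full of NUL bytes, and a short run of binary data can happen to be valid UTF-8.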
'Binary' seems to come from a time before a lot of different file formats existed,
where computer use was computer programming,
and where we mostly had code that humans wrote,
and code in compiled, machine-readable form.
The compiler output was often called 'the binary', and that usage persists. So arguably it's short for 'a binary executable' or some such term.
Plain text file
There is arguably no such thing as a plain text file
That is, it turns out there are a number of ways to encode special characters, and they are hard to tell apart except by guessing.
...exactly because in a just-characters file, we chose not to store what encoding is being used.
Depending on the encoding, there may be values you don't expect in another type of text file.
You need to know more about the data that's in there to read it correctly.
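A small Python sketch of that ambiguity: the same bytes decode without error under several encodings, and only one of the results is what the writer meant. The word "café" and the three encodings tried here are arbitrary choices for illustration.

```python
# The writer stored "café" as UTF-8, but the file itself does not say so.
raw = "café".encode("utf-8")

# All three decodes succeed; nothing in the bytes tells us which one is right.
readings = {enc: raw.decode(enc) for enc in ("utf-8", "latin-1", "cp1252")}
for enc, text in readings.items():
    print(f"{enc:8} -> {text!r}")

# The reverse direction: bytes written as Latin-1 contain a value (0xE9)
# that is not valid at that position in UTF-8, so a UTF-8 reader rejects them.
latin1_bytes = "café".encode("latin-1")
try:
    latin1_bytes.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
print("latin-1 bytes valid as UTF-8:", valid_utf8)
```

Decoding errors like the last one are about the only hard evidence a reader gets. When every candidate encoding decodes cleanly, choosing between them comes down to guessing which result looks most like real text.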