Language units large and small
On the smaller side
Note that the interrelation between similar concepts is often not simple/orthogonal.
For example, a morpheme may contain/span multiple syllables (e.g. la·dy), but a single syllable may also consist of multiple morphemes (e.g. dog·s).
Morpheme combinations into lexemes may follow complex morphological rules.
Graphemes vs. glyphs vs. characters
In approximate use these are near-synonyms.
In more technical use (font makers, linguists) they are still regularly misunderstood, misapplied, and some not entirely unambiguous - so accept some fuzziness in the wording you find in practice.
A glyph is a particular representation of a character, while a character is the abstract concept represented by it. In other words, the same character written in a different style (say, sans serif versus calligraphic) is considered a different glyph but the same character.
A grapheme is a semantically indivisible written unit. That is, if you split it up, it would lose said meaning.
A character is usually one of:
- basic grapheme (letter / symbol),
- main grapheme plus combining graphemes (often diacritics), or
- a (full(verify)) ideogram
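As a rough illustration of "main grapheme plus combining graphemes": Unicode can store é either as one precomposed code point or as a base letter followed by a combining accent. The sketch below uses Python's standard unicodedata module; the helper name code_point_names is made up for this example.

```python
import unicodedata

def code_point_names(text):
    """Decompose text (NFD) and return the Unicode name of each code point."""
    decomposed = unicodedata.normalize("NFD", text)
    return [unicodedata.name(ch) for ch in decomposed]

precomposed = "\u00e9"   # é stored as a single code point
combined = "e\u0301"     # e followed by a combining acute accent

# Same character to a reader, different lengths to a program
print(len(precomposed), len(combined))
print(code_point_names(precomposed))

# Composing (NFC) maps both spellings to the same sequence
print(unicodedata.normalize("NFC", combined) == precomposed)
```

This only shows how one encoding standard models the idea; it says nothing about glyphs, which are a font matter.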
Of course, judgement of style difference is fairly subjective. Style differences include systematic differences in placement and presence of curls and lines, like the different ways of writing a 4 or 7, but many other things face a lot of disagreement. You could for example argue that mere line thickness doesn't matter, but calligraphy argues against that.
Letters refer to phonetic elements of alphabet-type writing systems.
Technically ambiguous term, symbol in this context tends to refer to graphemes that aren't phonetic, diacritics or punctuation. Symbols often stand for ideas, and are used, relatively sparsely, in languages that aren't themselves logographic.
Arguably, phonetic units in syllabaries could be called syllabograms.
Diacritics are graphemes added to phonetic characters, to alter pronunciation details.
Diacritics are often known as things that appear above or below, but appear in pretty much any way you can think of: connected, not connected, around, through, even inside.
Punctuation covers, roughly, the non-phonetic written details of a language.
Punctuation includes punctuation marks, capitalisation, word spacing, and to some degree indentation.
The phoneme is the easiest to explain, being a uniquely useful sound.
More exactly, a phoneme is a collection of phones that has taken an (indivisible) identity in a specific language.
See also articulation.
School tends to explain syllables in terms of character boundaries for some reason, but in linguistics it's defined entirely in terms of sound - generally a sonorous sound with consonant sounds around it - and in alphabetic writing systems, you cannot necessarily describe the split as a collection of graphemes.
There tend to be clear patterns to the language's speaking rhythm.
A mora (plural moras or morae) is a measure of length. Syllables are quite often one or two moras long, sometimes three.
Syllable structure is
- Onset (optional, no morae) typically a consonant
- Rhyme or Rime
- Nucleus (one/two(/three?) morae) (vowel or approximant)
- Coda (optional, one/two morae) (consonant or approximant)
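The structure above can be sketched as a small data type. This is a toy model, not a phonological analysis: the Syllable class, the long_nucleus flag, and the mora-counting rule (onset counts nothing, short nucleus one, long nucleus two, any coda adds one) are assumptions for illustration, and whether codas carry weight differs per language.

```python
from dataclasses import dataclass

@dataclass
class Syllable:
    onset: str                  # optional, "" when absent
    nucleus: str                # vowel or approximant
    coda: str = ""              # optional, "" when absent
    long_nucleus: bool = False  # assumed flag for a two-mora nucleus

    @property
    def rime(self):
        """The rime is the nucleus plus the coda."""
        return self.nucleus + self.coda

    def morae(self):
        """Count morae under the assumed toy rule described above."""
        count = 2 if self.long_nucleus else 1
        if self.coda:
            count += 1
        return count

cat = Syllable(onset="k", nucleus="a", coda="t")
print(cat.rime, cat.morae())
```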
For example, simple alphabet systems may have cases in which the phonemes from a grapheme belong to different syllables(verify).
See also syllabization.
US English hyphenation tends to follow syllables. UK English values etymology/morphemes more.
Morphemes are units of language that convey meaning.
A morpheme loses a particular meaning when divided further, so it is indivisible for a given meaning - though not necessarily indivisible at all. (from Morphology#Some_terms)
Also, think wider than nouns. It can be a root, a full word, an affix that inflects a word, and other things.
Morphemes are the units that morphology works with.
- in "unbelievable", the morphemes are 'un', 'believe' and 'able'
- in 'dogs', the morphemes are 'dog' and 's'
- in dog, the word is made of the single morpheme
Other examples of morphemes include 'aud' (as in hear) 'chrom' (for color), affixes such as 'ness', and many others.
A morpheme may be/act as a lexeme. Lexemes are often one, two, or generally very few morphemes.
While morphemes can be said to be (form,meaning) pairs, this does not mean that all words are constructed in the same way; for example, plurality doesn't always come from the 's' morpheme.
In written and/or spoken form of a particular language (exceptions to one or both include languages with phonetic and ideographic scripts), morphemes may appear in slightly altered forms, often because of interaction with phonology. Think elision, consonant doubling(verify), and such. (verify)
Morphemes and syllables/graphemes/phonemes
- a syllable can have multiple morphemes (as in 'dog·s')
- a morpheme can span multiple syllables (as in 'be·lieve').
- a syllable may have different meaning when used as different types of morpheme (e.g. the word "less" versus the -less suffix)
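To make the mismatch concrete, here is a small sketch with hand-annotated data. The ANALYSES table and the counts helper are made up for this example, and the syllable splits are one plausible annotation rather than an authoritative one.

```python
# Hand-annotated toy data: word -> (syllables, morphemes)
ANALYSES = {
    "dogs": (["dogs"], ["dog", "s"]),
    "believe": (["be", "lieve"], ["believe"]),
    "lady": (["la", "dy"], ["lady"]),
    "unbelievable": (["un", "be", "liev", "a", "ble"], ["un", "believe", "able"]),
}

def counts(word):
    """Return (number of syllables, number of morphemes) for an annotated word."""
    syllables, morphemes = ANALYSES[word]
    return len(syllables), len(morphemes)

for word in ANALYSES:
    print(word, counts(word))
```

Neither count predicts the other, which is the point: the two segmentations answer different questions.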
Morphemes rely on meaning, which means they can be taken to be made of phonemes (consider minimal pairs), as this is more exact than referring to them as a-chunk-of-letters-in-a-specific-sense.
Morphemes can often be fairly easily intuited from syllables and/or pronunciation, although neither is a very simple mapping. Complications include allomorphs (morphemes that are realized with different phonemes; for example, the English plural marker 's' is often realized as either /s/ or /z/).
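Allomorph choice is often predictable from sound. For the regular English plural, the usual textbook description is /ɪz/ after sibilants, /s/ after other voiceless consonants, and /z/ elsewhere. A minimal sketch of that rule, assuming the caller already knows the final phoneme of the stem (the function name and the phoneme sets are this example's own, and irregular plurals are ignored):

```python
SIBILANTS = {"s", "z", "ʃ", "ʒ", "tʃ", "dʒ"}
VOICELESS = {"p", "t", "k", "f", "θ"}

def plural_allomorph(final_phoneme):
    """Pick the regular English plural allomorph from the stem's final phoneme."""
    if final_phoneme in SIBILANTS:
        return "ɪz"   # e.g. horse -> horses
    if final_phoneme in VOICELESS:
        return "s"    # e.g. cat -> cats
    return "z"        # e.g. dog -> dogs, and after vowels

for stem, final in [("cat", "t"), ("dog", "g"), ("horse", "s")]:
    print(stem, "/" + plural_allomorph(final) + "/")
```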
Funnily, there isn't a good universal definition of a word.
Perhaps the most practical definition is "grammatical unit that can enter into a (syntactical or morphological) construction with another at the phrase level" (which stems from the idea of a constituent, which is basically that definition but says any level rather than phrase level).
This sounds formal, though it is really a more pragmatic "acts like a word" definition -- it's sort of a fancy way of saying "something you write that you can do something with, can use".
It's only a little circular ('phrase level' may need to define words), and a little fuzzy ('grammatical unit' is wide).
The concept of a word, as used by different people, is fuzzier than that, though. When you count words, you usually don't analyse grammatical units, you count spaces, i.e. are looking at orthography, not meaning.
Except when we are - we can only really get specific when we restrict it to languages and language families.
Say, in the counting-the-spaces sense, German has three times more vocabulary than almost everyone else. Not because they know a lot more words, but because they write them together, in part to clarify what modifies what in certain kinds of compounds. Most other languages are happy just writing them next to each other, or maybe hyphenating them.
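The counting-the-spaces notion is easy to show. In this sketch the function name is made up, and the English line is only a rough gloss of the German compound:

```python
def space_word_count(text):
    """Count 'words' purely orthographically: runs of non-whitespace."""
    return len(text.split())

german = "Donaudampfschifffahrtsgesellschaft"
english_gloss = "Danube steamship company"

print(space_word_count(german))         # one orthographic word
print(space_word_count(english_gloss))  # three orthographic words
```

Both strings name roughly the same thing, so the difference in count reflects spelling convention, not the number of concepts.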
Lexeme, lemma, paradigm
For example, 'run', 'running' and 'ran' are lexical units that are considered part of the same lexeme.
More formally (particularly by those with a preference for optimality theory) it is regularly considered the collection of lexical units - pairs of lexical form and meaning - that is, the set of things which inarguably have the same basic meaning (and generally all look quite alike).
The lemma (or, in lexicography, headword) is a chosen canonical form of a lexeme.
It is usually the least marked of all the inflected variants, and in many cases is a complete usable form.
For any one language, there is typically consistency in terms of tense and plurality and such.
This is often also the simplest morphological form (e.g. in many languages, if the root is a free morpheme, it's that; if not, it usually takes a fairly neutral bound morpheme or two (verify), though there are exceptions to all of that).
For example the lemma for 'run', 'running' and 'ran' is 'run'.
Dictionaries in most languages list things under the lemma, not any inflected forms.
- So the lemma is the form you'd look up in there
- which form that is varies (a little) per language
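In software terms, a lemma works like a lookup key shared by all inflected forms. A minimal sketch with a hand-written table (the LEMMAS table and both helper names are assumptions for this example; real lemmatizers use much larger lexicons plus morphological rules):

```python
# Hand-written toy table: inflected form -> lemma
LEMMAS = {
    "run": "run", "runs": "run", "running": "run", "ran": "run",
    "produce": "produce", "produced": "produce",
}

def lemma_of(form):
    """Look up the lemma; fall back to the lowercased form when unknown."""
    key = form.lower()
    return LEMMAS.get(key, key)

def group_by_lemma(forms):
    """Group word forms under their lemma, like dictionary entries."""
    groups = {}
    for form in forms:
        groups.setdefault(lemma_of(form), []).append(form)
    return groups

print(group_by_lemma(["run", "running", "ran", "produced"]))
```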
The root is similar but may suggest a more linguistic, morphological take - e.g. for produce, produced, the lemma is produce, while the root may be mentioned as produc-.
In many languages more specifically meaning 'all forms of a lexeme'.
- There is sometimes no obvious choice of a lemma in a set of words for a lexeme.
- Lemmas are also sometimes understood as the whole group (instead of the abstract concept behind it) - 'all variants of a word around a specific root' (largely inflection/derivations).
- Details are noticeably different (mostly simpler) in systems primarily using logographic writing, as there is often only one symbol (and/or pronunciation) for a concept, and where variations are handled by compounds of symbols (as e.g. in Japanese).
- Lexemes are often morphologically determined, and can be used to refer to a set of direct morphological variations
- Translation is often interested in lexical units, because they tend to show up as the things constant on both ends of the translation -- assuming it wasn't part of a MWE
A lexical unit, most broadly, is any part of a sentence acting as a meaningful unit and up for interacting with other parts of the sentence.
In this sense it can include:
- function words
- content words - tend to be lone lexical units, when not part of compounds
- phrasal verbs (but not all sorts of phrases - some are purely compositional parts of lexical units)
- compounds. (non-compositional compounds are more easily recognized as such by us)
- many things that fall under the header collocations or multi-word expressions can be lexical units, particularly institutionalized phrases, idioms (and some of them may be quite long)
Longer things can act as lexical units too, including clauses, sentence/text frames, though this can be more of a push. Some prefer terms like lexical chunk for longer things that behave as units.
On the larger side
Broadly and approximately...
- compounds are single words that contain multiple lexical units
- phrases are separate words that (inter)act as a single unit within a larger structure (...usually, definitions vary somewhat)
- clauses are a combination that have a predicate and a subject, but not always a complete expression of meaning (yet a good part of the way)
- sentences are typically considered one or more clauses, and are often complete expressions of meanings
The morphosyntactic reasons behind compounds seem to vary between languages.
This ends up meaning that compounds may be much like phrases in one language, but not in another.
In some contexts, linguists see little to no difference.
In others, they may disagree on a deep level.
For any one language, the details are a lot more settled and you can get a lot more specific in descriptions of what a compound is and isn't, and what it does and doesn't do.
Yet almost none of those details hold across all languages, so there exists no good general definition of either that is true for all languages, even if you stop very soon after 'multiple roots' and 'non-clausal' and 'probably sticks together in a sentence'.
This doesn't mean the terms are bad, it just means that we're trying to catch too much interesting variation, too many languages' distinct gradations, into just two words, and we just need to acknowledge this with better qualifiers, and/or pointing out our terminology is specific to the language(s) we're currently talking about...
There are many properties we can ascribe more to one than the other, but these are often soft, tend to vary between languages, and tend to have exceptions even within a language.
- compounds tend to be more compact, phrases can be longer (basically as long as they don't qualify for a clause)
- compounds are fairly fixed sequences, arguably a type of multiword expressions, (longer) phrases can have alterable structure
- a (noun) compound is more likely to name, a (noun) phrase more likely to describe
- e.g. floppy disk is a name, not a description (or rather, grew quickly from a description into a name)
- with exceptions; what is written apart so looks like phrases in German and Dutch may well act like naming compounds,
- dictionaries will generally list compounds, and rarely list phrases
- probably largely because (and when) phrases are considered compositional
- compounds are somewhat more likely to become lexicalized and less-compositional. Arguably to have become considered compounds for that reason
- "ice cream" (originally "iced cream") has its own meaning that is not purely "cream that has been iced."
- lexicalization also makes things more complex, though
- Consider that high school is a compound, small school is a phrase
- some things may exist as compounds as well as phrases
- e.g. grown-up as a compound (naming an adult), grown-up as an adjective phrase
- a number of compound nouns (often with a hyphen) look a lot like a verb phrase.
- compounds may resist e.g. inflection of its parts (probably because it touches on a specific lexicalized meaning), phrases will more easily allow it
- e.g. deep sleep could be modified to deeper sleep, overproduction or footpath not so much
- this could be explained as a "that would change its meaning, which you may not want" (or which may be your hobby), and not so much a "morphosyntactics forbid it", though (verify)
Complexified by language change
- the above examples exist in part because English (more than some other languages) allows open compounds, closed compounds, and hyphenated compounds, and which is the correct writing for a given compound can
- change over time
- differ with use (e.g. day to day and day-to-day may or may not differ depending on how they are used in a sentence)
- more generally, compounds may become phrases, phrases may become compounds
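For text processing, a practical consequence is that one compound can show up in several spellings. A minimal sketch of a crude normalization key (the function name is made up; note that this deliberately throws away the spacing and hyphenation that may distinguish uses such as day to day and day-to-day):

```python
def compound_key(text):
    """Collapse spaces and hyphens so open, hyphenated and closed spellings compare equal."""
    return text.lower().replace("-", "").replace(" ", "")

variants = ["ice cream", "ice-cream", "icecream"]
keys = {compound_key(v) for v in variants}
print(keys)
```

Whether that loss is acceptable depends on the task: it helps when counting or matching terms, and hurts when the written form itself carries the distinction.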
Complexified by other things that come from real use, not analysis.
- In languages that compound heavily (e.g. German, Dutch), there may be a thinner boundary between the two concepts to start with
- with compounds being mostly compositional and mostly working as an interpretation aid
- In languages that agglutinate/synthesize/inflect heavily (e.g. Turkish), there may be a thinner boundary between the two concepts to start with
- in languages that don't like agglutination, but still do it sometimes, the reasons for whether it's smushed together or not can seem more arbitrary
- Consider how in English, compounds may change between spaced, hyphenated and solid (written as a single word) over time.
- Some compounds exist because of a sort of slightly misled lexicalization from repeated use, bastardization, or such.
And then there's things like polysynthetic languages, which can blur the lines between compound and phrase and even clause.
And then you run into concepts like phrasal compounds which seem, by their definition, to be somewhere between regular, arguably-lexicalized things like phrasal verbs and phrasal adjectives, and any multi-word expression / collocation (any pattern we can describe).
More on compounds
(much of this applies to more languages, but some of it is more specific; TODO: separate out better)
There are various combinations of lexical categories that are possible. This is perhaps not directly interesting in terms of functionality, but certain details may apply to certain types of compounds (and you may be able to do some probabilistic guessing by basing compound predictions on the observation that the types appear in different amounts)
Also known as noun compounds, nominal compounds, compound nominals and other terms, and usually clearly the most common type of compound (partly because most of a vocabulary is nouns, partly because as content lexemes, they combine easily to make specific and articulated concepts)
One of the two nouns is often substantively used as an adjective, in English usually the first noun.
Some languages allow relatively free combinations, which challenges the concept of what a word is, and the nature of words' relations with concepts. See also compounds. This is one of the reasons behind linguists considering lexemes to be more useful than 'word'.
Some languages are particular fans of agglutinating their compounds - noticeably German, but also (to somewhat lesser extents) other Germanic languages such as Dutch and Swedish, as well as Finnish and Hungarian (Finno-Ugric group), and others. Other languages only do it for specific functional markers, which are often (largely) considered affixes.
Native speakers of such languages often have a sense of incorrectness when such compounds are separated into the stems they are made of. That sense is often based on increased ambiguity, as agglutination is syntactic and separation could be said to be falling back on unmarked form, so provides additional/alternative interpretations and relations between the same stems. In some cases this is about ambiguity, in some it's about lexicalization, and occasionally personal preference also plays a significant role.
The incorrectness and its sense has names, like Deppenleerzeichen in German, and losschrijfziekte or Engelse ziekte (literally something like 'separate writing sickness' and 'English sickness' respectively) in Dutch, the last referring to the apparent English preference for using phrases over (long) compounds (though English may still have such phrases be unsplittable, and occasionally hyphenate them to mark that - though hyphens in words usually get removed over time).
- a.k.a. Compound verb(verify)
- e.g. passer-by
- e.g. month-long
Adverb-noun compounds e.g. bystander
Adverb-verb compounds e.g. overthrow, output
Verb-adverb compounds e.g. drawback
Adjective-noun compounds e.g. redhead
Adjective-verb compounds e.g. public speaker, dry cleaning
More on phrases
Phrases aren't always called phrases
Phrases - function and (semi-formal) type
Analytically, phrases derive their main function from their head, which the phrase type is named for, and which may not be first element (as 'head' may suggest). The best known phrase types are the verb phrase (VP) and noun phrase (NP) but by the above 'from its head' definition, many other types of phrases can be said to exist, some of which are not as useful or analytically natural/neutral as others.
Note also that many are ignored to shorten blackboard/whiteboard examples.
The very basics are usually:
It is also fairly common to see reference to:
- PP: Prepositional phrase, e.g. "through the looking glass"
- AP: Adjectival phrase, e.g. "full of toys"
- Adverbial phrase, e.g. "very carefully" (modifies a verb phrase, adjectival phrase, or clause)
- Adpositional phrase
- In XP, the X is meant as a 'fill this in' letter. It is a reference to the X-bar theory.
- Phrases often contain other types of phrases. There are various common patterns, which you can often see as grammars in parsers (often the simpler ones).
- In sentence parse trees, you also see similar elements that are more clause than phrase based, such as:
- RC: Relative Clause, e.g. "who wasn't there" in "the man who wasn't there" (which functions adjectivally)
- Note that the above are not always exclusive, particularly if they look for different types of functions.
A sentence is usually seen as a tree, so that phrases can contain phrases. For example, verb phrases may easily have noun phrases for their subject and object.
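A minimal sketch of that tree idea, using nested tuples. The bracketing of the example sentence is one plausible hand analysis, and the node format and helper names are this example's own:

```python
# A node is (label, children); a leaf is a plain string (a word).
tree = ("S", [
    ("NP", ["I"]),
    ("VP", [
        "gave",
        ("NP", ["a", "notebook"]),
        ("PP", ["to", ("NP", ["a", "monkey"])]),
    ]),
])

def words(node):
    """Return the words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    _, children = node
    out = []
    for child in children:
        out.extend(words(child))
    return out

def phrases(node, label):
    """Yield the text of every subtree carrying the given label."""
    if isinstance(node, str):
        return
    node_label, children = node
    if node_label == label:
        yield " ".join(words(node))
    for child in children:
        yield from phrases(child, label)

print(list(phrases(tree, "NP")))
print(list(phrases(tree, "VP")))
```

Note how the last noun phrase sits inside a prepositional phrase, which itself sits inside the verb phrase.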
Phrasal verbs are semantic units that act as a verb, but are more than one word/lexeme long.
There are a number of commonly used criteria to describe behaviour of variations, but not everyone uses all the related terms in the same way. Phrasal verbs and closely related concepts are also known under other names (such as multi-word verb (MWV), compound verb, verb-adverb combination, verb-particle construction (VPC), and others. Multi-word verbs often refer to phrasal verbs in general, or sometimes specifically verb-adverb combinations; in such a context, phrasal verb would likely refer to verb-preposition combinations.)
In English, phrasal verbs are often verb-particle combinations (fairly common), which can be subdivided into:
- prepositional phrasal verbs: verb plus preposition, e.g. ran into, look at
- adverbial phrasal verbs: verb plus adverb, e.g. drop off, work out
- verb with both an adverb and preposition, e.g. look forward to (verify)
Related terms include particle verbs, referring to phrasal verbs that contain an adverb, and prepositional verb, referring to those that contain a preposition.
- Phrasal verbs are responsible for many figurative/idiomatic/non-compositional constructions, for example "I hope you will get over your operation quickly.", "I ran into him when he ran away from home.".
- separability and transitivity are usually phrase-specific; they must often be memorized to be used correctly (and may be a problem for L2 learners)
- in many cases, the word combination in a phrasal verb can also appear in free composition - where the words have separate grammatical/semantic meaning
- Phrasal verbs are different from verbs with helpers (as this is semi-implicitly defined to be non-idiomatic)
- Many phrasal verbs have more than one possible meaning. Consider "I dropped off while dropping off a friend, just where the hill drops off."
- Phrasal verbs are regularly intransitive, but can also be transitive. ("They quickly made up a story and showed up.").
- phrasal verbs may be separable or inseparable expressions. Consider for example that "Switch off the light" is equivalent to "Switch the light off". When separable, it's mostly for inserting nouns/pronouns via re-ordering, e.g. "We quickly summed it up."
- There are various exceptions, preferences and rules specific to subtypes of phrasal verbs.
- delexicalization (specifically delexicalized verbs)
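Separability matters when you try to spot phrasal verbs in text, because the particle may not be adjacent to the verb. A naive sketch, assuming pre-split tokens and exact word forms; the function name and the max_gap limit are made up for this example, and it cannot tell a phrasal verb from free composition of the same words:

```python
def find_phrasal_verb(tokens, verb, particle, max_gap=3):
    """Return (verb_index, particle_index) of the first match, or None.

    max_gap is an assumed limit on how many tokens may sit between
    the verb and its particle (0 means they must be adjacent).
    """
    for i, token in enumerate(tokens):
        if token.lower() != verb:
            continue
        window = tokens[i + 1 : i + 2 + max_gap]
        for offset, candidate in enumerate(window):
            if candidate.lower() == particle:
                return i, i + 1 + offset
    return None

print(find_phrasal_verb("Switch off the light".split(), "switch", "off"))
print(find_phrasal_verb("Switch the light off".split(), "switch", "off"))
```

Handling inflected forms (switched, switching) would need lemmatization first, which this sketch leaves out.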
Sentence structures tend to have patterns that, if we declare phrases and types of them to be Very Real Things, would probably fall on us to then describe as behaviour of that phrase type.
General phrase types
You could say there are dependent phrases, that can only work when attached to something else.
For example, adpositional phrases and other things that act like adverbs/adpositions/adjectives, exclude a subject themselves.
For example, a Noun Phrase can be non-recursive (also: 'base noun phrase' , 'noun chunk' ) or recursive, the latter indicating that the NP is itself complex, and both its contents and structure (think tree with head) may be of semantic value. A non-recursive NP, on the other hand, is usually at most a determiner, a list of adjectives and a (possibly compound) noun. (verify) (For this reason, non-recursives are often used for simple cases in e.g. data extraction)
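The "determiner, adjectives, noun" shape of a non-recursive NP is simple enough to match with a regular expression over part-of-speech tags. A minimal sketch, assuming the input is already tagged (the tags here are hand-assigned, Penn-Treebank-style, and the function name and pattern are this example's own):

```python
import re

# Hand-tagged toy input: (word, tag)
tagged = [
    ("the", "DT"), ("quick", "JJ"), ("brown", "JJ"), ("fox", "NN"),
    ("jumped", "VBD"), ("over", "IN"),
    ("a", "DT"), ("garden", "NN"), ("fence", "NN"),
]

# Optional determiner, any number of adjectives, one or more nouns.
# \b keeps the pattern from matching in the middle of a longer tag.
NP_PATTERN = re.compile(r"(?:\bDT )?(?:\bJJ )*(?:\bNNS? )+")

def base_noun_phrases(tagged_words):
    """Return the text of each non-recursive noun phrase found."""
    tag_string = "".join(tag + " " for _, tag in tagged_words)
    chunks = []
    for match in NP_PATTERN.finditer(tag_string):
        # Each tag is followed by one space, so spaces count tokens.
        start = tag_string[: match.start()].count(" ")
        length = match.group().count(" ")
        chunks.append(" ".join(w for w, _ in tagged_words[start : start + length]))
    return chunks

print(base_noun_phrases(tagged))
```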
Phrase versus clause
You could call a phrase a syntactic structure (in the sense of being a structure at smaller syntactic scale), and a clause a more grammatical unit.
"A monkey," "to a monkey," and "gave to a monkey" could be said to be phrases,
"I gave a notebook to a monkey" is a clause - an independent clause that is also a full sentence.
The difference between an independent clause and a phrase may be clear enough, yet you might point out that dependent clauses are incomplete, like phrases. Still, dependent clauses tend to act as distinct and moderately complete thoughts, while phrases often do not.
The difference between phrases and dependent clauses perhaps lies in that phrases do not require subjects or a finite verb, and dependent clauses basically do (with the clause's predicate needing a verb).
A clause can be a complete sentence, or part of one.
A fairly semantic judgement based on whether the clause expresses a full thought.
Independent clauses, a.k.a. main clauses, express a complete thought. A single clause is generally a simple sentence.
Compound clauses contain at least one independent clause as the main clause.
Dependent clauses, a.k.a. subordinate clauses or embedded clauses, are clauses that contain a subject and predicate, but cannot be used as a sentence, as(/when) they do not express a complete thought (which may be a criterion you cannot necessarily judge in grammatical terms(verify)).
(see also dependent phrase)
One reason a clause may be dependent is that it refers to a subject instead of explicitly containing one.
Another is that a dependent clause has been made from an independent clause by starting a compound construction for some sort of elaboration, by adding a dependent word/connective, often a subordinating conjunction such as 'because', or a relative pronoun such as 'which'.
A relative clause is a dependent clause that modifies a noun phrase (e.g. noun, pronoun).
For example, in the phrase "the man who wasn't there", the noun man is modified by relative clause who wasn't there.
The purpose of a relative clause is often to be:
In terms of syntax, relative clauses can...
- be introduced by relative pronouns, such as 'who' in the last example, used in various (indo-?)European languages
- be introduced by relativizers (a type of conjunction), as in many Semitic languages
- rely purely on positioning, such as in Japanese(, Chinese?)
A sentence is typically considered a clause, combination, and/or structural nesting of clauses.
For example: "I said I gave a notebook to a monkey." both is a clause, and contains one.
For example, there are two independent clauses in "I think I'm a teapot, but I don't care", since both have their own subject and predicate, with the connective 'but' in between.
Utterances point out that speech is often a lot more loosely structured, in particular with incomplete fragments, and with grammar that may be less correct than it would be in writing.
'Utterance' often refers to a unit of speech, probably bounded by pauses or silence. This leaves it slightly open whether that's a mechanical thing (e.g. taking a breath) or an intent thing.
As such, an utterance could be multiple words or sentences, a sentence may be multiple utterances.
Utterances are more directly important e.g. in transcription.
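In the mechanical, pause-bounded sense, splitting a transcript into utterances can be sketched as follows. The input format (word, start and end times in seconds), the function name, and the 0.5 second threshold are all assumptions for illustration; real transcription tools choose their own criteria.

```python
def split_utterances(timed_words, min_pause=0.5):
    """Group (word, start_sec, end_sec) items into utterances.

    A new utterance starts whenever the silence since the previous
    word is at least min_pause seconds. Input must be in time order.
    """
    utterances = []
    current = []
    previous_end = None
    for word, start, end in timed_words:
        if previous_end is not None and start - previous_end >= min_pause:
            utterances.append(" ".join(current))
            current = []
        current.append(word)
        previous_end = end
    if current:
        utterances.append(" ".join(current))
    return utterances

speech = [
    ("so", 0.0, 0.2), ("I", 0.3, 0.4), ("gave", 0.4, 0.7),
    ("um", 1.5, 1.7), ("a", 1.8, 1.9), ("notebook", 1.9, 2.5),
]
print(split_utterances(speech))
```

Here one sentence ends up as two utterances, matching the point that the two units do not line up.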
Discourse units are mostly relevant when two or more people are talking: we tend to have units that construct the turns we take.