Words and meanings




==Putting meanings or numbers to words==
===Knowledge base style===
===Statistic style===
====Collocations====
Collocations are statistically idiosyncratic sequences - the math that is often used asks
"do these adjacent words occur together more often than the occurrence of each individually would suggest?".
This doesn't ascribe any meaning,
it just tends to signal anything from
empty habitual etiquette,
to jargon,
to various [[substituted phrases]],
to many other things that go beyond purely [[compositional]] construction -
because why else, other than common sentence structure, would they co-occur so often?
...actually, there are varied takes on how useful [[collocations]] are, and why.
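
To make that math concrete, here is a minimal sketch using NLTK's collocation finder, ranking adjacent pairs by pointwise mutual information. The tiny corpus is a stand-in assumption; any tokenized text works:

<syntaxhighlight lang="python">
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# stand-in corpus (an assumption; real use wants much more text)
text = ("he kicked the bucket . strong tea and strong coffee . "
        "she kicked the bucket . strong tea again")
words = text.lower().split()

finder = BigramCollocationFinder.from_words(words)
finder.apply_freq_filter(2)              # ignore pairs seen only once
measures = BigramAssocMeasures()
print(finder.nbest(measures.pmi, 5))     # pairs that co-occur more than chance would suggest
</syntaxhighlight>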
====Word embeddings====
<!--
{{zzz|For context|
Where people are good at words and bad at numbers, computers are good at numbers and bad at words.
So it makes sense to express words ''as'' numbers?
...that goes a moderate way, but if you want to approximate the contained meaning ''somehow'',
that's demonstrably still too clunky.
To illustrate one limitation, you could say that 'saw' is represented by a specific number.
Such an enumeration means that "we saw the saw" cannot be expressed in numbers without those two words ''having'' to be the same thing.
You can imagine that's not a problem for, say, antidisestablishmentarianism.
There's not a lot of subtle variation in its use, so you can pretend it has one meaning, one function.
Can't you just have multiple saws? Sure, but how do you decide?
Saw as a verb, saw as a noun? Aside from cases where variants inflect somewhat irregularly, or not visibly,
the larger problem is that that's somewhat circular, depending on sort-of-already-knowing.
In reality, having a good parse of sentence structure is just ''not'' independent from its meaning,
you end up having to do both concurrently.
It turns out that a lot of language ''does'' need to be resolved by the context of what words do to nearby words.
And it turns out the most commonly used words are often the weirder ones.
This entanglement seems to help things stay compact, with minimal ambiguity, and without requiring very strict rules,
which are things that natural languages seem to like to balance (even [[conlang]]s like Lojban engineer this balance),
and has some other uses - like [[double meanings]], and intentionally generalizing meanings.
So we're stuck with compact complexity.
We can try to model absolutely everything we do in each language, and that might even give us something more useful in the end.
Yet imagine for a moment a system that would just pick up 'saw after a pronoun' and 'saw after a determiner',
and not even because it knows what pronouns or determiners are, but because given a ''load'' of examples,
those are two of the things the word 'saw' happens to be next to.
Such a system ''also'' doesn't have to know that it is modeling a certain verbiness and nouniness as a result.
It might, from different types of context, perhaps learn that one of these contexts relates it to a certain tooliness as well.
In fact, such a system won't and ''can't'' explain such aspects.
So why do it?
Well, it learns these things without us telling it anything, and the similarity it ends up learning helps with things like:
: "if I search for walk, I also get things that mention running",
: "this sentence probably says something about people moving"
: "walk is like run, and to a lesser degree swim",
: "I am making an ontology to encode these and would like some assistance to not forget things"
Also, this is an [[unsupervised]] technique. Yes, you will probably get more precise answers from the same amount of data with a supervised technique, i.e. with well-annotated data,
but it is usually harder to get a good amount of well-annotated data (and you can have endless discussions about the annotations),
and much easier to get ''tons'' of un-annotated data.
It may pick up relations only softly, and in ways you can't easily tease out,
but it's pretty good at not missing them,
meaning you don't have to annotate half the world's text with precise meaning
for it to work.
You just feed it lots of text.
There are limitations. It might pick up on subtleties,
but like any unsupervised technique,
it tends to be better at finding things than at describing ''what'' it found.
You would get further by encoding 'saw as a verb' and 'saw as an implement' as different things,
sure, but that only solves how to ''store'' that knowledge, not how to find it out.
This is not a great introduction, because there are multiple underlying issues here,
and we aren't even going to solve them all, we will just choose to go just half a step fuzzier,
to a point where we can maybe encode multiple uses of words (and if they have a single one, great!).
Word embeddings are usually explained as "a vector representation for a word".
Such vectors are useful for things like subject similarity, sentiment analysis, and syntactic parsing.
The fact that it's a vector isn't that important.
It happens to be mathematically handy to work with,
and we happen to want to end up with vector values where similar vectors ''hopefully'' carry similar meanings.
'Embeddings' is a bit of a strange term for that concept,
and seems to point out that (with most methods we use today) training these considered the context each word appears in.
And when using the same model to figure out what unseen text means, it may well assign the tool-ish sense, or the verb-ish sense, based on similar context.
In fact, some patterns are strong enough that even unseen words will get a decent estimation.
Say, feed "we spoiaued" to an embedding-style parser, and it may well guess that 'spoiaued' is a verb ''while also'' pointing out it's out of its vocabulary.
Note that
* you won't really know what these vectors mean.
: You can fish this out, to a degree, because you can compare vectors.
: say, if any given vector compares much better to 'hammer' or to 'see' (e.g. from example sentences), you can start to figure out what it meant.
* the thing you train does not necessarily use context when you apply it
: In ''use'', the assigned vector is typically not dependent on the current context,
: but the vector that was learned earlier ''was'' dependent on the context in the training data. {{verify}}
: (you can call this [[distributional similarity]]: a word is characterized by the company it keeps)
These vectors come from machine learning (of varying type and complexity).
* LSA
* [[word2vec]] -
: technically patented?
: T Mikolov et al. (2013) "Efficient Estimation of Word Representations in Vector Space"
* tok2vec
* FastText
: https://fasttext.cc/
* lda2vec
: https://multithreaded.stitchfix.com/blog/2016/05/27/lda2vec/#topic=38&lambda=1&term=
All of this may still assign a single vector to the same word always (sometimes called static word embeddings).
This is great for unambiguous content words, but less so for polysemy and homonymy.
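
As a minimal sketch of training such static embeddings, here with gensim's word2vec implementation:

<syntaxhighlight lang="python">
from gensim.models import Word2Vec

# toy corpus of tokenized sentences (an assumption; real training wants far more text)
sentences = [
    ["we", "saw", "the", "saw"],
    ["she", "saw", "a", "bird"],
    ["he", "cut", "wood", "with", "a", "saw"],
    ["the", "saw", "cut", "the", "wood"],
]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=200)

vec = model.wv["saw"]                        # one static vector, regardless of context
print(model.wv.most_similar("saw", topn=3))  # nearby words in the vector space
</syntaxhighlight>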
-->
=====Contextual word embeddings=====
<!--
The first attempts at word embeddings were typically static vectors,
meaning that the lookup (even if trained from something complex)
always gives the same vector for the same word.
This was mostly to keep the data manageable, and to keep it a simple and fast lookup.
They help in broader tasks like estimating the overall topic of a text,
and more detailed tasks where ambiguity is low,
but will do poorly in the specific cases where a word's meaning depends on context.
''' ''Contextual'' word embeddings''', on the other hand,
learn about words ''in a sequence''.
This is still just statistics, but the model you run will give a vector that depends on the surrounding words.
Depending on how much context you pay attention to,
this is even a modestly decent approach to machine translation.
Consider "we saw a bat".
In theory, it may assign
* a vector to the verb saw that is more related to seeing than to cutting
* a vector to the noun bat that is more related to other animals than it is to sports equipment
That said, that is not a given.
: There are at least four options and it might land on any of them, particularly for a sentence in isolation
: the meanings we propose to be isolated here may get weirdly blended in training
Consider that "passing out" can mean anything from giving people things to losing consciousness.
Generally, the more commonly used the verb, the trickier it is.
We like the idea of one word, one meaning, but most languages ''really'' messed that one up.
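
A minimal sketch of getting such context-dependent vectors, assuming the HuggingFace transformers library and the bert-base-uncased model; the helper function is illustrative, not a standard API:

<syntaxhighlight lang="python">
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def word_vector(sentence, word):
    # illustrative helper: the vector of the (first) token matching the word
    enc = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]   # one vector per token
    tokens = tok.convert_ids_to_tokens(enc["input_ids"][0].tolist())
    return hidden[tokens.index(word)]

v_see = word_vector("we saw a bat", "saw")
v_tool = word_vector("he sharpened the saw", "saw")

# same surface word, different context, so noticeably different vectors
print(torch.cosine_similarity(v_see, v_tool, dim=0))
</syntaxhighlight>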
-->
=====Subword embeddings=====
<!--
A model where a word/token can be characterized by something ''smaller'' than that exact whole word.
Often just because words happen to share large fragments of characters
(not because of stronger analysis like good lemmatization, or strongly compositional agglutination (e.g. Turkish).
Chances are it will pick up on such strong patterns, but that's more a side effect of them indeed being regular).
Examples:
fastText
[https://d2l.ai/chapter_natural-language-processing-pretraining/subword-embedding.html]
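
A minimal sketch with gensim's fastText implementation, which builds vectors from character n-grams, so even unseen words get an estimate (the toy corpus is an assumption):

<syntaxhighlight lang="python">
from gensim.models import FastText

# toy corpus (an assumption; real training wants far more text)
sentences = [
    ["the", "sawdust", "covered", "the", "floor"],
    ["he", "cut", "wood", "with", "a", "saw"],
]
model = FastText(sentences, vector_size=50, min_count=1, min_n=3, max_n=5, epochs=100)

# even an unseen word gets a vector, built from the character n-grams it shares
print(model.wv["sawing"][:5])
</syntaxhighlight>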
-->
=====Bloom embeddings=====
<!--
A [[bloom filter]] applied to word embeddings to get better-than-nothing embeddings from something very compact.
https://explosion.ai/blog/bloom-embeddings
https://spacy.io/usage/v3-2#vectors
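
A minimal sketch of the idea (not spaCy's actual implementation): hash each word a few ways into a small table and sum the rows, so the table can be far smaller than the vocabulary:

<syntaxhighlight lang="python">
import numpy as np

rows, dim, num_hashes = 1000, 32, 4                     # assumed sizes
table = np.random.randn(rows, dim).astype(np.float32)   # would be trained in a real system

def bloom_vector(word):
    # each word hits a few rows; collisions are tolerated, as in a bloom filter
    # (a real system would use a stable hash rather than Python's randomized one)
    idxs = [hash((word, seed)) % rows for seed in range(num_hashes)]
    return table[idxs].sum(axis=0)

print(bloom_vector("walk").shape)   # (32,), from a table much smaller than the vocabulary
</syntaxhighlight>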
-->
<!--
===(something inbetween)===
====semantic folding====
https://en.wikipedia.org/wiki/Semantic_folding
-->
<!--
====Hyperspace Analogue to Language====
-->
===Could be either style===
====Semantic similarity====
<!--
Semantic similarity is the broad area of [[metric]]s between words, phrases, and/or documents.
Some people use this term specifically when it is based on strongly coded meaning/semantics (ontology style),
because this lets you make stronger statements,
contrasted with similarity based only on [[lexicographical]] details or word embeddings.
...the latter is fuzzier, but also tends to give a reasonable answer to a lot of things the more exact approach has no answer for at all.
https://en.wikipedia.org/wiki/Semantic_similarity
This can include
* words that appear in similar context (see also [[word embeddings]], even [[topic modelling]])
* words that have similar meaning
* Topological / ontological similarity - based on more strongly asserted properties
:: '''Semantic similarity''' may well refer more specifically to ''only'' "is a" and other [[ontology|ontological]] relationships, and ''not'' just "seems to co-occur"
:: '''Semantic relatedness''' then might also include [[antonyms]] (opposites), [[meronyms]] (part of whole), [[hyponyms]]/[[hypernyms]]
The last tries to not only know e.g. 'car' and 'road' and 'driving' are related
but also roughly ''how'' they are related, in a semantic sense.
Semantic similarity is a metric defined over a set of documents or terms, where the idea of distance between items is based on the likeness of their meaning or semantic content.
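
For the ontology-style end, a minimal sketch using WordNet via NLTK (needs a one-time nltk.download('wordnet')):

<syntaxhighlight lang="python">
# ontology-style similarity: distance in WordNet's is-a hierarchy
from nltk.corpus import wordnet as wn

car = wn.synset("car.n.01")
truck = wn.synset("truck.n.01")
road = wn.synset("road.n.01")

print(car.path_similarity(truck))   # higher: nearby in the hypernym tree
print(car.path_similarity(road))    # lower: related in use, not ontologically similar
</syntaxhighlight>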
-->
===The distributional hypothesis===
The distributional hypothesis is the idea that
words that are used and occur in the same contexts tend to convey similar meanings - "a word is characterized by the company it keeps".
This idea is known under a few names,
but note that few of them really describe a technique,
or even the specific assumptions they make.
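
To make that concrete, a minimal sketch in plain Python (the toy sentences are an assumption): characterize each word by counts of its neighbors, then compare those count profiles.

<syntaxhighlight lang="python">
import math
from collections import Counter, defaultdict

def cooccurrence(sentences, window=2):
    # characterize each word by counts of the words around it
    counts = defaultdict(Counter)
    for sent in sentences:
        for i, w in enumerate(sent):
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if j != i:
                    counts[w][sent[j]] += 1
    return counts

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

sents = [["we", "walk", "to", "work"],
         ["we", "run", "to", "work"],
         ["we", "eat", "at", "home"]]
counts = cooccurrence(sents)
print(cosine(counts["walk"], counts["run"]))   # high: they keep the same company
print(cosine(counts["walk"], counts["eat"]))   # lower
</syntaxhighlight>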
<!--
Distributional Similarity
Distributional semantics
https://en.wikipedia.org/wiki/Distributional_semantics
'''Distributional similarity''' can refer to analysis that brings those patterns out, often with the goal of figuring out the relevant semantics,
for example on noun-verb combinations.
This also means being able to, in that example, e.g. predict noun similarity based on their likelihood of combining with the same verbs.
-->
===Moderately specific ideas and calculations===
====latent semantic analysis====
Latent Semantic Analysis (LSA) is the application of [[Singular Value Decomposition]] to text analysis and search.
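
A minimal sketch with scikit-learn (the documents are stand-ins): build a term-document matrix, then keep only the strongest SVD dimensions.

<syntaxhighlight lang="python">
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# stand-in documents (an assumption)
docs = [
    "the cat sat on the mat",
    "cats and dogs",
    "stocks fell sharply",
    "shares and stocks rallied",
]
X = TfidfVectorizer().fit_transform(docs)   # term-document matrix
lsa = TruncatedSVD(n_components=2).fit(X)   # keep the strongest latent dimensions
print(lsa.transform(X))                     # documents in the reduced 'semantic' space
</syntaxhighlight>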
====random indexing====
https://en.wikipedia.org/wiki/Random_indexing
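
A minimal sketch of the idea (names and sizes are assumptions): give each word a fixed sparse random 'index vector', and characterize a word by the sum of the index vectors of the words it co-occurs with.

<syntaxhighlight lang="python">
import numpy as np

dim, nonzero = 512, 8                 # assumed sizes
rng = np.random.default_rng(0)

def index_vector():
    # sparse ternary vector: a few random +1/-1 entries, the rest zeros
    v = np.zeros(dim)
    pos = rng.choice(dim, size=nonzero, replace=False)
    v[pos] = rng.choice([-1.0, 1.0], size=nonzero)
    return v

index = {}     # word -> fixed random index vector
context = {}   # word -> accumulated context vector

def add(sentences, window=2):
    for sent in sentences:
        for i, w in enumerate(sent):
            context.setdefault(w, np.zeros(dim))
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if j != i:
                    if sent[j] not in index:
                        index[sent[j]] = index_vector()
                    context[w] += index[sent[j]]
</syntaxhighlight>

Words that appear in similar contexts accumulate similar context vectors, without ever building the full co-occurrence matrix.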
====Topic modeling====
Roughly the idea that, given documents that are about a particular topic,
one would expect particular words to appear in each more or less frequently.
Assuming such documents share topics,
you can probably find groups of words that belong to those topics.
Assuming each document is primarily about one topic,
you can expect a larger set of documents to yield multiple topics,
and an assignment of one or more of these topics to each document, so it acts like a soft/[[fuzzy clustering]].
This is a ''relatively'' weak proposal in that it relies on a number of assumptions
(the largest probably being that your documents have singular topics),
but given that it requires zero training,
it works better than you might expect when those assumptions are met.
https://en.wikipedia.org/wiki/Topic_model
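
A minimal sketch with scikit-learn's LDA implementation (the documents are stand-ins; real use wants many more):

<syntaxhighlight lang="python">
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# stand-in documents (an assumption)
docs = [
    "the cat sat on the mat with another cat",
    "dogs and cats make good pets",
    "stocks fell and markets closed lower",
    "shares rallied as markets opened",
]
vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

words = vec.get_feature_names_out()
for topic in lda.components_:           # one word distribution per topic
    print([words[i] for i in topic.argsort()[-4:]])
print(lda.transform(X))                 # soft topic mixture per document
</syntaxhighlight>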


==See also==




==Semiotics==

Semiotics can be taken as the study of signs and how we relate them to meaning - any communication and any part of it.

Signs in this context are anything that can be interpreted to have a meaning, including but not limited to sounds, motions, gestures, and images.

(Meaningful things aren't even limited to things done intentionally. Signs that are present but not made intentionally include e.g. those used in medical diagnosis, as a symptom can be a sign of a medical condition.)


In a practical sense we often still focus on words, yet the term is used in part to remind you these are far from the only meaning-carriers, even in dialogue you consider to be word-based first.


'Sign process' is sometimes a synonym for semiotics, arguably a more self-explanatory name to those not already deep in the theory.

Sign process is also sometimes a little more specifically meant as "any process/activity that involves signs, and probably meaning".



You can argue that linguistics is mostly about intentional meaning, in a sense we try to keep compact and practical and structural, and semiology is a wider, more anthropological thing about any signs and symbols anyone may have used along the way.

As such, linguistics courses may skim over semiology, or use the term only for some of the more symbolic inbetweens we meet - analogy, allegory, metonymy, metaphor, symbolism, and also conduct, behaviour, and a lot of other sociology.


It also overlaps with philosophy, relating to structuralism, and more. (see e.g. Saussure)



See also:

==Philology, Etymology==

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Etymology is the study of the history of words and phrases, primarily reporting use, change, transfer between languages and such, often consisting of finding and proving (or disproving) these genetic-like relations.

While it is focused a little less on meaning, it often ends up mentioning things in terms of morphology simply because a lot of carried meaning comes from morphemes.


The term is related to philology, which refers to the extensive study of historical linguistics, which may easily include many aspects of a language at a time - grammar, culture, relevant politics, and more.



You can usually see systematic differences between cognates in similar languages, in terms of phonetics and meaning. Partly because of this, a game of guess-the-cognate plays a role in second language vocabulary acquisition. Of course, it's easy to guess false cognates and false friends this way and get rather confused.

Loaning across languages can also introduce various cognate relations.


===False etymology, folk etymology===

People regularly guess at etymologies.

And, with some regularity, incorrectly.

They may sound quite plausible, and some get a status similar to urban legend - they just won't die even though they're not hard to falsify.


This also sometimes leads to changes in how these words are used, which are essentially neologisms.

===Cognates===

Cognates are words that are related by origin, by common ancestry, but have started developing independently, and as such now have distinct meanings and etymologies.

Meanings may have drifted apart, or stayed similar and now serve to distinguish two similar concepts.

For example, the English skirt and shirt have a common Old English origin.


The term may be used to refer to cognates within a language, but note that words that sound about the same across languages, particularly those in the same language family, are also commonly considered cognates - often because of their common origin and their development within their own language.

For example, the word 'night' looks and/or sounds similar between most Indo-European languages.


Various cognates grow from long-term development, from common origins that unite two languages in the language tree. For example, English and German are fairly closely related, while English and Spanish's common ground is mostly in Latin.


===False cognates===

False cognates are words that are thought to be similar, usually because they look and sound it, but do not share a direct or common root.

For example, Latin habere (to have) and German haben (to have) seem like likely cognates, but these two words come from different origins. Tracing them back reveals that their similarity is coincidental; the sounds changed over time, and from distinct origins.


===Doublets (etymological twins)===

Repeated loaning from the same origin, often at different times, creates doublets, a.k.a. etymological twins: words with the same origin, but that are made distinct in the loaning process.

Could be considered a special case of cognates.

They regularly have similar but distinct meanings and/or uses, often (near-)synonyms, sometimes (near-)antonyms.


For example, English has
* fire and pyre
* aperture and overture
* carton and cartoon
* and, less recognizably, sovereign and soprano


Triplets also exist, but are fairly rare.

===False friends===

False friends are words that look quite similar between languages, but significantly differ in meaning.

They are easily misidentified when the words are homonyms, or cognates (false or not).

False friend recognition probably happens most often across languages. Literal translation of regular sentences and especially idioms regularly introduces false friends.

See also:

* Linguistic interference / linguistic transfer

==Onomastics==

Onomastics, also known as onomatology, is the study of names, and usually refers to its etymological aspects, like figuring out where they came from and how they evolved.


Not to be confused with Onomasiology (the means of expressing concepts).


==Lexicology==

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Lexicology is the general and objective study of words and their meanings.

In some ways, it is the lexical part of philology.

In other ways it is closely related to etymology, phraseology, and semantics.


Lexicography can be said to be the applied part of lexicology, as it studies the use of words.

'Lexicography' is also often used to refer to the compilation of a lexicon. Note that lexicographer is a comparatively specific word, referring to someone who writes dictionaries.


==Onomasiology==

Onomasiology concerns itself with the means of expressing concepts in language - often in either the context of lexicology, or more widely in the sense of lexical semantics.


Not to be confused with onomastics/onomatology, the somewhat more specific study of names.


==Semasiology==

Semasiology studies meaning starting from the word form, asking what a given word means - roughly the reverse direction of onomasiology.

https://en.wikipedia.org/wiki/Semasiology


==See also==

Etymological websites:

Names: