Spacy notes
What does it do?
SpaCy is a library focusing on fairly recent methods of analysing text.
It tries to be unopinionated about how things should be done,
and tries to separate tasks and abstractions to make it more flexible and easier to combine different methods in the same pipeline.
You will often use other people's pipelines,
which takes away a lot of the setup bother and gives you a lot of flexibility.
At the same time, note that how each of the following tasks is done can vary a lot. A bunch of things are now done via machine learning (often neural-net style) -- but others are still very much classical rule-based systems.
There is also assistance for updating those machine-learned components with your own training.
Depending on the pipeline used, it handles a subset of:
- tokenization
- lemmatization
- Part-of-speech tagging
- token dependencies
- sentence splitting
- usually done by the dependency parser, though there are simpler options[1]
- named entity extraction (note that the dependencies also tend to include MWEs)
- and Entity Linking
- similarity - of words, spans, and documents
Trained stuff like
- better-trained any-of-the-above(-that-is-not-actually-implemented-rule-based-in-the-model-you-use)
- text classification
On speed
It looks like runtime is fairly linear with the amount of words (at least up to a few thousand - I've not benchmarked further yet).
It varies with the model, but GPU models on GPU and CPU models on CPU may do a few thousand words per second.
This sounds fast for single use, but for services it is slow - most documents will take multiple seconds,
and ten people feeding a document into a service at the same time may each have to wait half a minute for an answer.
While on CPU you could fire off parallel processes for multiple cores, on GPU you probably want to explicitly not do that, meaning you have a blocking queue.
GPU models on CPU you probably do not want - they can do 100 words per second or so, which is barely enough even for a single example app.
A single analysis will turn whatever size of text into a single Document,
but at some document size you will start running into issues due to the amount of memory that requires,
so there is a good argument for splitting things into bite-sized sections like paragraphs.
Also, in any sort of live-ish context there is an additional argument for that: you could show the results a chunk at a time. It won't be faster, but it will feel more interactive.
At the same time...
Overhead and parallelism
When dealing with a large set of texts, consider the difference between
for text in texts: nlp(text)
and
nlp.pipe( texts )
Functionally these are very similar (both yield a Doc for each given string), but pipe() avoids a little per-job overhead so is faster, even though it is still entirely sequential.
For small amounts of text you probably won't see any difference, but for a large amount of small text fragments you will(verify).
You can also do:
nlp.pipe(texts, n_process=3)
- parallel - but note there is further overhead to doing this, so it is usually only worth it for larger jobs (see the sketch below)
- n_process defaults to 1. (-1 tries to detect the number of cores)
- this is multiprocessing style, which has significant overhead (more so on Windows, which doesn't have fork()) so is only worth it if you have a lot of work
- batch_size defaults to 1000 (so with fewer documents than that you may not see much difference, unless you also lower the batch size)
- in fact, various people report worse speed until they think about these parameters
- presumably a multiprocessing thing, so only for CPU models?(verify) Not recommended on GPU because of RAM concerns?(verify)
You may also want to avoid multiprocessing, due to an issue in torch.
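A rough, minimal sketch of that difference - the model name, the workload, and the parameter values here are just placeholders to time on your own setup:
import time
import spacy

nlp = spacy.load("en_core_web_sm")
texts = ["This is a short fragment of text."] * 2000         # made-up workload

start = time.time()
docs = [ nlp(t) for t in texts ]                              # one call per text: pays the per-job overhead each time
print( 'nlp() per text:    %.1fs' % (time.time() - start) )

start = time.time()
docs = list( nlp.pipe(texts) )                                # same results, but batched internally
print( 'nlp.pipe():        %.1fs' % (time.time() - start) )

start = time.time()
docs = list( nlp.pipe(texts, n_process=2, batch_size=200) )   # multiprocessing; only pays off on larger jobs
print( 'pipe, 2 processes: %.1fs' % (time.time() - start) )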
Install
Install spacy and a model of your choice.
The shortest path to something that functions would be something like:
pip install -U spacy # CPU based install
python -m spacy download en_core_web_sm # small english model
In real applications you would think a little harder about this.
Take a look at something like this, which points out that you want to:
- Decide whether to use (install details will vary a little)
- GPU (faster, some more install worries, some more memory worries)
- CPU (slower, but should always work)
- decide whether to install in a virtualenv / via conda (may be less pain on shared machines where you have no admin)
- Decide whether to work (install details will vary a little)
- from the CLI (probably meaning system install),
- something like docker (if you can't get admin rights, or like to compare installs)
- jupyter (easy way to share experiments)
- google colab (avoiding any local install, gives you basic GPU)
- Fetch models you need
- like python -m spacy download model_name
- ...or spacy.cli.download("model_name") which basically just runs that same pip command. This style arguably makes more sense in notebooks.
Some introduction by example
Assuming you have a basic install, start a prompt or notebook and do
import spacy
nlp = spacy.load("en_core_web_sm") # a model you installed.
print( nlp.pipe_names ) # to see what we're doing with it,
# prints ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
# and note you can get a little more via e.g. nlp.get_pipe('tok2vec').__doc__
doc = nlp("I like Blue Bananas")
That nlp object will be of type spacy.lang.en.English (inherits from spacy.language.Language)
doc objects are of type spacy.tokens.doc.Doc.
Components will attach most of their annotation either to specific tokens, or to the overall Doc.
You can get a summary of what attributes a particular model sets via nlp.analyze_pipes(),
but to understand what those things actually are, examples are nice:
tokenizer
tokenizer[2] does tokenisation (not optional)
- stored in: the Doc object itself, which enumerates to its Token objects (spacy.tokens.token.Token) (and tokens repr as their text content)
list(doc)
[I, like, blue, bananas]
There are a number of convenience properties and tools, like
- is_stop, is_oov, is_bracket, is_quote, is_punct - see also token attributes
- norm_, normalization mostly used to expand contractions (e.g. don't becomes two tokens with .text values do n't and .norm_ values do not) and some common abbreviations and titles [3][4][5]
Whitespace is annotated onto tokens (spacy likes to be able to construct the original exactly), but hidden from typical representation.
- .whitespace_ is any trailing whitespace
- .text_with_ws is basically token.text + token.whitespace_ (mostly so that you can easily print the original text as it was - you may often not care. Also note that you may still get space-only tokens regardless)
- ...and more
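For example, a quick sketch of a few of these attributes (exact norms and lemmas depend on the model loaded above):
doc = nlp("Don't eat the (blue) bananas, Dr. Smith!")
for tok in doc:
    print( '%-8r norm=%-6r lemma=%-7r is_punct=%-5s is_stop=%-5s ws=%r' % (
        tok.text, tok.norm_, tok.lemma_, tok.is_punct, tok.is_stop, tok.whitespace_) )

# the whitespace bookkeeping lets you reconstruct the original text exactly:
print( ''.join(tok.text_with_ws for tok in doc) == doc.text )   # True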
lemmatizer
lemmatizer[6] adds lemmatization (apparently often rule-based, not trained?)
- stored in: Token.lemma (id), Token.lemma_ (text)
list(tok.lemma_ for tok in doc)
['I', 'like', 'blue', 'banana']
tagger
tagger[7] adds POS tagging
- stored in: Token.tag, Token.tag_ (fine-grained) and Token.pos, Token.pos_ (coarse)
list(tok.tag_ for tok in doc) == ['PRP', 'VBP', 'JJ', 'NNS']
for tok in doc:
print( f'{tok.text}/{tok.pos_}', end=' ' )
I/PRON like/VERB Blue/PROPN Bananas/PROPN
the actual tagsets being used will be determined by the model
- ...due to the way they are trained, they tend to be whatever a popular treebank/parser for your language uses.
- you can get a list of the labels used in components from Language object's .pipe_labels[9], or .meta['labels'] [10]
- and an explanation via spacy.explain(labelstr)
- there seems to be a focus on following universaldependencies.org in general, but I've seen a bunch of little deviations
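For example, a small sketch of inspecting a loaded model's tag labels and what they mean (exact labels depend on the model):
tagger = nlp.get_pipe('tagger')
print( tagger.labels )                 # e.g. ('CC', 'CD', 'DT', ...) for the English models
print( spacy.explain('VBP') )          # 'verb, non-3rd person singular present'
print( nlp.pipe_labels['tagger'][:5] ) # same information via the Language object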
parser
parser adds
- dependency parsing between tokens [11]
- It is also the thing that does sentence splitting
- stored in:
- relations: Token.head, Token.dep, Token.dep_ (head points to the source of that relation)
- sentences: Doc.sents, which are Span objects (usually but not always done by parser)
- noun chunks: Doc.noun_chunks (needs tagger and parser)
doc = nlp( "John, I gave the bean to Alice's sister" )
for tok in doc:
print( '%10s <--%8s-- %s '%(tok.text, tok.dep_, tok.head) )
John <--npadvmod-- gave
, <-- punct-- gave
I <-- nsubj-- gave
gave <-- ROOT-- gave
the <-- det-- bean
bean <-- dobj-- gave
to <-- dative-- gave
Alice <-- poss-- sister
's <-- case-- Alice
sister <-- pobj-- to
Also note that spacy.displacy.render(doc) exists to draw that.
noun_chunks are Span objects that represent flat phrases.
Such a chunk's .root attribute refers to the phrase's head Token - and note that its .dep is occasionally interesting (e.g. the noun chunk containing the subject token is probably the subject as a whole phrase).
for nc in doc.noun_chunks:
print('NC %15s head=%-15s head.dep_=%s'%(nc.text, nc.root.text, nc.root.dep_))
NC I head=I head.dep_=nsubj
NC the bean head=bean head.dep_=dobj
NC Alice's sister head=sister head.dep_=pobj
for sent in nlp("I like Blue Bananas. I really do.").sents:
print( 'SENT', sent )
SENT I like Blue Bananas.
SENT I really do.
NER
ner[12] adds named entity recognition
- stored in: Doc.ents (a tuple of Spans), Token.ent_*
- note that MWEs are often also referenced in certain dependencies, and that noun chunks are also a thing (and note that both of those come from the parser)
Hint: each ent has a .label_, which may be useful to do things like:
people = list(ent for ent in doc.ents if ent.label_=='PERSON' ) # list only the mentioned people
Like noun chunks, it may be interesting to inspect an entity's .root:
for ent in doc.ents:
print('ENT %15s label=%-10s head=%-15s head.dep_=%s'%(ent.text, ent.label_, ent.root.text, ent.root.dep_))
ENT Blue Bananas label=PRODUCT head=Bananas head.dep_=dobj
For extraction, doc.ents is often easier. For more detailed work you may care to understand how this is (also) annotated on tokens - each Token also gets
- ent_iob, ent_iob_ - marks where entities are (see also IOB)
- 3/'B' marks the beginning of an entity
- 1/'I' marks 'inside'
- 2/'O' marks 'outside' (not inside entity)
- 0/ means not set
- ent_id, ent_id_
- ent_type, ent_type_
- this is the label; you can spacy.explain( tok.ent_type_ ), and the set of labels used comes with the model description
- (only set where the IOB value is B or I)
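A small sketch of that token-level view (whether a given span is recognized as an entity at all depends on the model):
doc = nlp("I like Blue Bananas from New York")
for tok in doc:
    print( '%-8s ent_iob_=%-2s ent_type_=%s' % (tok.text, tok.ent_iob_, tok.ent_type_) )
# e.g. 'New' would get B and 'York' I (both with type GPE); everything outside entities gets O and an empty type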
token attributes
As some summary of the above, and mentioning some we haven't, the attributes on Token include the following (but are not limited to it - chances are you will never use half of them):
- .text
- (there's .orth_ yet it seems to be an implementation detail, and practically equivalent to .text)
- the ability to fetch the original whitespace
- this example uses .text which ignores whitespace
- while .text_with_ws is basically .text + .whitespace_
- sorted, somewhat arbitrarily, from things calculated fairly directly from the text to things more specific to the model and/or the parse:
- .lemma: lemmatized form of a token (from a lemmatizer)
- .norm: normalized form of a token (seems to mostly resolve contractions, and otherwise seems largely the same as lemmatizer output)
- .shape: Alphabetic characters become X or x (following capitalisation), numeric by d, and sequences of the same are truncated after 4, so e.g. Katherine80 would become Xxxxxdd
- specifics like .is_sent_start, .is_bracket, .is_quote, .is_left_punct, .is_punct, .is_upper; .like_url, .like_email, and a few more
- .pos_: coarse tagging (often following wider conventions), e.g. NOUN
- .tag: finer tagging (more easily model/language specific, e.g. 'N|soort|ev|basis|zijd|stan')
- .morph: morphological properties, something like 'Gender=Com|Number=Sing'
- .dep_: dependency relation in the parse - we'll get to this later
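A quick sketch printing a few of these per token (output varies per model):
doc = nlp("Katherine80 didn't visit https://example.com, sadly.")
for tok in doc:
    print( '%-12r shape=%-8s pos=%-6s tag=%-5s morph=%-25s like_url=%s' % (
        tok.text, tok.shape_, tok.pos_, tok.tag_, str(tok.morph), tok.like_url) )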
Potential confusion around...
Things that vary more with models and context than you may think
vectors, tensors, tensors, and vectors
Not guaranteed to be correct because I am also still confused
This one's a little confusing because the simple explanation of "sometimes it's just static, sometimes it's dynamic"
does absolutely nothing to explain the steps,
and possible function overrides,
that spacy is actually doing.
Worse, most introductory statements describe just the default behaviour, but do not explain that most of them can be overridden -- so may be wrong for other models.
By default there are broadly two steps:
- adding from static vectors
- adding contextual stuff (considered tensors)
But:
- both parts are optional. Yeah, both.
- transformers work differently
So:
- token.vector - is an n-dimensional vector of numbers, one per token, that is
- static vectors in that they are always the same for the same word -- set in stone when the model is made.
- except where overridden
- except in models that don't have static vectors
- such as en_core_web_sm, because they want it to be tiny and skip semantic vectors entirely
- such as transformers like en_core_web_trf, because they work differently
- "You can use hooks to customize how [.vectors] is calculated though, so a pipeline could also be providing something else there"[13]
- the meaning and scale of these numbers is up to the model. Do not assume you can compare vectors with vectors from another model.
- ...but for various of the basic English models this seems to be the GloVe vector as-is
- token.tensor -
- "context-sensitive tensors" are computed while analyzing the text, so can use context within the specific text you gave it
- if that sounds like a really fancy machine learning component, it certainly can be -- but it can also be worse than static vectors:
- if the model also involves static vectors, it will be combining that with .vector (verify), e.g. "if this looks like a verb then we alter the numbers to reflect that"
- but it may not. For example, for _sm this may be based entirely on token features like NORM, PREFIX, SUFFIX, etc. -- and note it then does not include static vectors, i.e. nothing semantic -- this is why there are warnings to not use _sm models for much more than quick tests (md / lg, while they use those token features, also significantly have the static word vectors(verify))
Also:
- vectors of larger units (span.vector and doc.vector) are calculated from the token.vector of the tokens they contain (verify)
- which implies ignoring tensors if you have them(verify)
- .similarity()
- works from .vectors
- see spacy/tokens/doc.pyx - it's basically just dot(vector, other.vector) / (self.vector_norm * other.vector_norm)
- except where overridden (by there being a 'similarity' in doc.user_hooks), e.g. in en_core_web_trf (verify)
- which implies ignoring tensors if you have them(verify) -- unless the model overrides this function specifically
- there is no built-in functionality that works from the best we have, e.g. tensors if there, vectors if not, complain if neither
- that's not necessarily easy to write, even, because the tensors may be an interesting model-specific thing
- assume transformers
- do not give vectors (and their doc.has_vector is probably False)
- ...because they store them elsewhere (e.g. doc._.trf_data.tensor), and override similarity() to fetch them from there
- so these are not the same as .tensor
- tok2vec (or similar) may be used internally
- this is not directly exposed -- it tends to just be an intermediate part between two parts of a pipeline
- It seems that tok2vec is spacy's name for "whatever ends up giving tokens vectors".
.tensor comes from a machine learning component, the tok2vec step,
- except in models that do things differently, e.g. en_core_web_sm
- and except in transformers, e.g. in en_core_web_trf
- stored in doc.tensor (one axis is the tokens)
- as the tok2vec component would report (e.g. english_lg.get_pipe('tok2vec').__doc__), tok2vec will "Apply a "token-to-vector" model and set its outputs in the doc.tensor attribute. This is mostly useful to share a single (sub)network between multiple components, e.g. to have one embedding and CNN network shared between a parser, tagger and NER."
- Tok2vec may use static vectors, but also other things (and in en_core_web_sm, only other things)
- this is also why .tensors need not have the same dimensions as .vectors[14]
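A rough sketch of poking at these attributes, assuming en_core_web_md is installed; the comments describe what you would typically expect there, not guarantees for other models:
import spacy
nlp_md = spacy.load("en_core_web_md")
doc = nlp_md("I like bananas")

print( doc.has_vector )        # True for md/lg, which ship static vectors
print( doc[2].vector.shape )   # e.g. (300,) - the static word vector for 'bananas'
print( doc.vector.shape )      # doc vector, averaged from its tokens' vectors
print( doc.tensor.shape )      # context-sensitive; one row per token, width set by the model's tok2vec
print( doc.similarity( nlp_md("I like apples") ) )   # cosine-style similarity based on .vector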
Example: en_core_web_sm
You'd think this would be the simplest case, but it's not.
This has a small vocabulary (84k).
To keep it small, it has no vectors ("doesn't ship with word vectors and only use context-sensitive tensors"[15]). Fine, makes sense.
Yet .vector is filled - what gives? It looks like it's doing context-dependent stuff, because even the same word gets different vectors. (AFAICT this is counter to spacy's documentation, and presumably they're doing it so that .similarity() works without overriding that function (which would seem the more proper solution?))
And .vector has the same values as .tensor. Which would be even weirder - if it weren't a hint at what's happening.
So, the .tensor attribute seems to be there for the sake of other components (like the tagger(verify)),
and as such seems to be based on some orthographic patterns. Alright, makes enough sense to me.
But it seems they then decided to put those tensors into .vector, to support some basic similarity(). It's a poor metric for similarity (there are no semantic statistics in it at all, and the devs indeed mention it is not robust[16]), but the devs seem to figure it's better than absolutely nothing.
spacy and transformers
tok2vec and transformers
Similarity
Slightly more technical
Visualisation
Matchers
Spacy provides some rule-based matching.
An instantiated Matcher does token-based matching that you build up from token descriptions
- docs intro to Matcher
- those token descriptions come in the form of a list of dicts; each such dict
- has a main key that is typically a token attribute
- for which the value is either
- a string to match
- an extended dict, which can use
- list logic with IN, NOT_IN, IS_SUBSET, IS_SUPERSET, INTERSECTS (and note that MORPH attrib is seen as a list(verify))
- compare int/float with ==, >=, <=, >, <
- REGEX [17]
- FUZZY [18] (edit distance based; since 3.5)
- an optional key OP that is a (mostly regex-like) quantifier, one of ?, +, *, {n}, {n,m}, {n,}, {,m}, and !
- The things you can use in the key, to match on, include ([see docs](https://spacy.io/usage/rule-based-matching#adding-patterns-attributes)) -- most of these mirror token attributes we've mentioned before.
- token text, derived forms, things assigned by the model/parse, and some existing patterns
- ORTH or TEXT - the text as-is
- LENGTH - length of TEXT
- LOWER - the lowercase version of the text
- NORM - the normalized version (seems to do things like resolve contractions, and otherwise often be the lemmatizer output?)
- LEMMA - lemmatized form
- SHAPE - alphabetic characters become X or x, numeric by d, and sequences of the same are truncated after 4, so e.g. Katherine80 would become Xxxxxdd
- POS - coarse tagging (often following wider conventions), e.g. NOUN
- TAG - finer tagging (more easily model/language specific), e.g. 'N|soort|ev|basis|zijd|stan'
- MORPH - morphological properties, something like Gender=Com|Number=Sing
- DEP - dependency relation in the parse
- LIKE_NUM
- LIKE_URL
- LIKE_EMAIL
- IS_SENT_START
- IS_ALPHA, IS_ASCII, IS_DIGIT
- IS_LOWER, IS_UPPER, IS_TITLE
- IS_PUNCT, IS_SPACE, IS_STOP
- _ for values in custom attributes
- If you want to see more interactively how these work, try https://demos.explosion.ai/matcher
For example:
{"LOWER": "hello"},
{"TEXT": {"REGEX": "^[Uu]nited$"}},
{"TAG": {"REGEX": "^V"}}
{"LENGTH": {">=": 10}}
{"IS_PUNCT": True, "OP": "?"}
{"MORPH": {"IS_SUBSET": ["Number=Sing", "Gender=Neut"]}}
# access custom attribute
{"_": {"country": {"REGEX": "^[Uu](nited|\\.?) ?[Ss](tates|\\.?)$"}}}
PhraseMatcher is a token-base matcher that you build up from a Doc
- you would often feed text in via nlp.make_doc(), which only runs that model's tokenizer, because that's the only part that would be used anyway
- so this saves time over doing a fuller parse (nlp())
- seems intended to make it easier to enter large lists of multi-token phrases - of exact matches (but for the attr detail)
matcher = PhraseMatcher(nlp.vocab, attr="LOWER") # token attribute to use
patterns = [nlp.make_doc(name) for name in ["John Smith", "Mary Sue"]]
matcher.add("Names", patterns)
I'm not sure how different it is from building a basic Matcher with something like:
from spacy.matcher import Matcher

phraselist = ['I like cheese', 'hungry like the wolf']
attr = 'LOWER'
patterns = []
for phrase in phraselist:
    patterns.append( list({attr: word.lower()} for word in phrase.split()) )   # one token dict per word
matcher = Matcher(nlp.vocab)
matcher.add('Phrases', patterns)
- DependencyMatcher[19]
- matches within the dependency structure
- and e.g. understands semgrex
Calling matchers returns matches as (match_id, start_token_offset, end_token_offset) tuples, for you to do with as you decide.
You can add callbacks, a different one per add() - see the sketch below.
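A sketch of consuming those tuples, and of attaching a callback to one rule (the callback signature is (matcher, doc, i, matches)); the rule name and pattern here are made up:
from spacy.matcher import Matcher

def on_match(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    print( 'callback saw:', doc[start:end].text )

matcher = Matcher(nlp.vocab)
matcher.add("CHEESE", [ [ {"LOWER": "cheese"} ] ], on_match=on_match)

doc = nlp("I like cheese.")
for match_id, start, end in matcher(doc):
    print( nlp.vocab.strings[match_id], start, end, doc[start:end].text )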
Rulers
Rulers are pipeline components based on matchers, and as such are usually added via add_pipe(), which also means you can specify these from configuration.
Entity Ruler
For example, adding an entity_ruler component means it will instantiate an EntityRuler which will annotate on doc.ents
(It is also specifically designed to augment the existing ner (EntityRecognizer),
in the sense that both these components will respect earlier annotation of that sort and not create overlapping entities,
because spacy doesn't allow that for NER - look to span_ruler if you need overlapping spans.)
ruler = nlp.add_pipe("entity_ruler")
patterns = [ {"label": "ORG", "pattern": "Apple"}, ]
ruler.add_patterns(patterns)
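Continuing that example, the ruler's matches then show up in doc.ents like any other entity (exactly what else is in there depends on whether a statistical ner also runs):
doc = nlp("Apple is looking at buying a startup.")
print( [ (ent.text, ent.label_) for ent in doc.ents ] )   # e.g. [('Apple', 'ORG'), ...]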
SpanRuler
Adding a span_ruler component means it will instantiate a SpanRuler
- uses the same pattern format as EntityRuler
- but its matches are annotated in doc.spans, which are allowed to overlap
- might be combined with SpanCategorizer[20]
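A minimal span_ruler sketch (the label and pattern are made up; by default the spans end up under the "ruler" key):
ruler = nlp.add_pipe("span_ruler")
ruler.add_patterns([ {"label": "FRUIT", "pattern": "blue bananas"} ])

doc = nlp("I like blue bananas")
print( [ (span.text, span.label_) for span in doc.spans["ruler"] ] )   # [('blue bananas', 'FRUIT')]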
Serializability
Pipelines and config
A pipeline is a sequence of existing components.
Components are as separated as they can be, to minimize coupling, allow combination, and allow sharing of some data.
A pipeline usually won't break if you change the order or remove items.
Much of that can be overstated, though. The tokenizer isn't optional at all, and a lot of things do rely on annotation or data from a previous item, not least because various smarter parts depend on tok2vec (context-sensitive embeddings).
And if you add training to a model, it's no longer just config. It's also the data.
Config stuff
In particular models are represented by a configuration.
As long as you are perfectly happy doing a spacy.load on a model someone else made, and not training, you can pretend the config stuff doesn't exist.
But once you start changing things, you need to understand how the config stuff works, and probably also why spacy also puts some focus on this to start with.
Spacy tries to make it so that anything you could instantiate from code with keywords can also be specified in a configuration dict / file (leans on factories).
And it's not just numbers you hand in - it's also instantiated classes that need to be passed along to others.
To do all that, they made a powerful config system.
The config system looks awkward and opaque and like a magical DSL at first, but once you understand what it is trying to do (keep the whole thing programmable, keep it easily serialized, have some type checking, avoid having to write a lot of boilerplate in code, and potentially even keep it readable) and the amount of instantiation and indirection that needs to go on for that, it is actually fairly elegant at that.
If you look at a model, its config is a collection of all parameters.
To make it a little less scary, inspect a model's config file - it's large, but:
- it's a file called config.cfg in the model directory
- the format is INI/ConfigParser-like, but extended with things like variable interpolation; the reader also understands references to functions and such(verify), understands 'call this function to get your actual parameters', and some other additions that make this a more composable thing[21]
- As data it's a nested dict (e.g. english_md.config); in the file, that nesting just means dot-separated header names.
At any time, only part of that is used.
- At runtime it may look at the [nlp] and [components.*] blocks,
- while training it will look at the [initialize] and [training] blocks[22], which also helps reproducibility (ideally the actual training command doesn't need to set any parameters).
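A small sketch of poking at a loaded model's config (en_core_web_sm assumed here; the exact keys vary per model):
import spacy
nlp = spacy.load("en_core_web_sm")

print( nlp.config["nlp"]["pipeline"] )                # the component order
print( list(nlp.config["components"].keys()) )        # one block per component
print( nlp.config["components"]["ner"]["factory"] )   # which factory builds that component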
A pipeline is part of the Language (nlp) object
Models often default to a set of components like
- tok2vec, tagger, parser, senter, attribute_ruler, lemmatizer, and ner
- (to inspect a loaded model, see e.g. nlp.pipeline, nlp.pipe_names, or nlp.config['nlp']['pipeline'])
Not loading, or not using
You can
- exclude components
- while loading: exclude= keyword
- after loading: remove_pipe(name)
- disable components
- while loading: disable= keyword (in v3 these are loaded but disabled)
- while loading: enable= keyword (if you know the specific set you need - the rest are still loaded but disabled(verify))
- after loading: disable_pipe(name)
- nlp.disabled lists the names of currently disabled components
Also note that select_pipes() (previously disable_pipes()) is a context manager that has enable= and disable= and may be the briefest way to disable things temporarily (example below).
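Filling in that example with a sketch (the component names are those of the English models mentioned above):
# temporarily run with only some components enabled
with nlp.select_pipes(enable=["tok2vec", "tagger", "attribute_ruler", "lemmatizer"]):
    doc = nlp("Just tagging and lemmatizing here.")
    print( [tok.lemma_ for tok in doc] )
    print( nlp.disabled )      # the components switched off inside this block

print( nlp.pipe_names )        # outside the with-block everything is back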
There's
- enable_pipe() - to re-enable a loaded-but-disabled component
- has_pipe() - to test whether a component is present, by name
- get_pipe() - to get the component object, by name
- replace_pipe()
- rename_pipe()
See also:
When augmenting or training, do you save parts or the whole thing?
Trainable components
trainable components include
- tagger
- morphologizer
- trainable_lemmatizer
- parser
- ner
- textcat
- spancat
- can do overlapping spans
General spacy training notes
Training in spacy 2 versus training in spacy 3
Training NER
textcat / TextCategorizer
In spacy v3:
- textcat - mutually exclusive labels, predicts is one of them
- textcat_multilabel - prediction is zero or more among the set of labels
(more separated now; in v2 this was implied by parameters)
For some tasks, both might work. Say, binary classification can be done with
- textcat and two labels (you care which one is selected), or
- textcat_multilabel with one label (you care whether it is selected or not)
If you have relatively distinct things to detect, then you probably want to do that in entirely distinct components.
Using TextCategorizer
A TextCategorizer takes a Doc as input, and produces a score for each potential label class.
Training TextCategorizer
To make things interesting, there were some structural changes made between spacy v2 and v3.
- We will be using v3's preferred method (which uses some command line tools),
- there is also the v2 way, which is more code-based (and currently the variant more likely seen in internet tutorials)
- and you can still do that in v3 (but with some changes that you won't find in said v2 tutorials)
Training textcat amounts to (see the sketch after this list):
- add textcat component to a pipeline
- add valid labels to it
- (train the model)
- (save the model)
- use the trained model
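A sketch of the first two of those steps in code (the labels are made up; in v3 the actual training would then typically be done with spacy train and a config):
import spacy

nlp = spacy.blank("en")
textcat = nlp.add_pipe("textcat")      # or "textcat_multilabel"
textcat.add_label("POSITIVE")
textcat.add_label("NEGATIVE")
print( nlp.pipe_names, textcat.labels )

# once trained, per-label scores show up in doc.cats, e.g. {'POSITIVE': 0.93, 'NEGATIVE': 0.07}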
spancat / SpanCategorizer
Non-core useful stuff
Language detection
Not part of spacy itself, but e.g. spacy-fastlang is a relatively thin wrapper around fasttext.
Consider:
import spacy, spacy_fastlang
nlp = spacy.blank("xx")
nlp.add_pipe( "language_detector")
for example in ("I am cheese", "Ik ben een kaas", "Je suis un fromage", "Ich bin Käse","Ek is kaas", "saya keju", "Olen juusto"):
doc = nlp(example)
print( "%s (certainty: %.2f) for %r"%( doc._.language, doc._.language_score, example) )
en (certainty: 0.73) for 'I am cheese'
nl (certainty: 0.53) for 'Ik ben kaas'
fr (certainty: 1.00) for 'Je suis fromage'
de (certainty: 1.00) for 'Ich bin Käse'
af (certainty: 0.66) for 'Ek is kaas'
id (certainty: 0.63) for 'saya keju'
fi (certainty: 0.91) for 'Olen juusto'
More on sentences
The thing splitting into sentences can work in one of a few ways, including:
- a classical rule-based Sentencizer
- which makes some classic mistakes
- sentence boundaries set by the dependency parser
- in fact done as part of the dependency parse itself
- tends to make fewer mistakes
- a trained senter (SentenceRecognizer) component that only does sentence splitting
If you mostly use premade models, you get whatever it uses.
If that happens to be one of the simpler variants that makes rather dumb mistakes, you might want to change that.
If you want to separate that job out, do just sentence splitting, or avoid relying on a specific large model,
consider something like xx_sent_ud_sm, a 4MB multi-language senter-only model.
import spacy
nlp = spacy.load("xx_sent_ud_sm")
txt = """C'est n'est pas une pipe. A capital after Mr. Abbreviation might throw things off. (Also) weird sentence starts could.
As can ellipses... as you can imagine. As could "Things we quote?". As they are embedded sentences and what comes after is unknown."""
for sent in nlp(txt).sents:
print(sent)
C'est n'est pas une pipe.
A capital after Mr. Abbreviation might throw things off.
(Also) weird sentence starts could.
As can ellipses... as you can imagine.
As could "Things we quote?" ...as they are embedded sentences and what comes after is unknown.
(and that last one would be different with a comma involved)
Technical practicalities
On speed (and GPU)
On memory
Warnings and errors
UserWarning: [W095] Model 'model_name' (3.4.0) was trained with spaCy v3.4 and may not be 100% compatible with the current version (3.5.0).
(or other versions)
For downloaded models, the mentioned command
python -m spacy validate
will give the commands to fetch updates for those.
GPU is not accessible. Was the library installed correctly?
If you installed before realizing you need to think about GPU dependencies, then you probably copy-pasted something like:
pip install spacy
...when you now want it to pull in the right (python) libraries for the CUDA version you have (it seems that these days you can use cuda-autodetect - though that won't always do the right thing if you have multiple versions installed)
- figure out the CUDA version you have. On linux, e.g. nvidia-smi will tell you
- run something like the following (this one is for CUDA 11.0)
pip install -U spacy[cuda110]
or, these days, the autodetecting variant:
pip install -U spacy[cuda-autodetect]
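A quick sketch to check whether spacy actually sees the GPU after that:
import spacy
print( spacy.prefer_gpu() )    # True if a usable GPU was found, otherwise falls back to CPU
# spacy.require_gpu()          # same idea, but raises if there is no usable GPU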
See also pages like https://spacy.io/usage