Spacy notes
What does it do?
SpaCy is a library focusing on fairly recent methods of analysing text.
It tries to be unopinionated about how things should be done,
and tries to separate tasks and abstractions to make it more flexible and easier to combine different methods in the same pipeline.
You will often use other people's pipelines,
which takes away a lot of the setup bother and gives you a lot of flexibility.
At the same time, note that how each of the following tasks is done can vary a lot. A bunch of things are now done via machine learning (often neural-net style) -- but others are still very much classical rule-based systems.
There is also assistance for updating those machine-learned components with your own training.
Depending on the pipeline used, it handles a subset of:
- tokenization
- lemmatization
- Part-of-speech tagging
- token dependencies
- sentence splitting
- usually done by the dependency parser, though there are simpler options[1]
- named entity extraction (note that the dependencies also tend to include MWEs)
- and Entity Linking
- similarity - of words, spans, and documents
Trained stuff like
- better-trained any-of-the-above(-that-is-not-actually-implemented-rule-based-in-the-model-you-use)
- text classification
On speed
It looks like runtime is fairly linear with the amount of words (at least up to a few thousand - I've not benchmarked further yet).
It varies with the model, but GPU models on GPU and CPU models on CPU may do a few thousand words per second.
This sounds fast for single use, but for services it is slow - most documents will take multiple seconds,
and ten people feeding a document into a service at the same time may each have to wait half a minute for an answer.
While on CPU you could fire off parallel processes for multiple cores, on GPU you probably want to explicitly not do that, meaning you have a blocking queue.
GPU models on CPU you probably do not want - they can do 100 words per second or so, which is barely enough even for a single example app.
A single analysis will turn whatever size of text into a single Document,
but at some document size you will start running into issues due to the amount of memory that requires,
so there is a good argument for splitting things into bite-sized sections like paragraphs.
Also, in any sort of live-ish context there is an additional argument for that: you could show the results a chunk at a time. It won't be faster, but it will feel more interactive.
At the same time...
Overhead and parallelism
When dealing with a large set of texts, consider the difference between
for text in texts: nlp(text)
and
nlp.pipe( texts )
Functionally these are very similar (both yield a Doc for each given string), but pipe() avoids a little per-job overhead so is faster, even though it is still entirely sequential.
For small amounts of text you probably won't see any difference, but for a large amount of small text fragments you will(verify).
You can also do:
nlp.pipe(texts, n_process=3)
- parallel - but note there is further overhead to doing this, so it is usually only worth it for larger jobs (see the sketch below)
- n_process defaults to 1. (-1 tries to detect the number of cores)
- this is multiprocessing style, which has significant overhead (more so on Windows, which doesn't have fork()) so is only worth it if you have a lot of work
- batch_size defaults to 1000 (so with fewer documents than that you may not see much difference, unless you also lower the batch size)
- in fact, various people report worse speed until they think about these parameters
- presumably a multiprocessing thing, so only for CPU models?(verify) Not recommended on GPU because of RAM concerns?(verify)
You may also want to avoid multiprocessing, due to an issue in torch.
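A rough, minimal sketch of that difference - the model name, the workload, and the parameter values here are just placeholders to time on your own setup:
import time
import spacy

nlp = spacy.load("en_core_web_sm")
texts = ["This is a short fragment of text."] * 2000         # made-up workload

start = time.time()
docs = [ nlp(t) for t in texts ]                              # one call per text: pays the per-job overhead each time
print( 'nlp() per text:    %.1fs' % (time.time() - start) )

start = time.time()
docs = list( nlp.pipe(texts) )                                # same results, but batched internally
print( 'nlp.pipe():        %.1fs' % (time.time() - start) )

start = time.time()
docs = list( nlp.pipe(texts, n_process=2, batch_size=200) )   # multiprocessing; only pays off on larger jobs
print( 'pipe, 2 processes: %.1fs' % (time.time() - start) )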
Install
Install spacy and a model of your choice.
The shortest path to something that functions would be something like:
pip install -U spacy # CPU based install
python -m spacy download en_core_web_sm # small english model
In real applications you would think a little harder about this.
Take a look at something like this, which points out that you want to:
- Decide whether to use (install details will vary a little)
- GPU (faster, some more install worries, some more memory worries)
- CPU (slower, but should always work)
- decide whether to install in a virtualenv / via conda (may be less pain on shared machines where you have no admin)
- Decide whether to work (install details will vary a little)
- from the CLI (probably meaning system install),
- something like docker (if you can't get admin rights, or like to compare installs)
- jupyter (easy way to share experiments)
- google colab (avoiding any local install, gives you basic GPU)
- Fetch models you need
- like python -m spacy download model_name
- ...or spacy.cli.download("model_name") which basically just runs that same pip command. This style arguably makes more sense in notebooks.
Some introduction by example
Assuming you have a basic install, start a prompt or notebook and do
import spacy
nlp = spacy.load("en_core_web_sm") # a model you installed.
print( nlp.pipe_names ) # to see what we're doing with it,
# prints ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
# and note you can get a little more via e.g. nlp.get_pipe('tok2vec').__doc__
doc = nlp("I like Blue Bananas")
That nlp object will be of type spacy.lang.en.English (inherits from spacy.language.Language)
doc objects are of type spacy.tokens.doc.Doc.
Components will attach most of their annotation either to specific tokens, or to the overall Doc.
You can get a summary of what attributes a particular model sets via nlp.analyze_pipes(),
but to understand what those things actually are, examples are nice:
tokenizer
tokenizer[2] does tokenisation (not optional)
- stored in: the Doc object itself, which enumerates to its Token objects (spacy.tokens.token.Token) (and tokens repr as their text content)
list(doc)
[I, like, blue, bananas]
There are a number of convenience properties and tools, like
- is_stop, is_oov, is_bracket, is_quote, is_punct - see also token attributes
- norm_, normalization mostly used to expand contractions (e.g. don't becomes two tokens with .text values do n't and .norm_ values do not) and some common abbreviations and titles [3][4][5]
Whitespace is annotated onto tokens (spacy likes to be able to construct the original exactly), but hidden from typical representation.
- .whitespace_ is any trailing whitespace
- .text_with_ws is basically token.text + token.whitespace_ (mostly so that you can easily print the original text as it was - you may often not care. Also note that you may still get space-only tokens regardless)
- ...and more
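For example, a quick sketch of a few of these attributes (exact norms and lemmas depend on the model loaded above):
doc = nlp("Don't eat the (blue) bananas, Dr. Smith!")
for tok in doc:
    print( '%-8r norm=%-6r lemma=%-7r is_punct=%-5s is_stop=%-5s ws=%r' % (
        tok.text, tok.norm_, tok.lemma_, tok.is_punct, tok.is_stop, tok.whitespace_) )

# the whitespace bookkeeping lets you reconstruct the original text exactly:
print( ''.join(tok.text_with_ws for tok in doc) == doc.text )   # True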
lemmatizer
lemmatizer[6] adds lemmatization (apparently often rule-based, not trained?)
- stored in: Token.lemma (id), Token.lemma_ (text)
list(tok.lemma_ for tok in doc)
['I', 'like', 'blue', 'banana']
tagger
tagger[7] adds POS tagging
- stored in: Token.tag, Token.tag_ (fine-grained) and Token.pos, Token.pos_ (coarse)
list(tok.tag_ for tok in doc) == ['PRP', 'VBP', 'JJ', 'NNS']
for tok in doc:
print( f'{tok.text}/{tok.pos_}', end=' ' )
I/PRON like/VERB Blue/PROPN Bananas/PROPN
the actual tagsets being used will be determined by the model
- ...due to the way they are trained, they tend to be whatever a popular treebank/parser for your language uses.
- you can get a list of the labels used in components from Language object's .pipe_labels[9], or .meta['labels'] [10]
- and an explanation via spacy.explain(labelstr)
- there seems to be a focus on following universaldependencies.org in general, but I've seen a bunch of little deviations
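For example, a small sketch of inspecting a loaded model's tag labels and what they mean (exact labels depend on the model):
tagger = nlp.get_pipe('tagger')
print( tagger.labels )                 # e.g. ('CC', 'CD', 'DT', ...) for the English models
print( spacy.explain('VBP') )          # 'verb, non-3rd person singular present'
print( nlp.pipe_labels['tagger'][:5] ) # same information via the Language object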
parser
parser adds
- dependency parsing between tokens [11]
- It is also the thing that does sentence splitting
- stored in:
- relations: Token.head, Token.dep, Token.dep_ (head points to the source of that relation)
- sentences: Doc.sents, which are Span objects (usually but not always done by parser)
- noun chunks: Doc.noun_chunks (needs tagger and parser)
doc = nlp( "John, I gave the bean to Alice's sister" )
for tok in doc:
print( '%10s <--%8s-- %s '%(tok.text, tok.dep_, tok.head) )
John <--npadvmod-- gave
, <-- punct-- gave
I <-- nsubj-- gave
gave <-- ROOT-- gave
the <-- det-- bean
bean <-- dobj-- gave
to <-- dative-- gave
Alice <-- poss-- sister
's <-- case-- Alice
sister <-- pobj-- to
Also note that spacy.displacy.render(doc) exists to draw that.
noun_chunks are Span objects that represent flat phrases.
Such a chunk's .root attribute refers to the phrase's head Token - and note that its .dep is occasionally interesting (e.g. the noun chunk containing the subject token is probably the subject as a whole phrase).
for nc in doc.noun_chunks:
print('NC %15s head=%-15s head.dep_=%s'%(nc.text, nc.root.text, nc.root.dep_))
NC I head=I head.dep_=nsubj
NC the bean head=bean head.dep_=dobj
NC Alice's sister head=sister head.dep_=pobj
for sent in nlp("I like Blue Bananas. I really do.").sents:
print( 'SENT', sent )
SENT I like Blue Bananas.
SENT I really do.
NER
ner[12] adds named entity recognition
- stored in: Doc.ents (a tuple of Spans), Token.ent_*
- note that MWEs are often also referenced in certain dependencies, and that noun chunks are also a thing (and note that both of those come from the parser)
Hint: each ent has a .label_, which may be useful to do things like:
people = list(ent for ent in doc.ents if ent.label_=='PERSON' ) # list only the mentioned people
Like noun chunks, it may be interesting to inspect an entity's .root:
for ent in doc.ents:
print('ENT %15s label=%-10s head=%-15s head.dep_=%s'%(ent.text, ent.label_, ent.root.text, ent.root.dep_))
ENT Blue Bananas label=PRODUCT head=Bananas head.dep_=dobj
For extraction, doc.ents is often easier. For more detailed work you may care to understand how this is (also) annotated on tokens - each Token also gets
- ent_iob, ent_iob_ - marks where entities are (see also IOB)
- 3/'B' marks the beginning of an entity
- 1/'I' marks 'inside'
- 2/'O' marks 'outside' (not inside entity)
- 0/ means not set
- ent_id, ent_id_
- ent_type, ent_type_
- this is the label; you can spacy.explain( tok.ent_type_ ), and the set of labels used comes with the model description
- (only set where the IOB value is B or I)
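A small sketch of that token-level view (whether a given span is recognized as an entity at all depends on the model):
doc = nlp("I like Blue Bananas from New York")
for tok in doc:
    print( '%-8s ent_iob_=%-2s ent_type_=%s' % (tok.text, tok.ent_iob_, tok.ent_type_) )
# e.g. 'New' would get B and 'York' I (both with type GPE); everything outside entities gets O and an empty type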
token attributes
As some summary of the above, and mentioning some we haven't, the attributes on Token include the following (but are not limited to it - chances are you will never use half of them):
- .text
- (there's .orth_ yet it seems to be an implementation detail, and practically equivalent to .text)
- the ability to fetch the original whitespace
- this example uses .text which ignores whitespace
- while .text_with_ws is basically .text + .whitespace_
- sorted, somewhat arbitrarily, from things calculated fairly directly from the text to things more specific to the model and/or the parse:
- .lemma: lemmatized form of a token (from a lemmatizer)
- .norm: normalized form of a token (seems to mostly resolve contractions, and otherwise seems largely the same as lemmatizer output)
- .shape: Alphabetic characters become X or x (following capitalisation), numeric by d, and sequences of the same are truncated after 4, so e.g. Katherine80 would become Xxxxxdd
- specifics like .is_sent_start, .is_bracket, .is_quote, .is_left_punct, .is_punct, .is_upper; .like_url, .like_email, and a few more
- .pos_: coarse tagging (often following wider conventions), e.g. NOUN
- .tag: finer tagging (more easily model/language specific, e.g. 'N|soort|ev|basis|zijd|stan')
- .morph: morphological properties, something like 'Gender=Com|Number=Sing'
- .dep_: dependency relation in the parse - we'll get to this later
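A quick sketch printing a few of these per token (output varies per model):
doc = nlp("Katherine80 didn't visit https://example.com, sadly.")
for tok in doc:
    print( '%-12r shape=%-8s pos=%-6s tag=%-5s morph=%-25s like_url=%s' % (
        tok.text, tok.shape_, tok.pos_, tok.tag_, str(tok.morph), tok.like_url) )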
Potential confusion around...
Things that vary more with models and context than you may think
vectors, tensors, tensors, and vectors
Not guaranteed to be correct because I am also still confused
This one's a little confusing because the simple explanation of "sometimes it's just static, sometimes it's dynamic"
does absolutely nothing to explain the steps,
and possible function overrides,
that spacy is actually doing.
Worse, most introductory statements describe just the default behaviour, but do not explain that most of them can be overridden -- so may be wrong for other models.
By default there are broadly two steps:
- adding from static vectors
- adding contextual stuff (considered tensors)
But:
- both parts are optional. Yeah, both.
- transformers work differently
So:
- token.vector - is an n-dimensional vector of numbers, one per token, that is
- static vectors in that they are always the same for the same word -- set in stone when the model is made.
- except where overridden
- except in models that don't have static vectors
- such as en_core_web_sm, because they want it to be tiny and skip semantic vectors entirely
- such as transformers like en_core_web_trf, because they work differently
- "You can use hooks to customize how [.vectors] is calculated though, so a pipeline could also be providing something else there"[13]
- the meaning and scale of these numbers is up to the model. Do not assume you can compare vectors with vectors from another model.
- ...but for various of the basic English models this seems to be the GloVe vector as-is
- token.tensor -
- "context-sensitive tensors" are computed while analyzing the text, so can use context within the specific text you gave it
- if that sounds like a really fancy machine learning component, it certainly can be -- but it can also be worse than static vectors:
- if the model also involves static vectors, it will be combining that with .vector (verify), e.g. "if this looks like a verb then we alter the numbers to reflect that"
- but it may not. For example, for _sm this may be based entirely on token features like NORM, PREFIX, SUFFIX, etc. -- and note it then does not include static vectors, i.e. nothing semantic -- this is why there are warnings to not use _sm models for much more than quick tests (md / lg, while they use those token features, also significantly have the static word vectors(verify))
Also:
- vectors of larger units (span.vector and doc.vector) are calculated from the token.vector of the tokens they contain (verify)
- which implies ignoring tensors if you have them(verify)
- .similarity()
- works from .vectors
- see spacy/tokens/doc.pyx - it's basically just dot(vector, other.vector) / (self.vector_norm * other.vector_norm)
- except where overridden (by there being a 'similarity' in doc.user_hooks), e.g. in en_core_web_trf (verify)
- which implies ignoring tensors if you have them(verify) -- unless the model overrides this function specifically
- there is no built-in functionality that works from the best we have, e.g. tensors if there, vectors if not, complain if neither
- that's not necessarily easy to write, even, because the tensors may be an interesting model-specific thing
- assume transformers
- do not give vectors (and their doc.has_vector is probably False)
- ...because they store them elsewhere (e.g. doc._.trf_data.tensor), and override similarity() to fetch them from there
- so these are not the same as .tensor
- tok2vec (or similar) may be used internally
- this is not directly exposed -- it tends to just be an intermediate part between two parts of a pipeline
- It seems that tok2vec is spacy's name for "whatever ends up giving tokens vectors".
.tensor comes from a machine learning component, the tok2vec step,
- except in models that do things differently, e.g. en_core_web_sm
- and except in transformers, e.g. in en_core_web_trf
- stored in doc.tensor (one axis is the tokens)
- as the tok2vec component would report (e.g. english_lg.get_pipe('tok2vec').__doc__), tok2vec will "Apply a "token-to-vector" model and set its outputs in the doc.tensor attribute. This is mostly useful to share a single (sub)network between multiple components, e.g. to have one embedding and CNN network shared between a parser, tagger and NER."
- Tok2vec may use static vectors, but also other things (and in en_core_web_sm, only other things)
- this is also why .tensors need not have the same dimensions as .vectors[14]
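A rough sketch of poking at these attributes, assuming en_core_web_md is installed; the comments describe what you would typically expect there, not guarantees for other models:
import spacy
nlp_md = spacy.load("en_core_web_md")
doc = nlp_md("I like bananas")

print( doc.has_vector )        # True for md/lg, which ship static vectors
print( doc[2].vector.shape )   # e.g. (300,) - the static word vector for 'bananas'
print( doc.vector.shape )      # doc vector, averaged from its tokens' vectors
print( doc.tensor.shape )      # context-sensitive; one row per token, width set by the model's tok2vec
print( doc.similarity( nlp_md("I like apples") ) )   # cosine-style similarity based on .vector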
Example: en_core_web_sm
You'd think this would be the simplest case, but it's not.
This has a small vocabulary (84k).
To keep it small, it has no vectors ("doesn't ship with word vectors and only use context-sensitive tensors"[15]). Fine, makes sense.
Yet .vector is filled - what gives? It looks like it's doing context-dependent stuff, because even the same word gets different vectors. (AFAICT this is counter to spacy's documentation, and presumably they're doing it so that .similarity() works without overriding that function (which would seem the more proper solution?))
And .vector has the same values as .tensor. Which would be even weirder - if it weren't a hint at what's happening.
So, the .tensor attribute seems to be there for the sake of other components (like the tagger(verify)),
and as such seems to be based on some orthographic patterns. Alright, makes enough sense to me.
But it seems they then decided to put those tensors into .vector, to support some basic similarity(). It's a poor metric for similarity (there are no semantic statistics in it at all, and the devs indeed mention it is not robust[16]), but the devs seem to figure it's better than absolutely nothing.
spacy and transformers
tok2vec and transformers
Similarity
Slightly more technical
Visualisation
Matchers
Spacy provides some rule-based matching.
An instantiated Matcher does token-based matching that you build up from token descriptions
- docs intro to Matcher
- those token descriptions come in the form of a list of dicts; each such dict
- has a main key that is typically a token attribute
- for which the value is either
- a string to match
- an extended dict, which can use
- list logic with IN, NOT_IN, IS_SUBSET, IS_SUPERSET, INTERSECTS (and note that MORPH attrib is seen as a list(verify))
- compare int/float with ==, >=, <=, >, <
- REGEX [17]
- FUZZY [18] (edit distance based; since 3.5)
- an optional key OP that is a (mostly regex-like) quantifier, one of ?, +, *, {n}, {n,m}, {n,}, {,m}, and !
- The things you can use in the key, to match on, include ([see docs](https://spacy.io/usage/rule-based-matching#adding-patterns-attributes)) -- most of these mirror token attributes we've mentioned before.
- token text, derived forms, things assigned by the model/parse, and some existing patterns
- ORTH or TEXT - the text as-is
- LENGTH - length of TEXT
- LOWER - the lowercase version of the text
- NORM - the normalized version (seems to do things like resolve contractions, and otherwise often be the lemmatizer output?)
- LEMMA - lemmatized form
- SHAPE - alphabetic characters become X or x, numeric by d, and sequences of the same are truncated after 4, so e.g. Katherine80 would become Xxxxxdd
- POS - coarse tagging (often following wider conventions), e.g. NOUN
- TAG - finer tagging (more easily model/language specific), e.g. 'N|soort|ev|basis|zijd|stan'
- MORPH - morphological properties, something like Gender=Com|Number=Sing
- DEP - dependency relation in the parse
- LIKE_NUM
- LIKE_URL
- LIKE_EMAIL
- IS_SENT_START
- IS_ALPHA, IS_ASCII, IS_DIGIT
- IS_LOWER, IS_UPPER, IS_TITLE
- IS_PUNCT, IS_SPACE, IS_STOP
- _ for values in custom attributes
- If you want to see more interactively how these work, try https://demos.explosion.ai/matcher
For example:
{"LOWER": "hello"},
{"TEXT": {"REGEX": "^[Uu]nited$"}},
{"TAG": {"REGEX": "^V"}}
{"LENGTH": {">=": 10}}
{"IS_PUNCT": True, "OP": "?"}
{"MORPH": {"IS_SUBSET": ["Number=Sing", "Gender=Neut"]}}
# access custom attribute
{"_": {"country": {"REGEX": "^[Uu](nited|\\.?) ?[Ss](tates|\\.?)$"}}}
PhraseMatcher is a token-base matcher that you build up from a Doc
- you would often feed text in via nlp.make_doc(), which only runs that model's tokenizer, because that's the only part that would be used anyway
- so this saves time over doing a fuller parse (nlp())
- seems intended to make it easier to enter large lists of multi-token phrases - of exact matches (but for the attr detail)
matcher = PhraseMatcher(nlp.vocab, attr="LOWER") # token attribute to use
patterns = [nlp.make_doc(name) for name in ["John Smith", "Mary Sue"]]
matcher.add("Names", patterns)
I'm not sure how different it is from building a basic Matcher with something like:
from spacy.matcher import Matcher

phraselist = ['I like cheese', 'hungry like the wolf']
attr = 'LOWER'
patterns = []
for phrase in phraselist:
    patterns.append( list({attr: word.lower()} for word in phrase.split()) )   # one token dict per word
matcher = Matcher(nlp.vocab)
matcher.add('Phrases', patterns)
- DependencyMatcher[19]
- matches within the dependency structure
- and e.g. understands semgrex
Calling matchers returns matches as (match_id, start_token_offset, end_token_offset) tuples, for you to do with as you decide.
You can add callbacks, a different one per add() - see the sketch below.
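A sketch of consuming those tuples, and of attaching a callback to one rule (the callback signature is (matcher, doc, i, matches)); the rule name and pattern here are made up:
from spacy.matcher import Matcher

def on_match(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    print( 'callback saw:', doc[start:end].text )

matcher = Matcher(nlp.vocab)
matcher.add("CHEESE", [ [ {"LOWER": "cheese"} ] ], on_match=on_match)

doc = nlp("I like cheese.")
for match_id, start, end in matcher(doc):
    print( nlp.vocab.strings[match_id], start, end, doc[start:end].text )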
Rulers
Rulers are pipeline components based on matchers, and as such are usually added via add_pipe(), which also means you can specify these from configuration.
Entity Ruler
For example, adding an entity_ruler component means it will instantiate an EntityRuler which will annotate on doc.ents
(It is also specifically designed to augment the existing ner (EntityRecognizer),
in the sense that both these components will respect earlier annotation of that sort and not create overlapping entities,
because spacy doesn't allow that for NER - look to span_ruler if you need overlapping spans.)
ruler = nlp.add_pipe("entity_ruler")
patterns = [ {"label": "ORG", "pattern": "Apple"}, ]
ruler.add_patterns(patterns)
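Continuing that example, the ruler's matches then show up in doc.ents like any other entity (exactly what else is in there depends on whether a statistical ner also runs):
doc = nlp("Apple is looking at buying a startup.")
print( [ (ent.text, ent.label_) for ent in doc.ents ] )   # e.g. [('Apple', 'ORG'), ...]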
SpanRuler
Adding a span_ruler component means it will instantiate a SpanRuler
- uses the same pattern format as EntityRuler
- but its matches are annotated in doc.spans, which are allowed to overlap
- might be combined with SpanCategorizer[20]
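A minimal span_ruler sketch (the label and pattern are made up; by default the spans end up under the "ruler" key):
ruler = nlp.add_pipe("span_ruler")
ruler.add_patterns([ {"label": "FRUIT", "pattern": "blue bananas"} ])

doc = nlp("I like blue bananas")
print( [ (span.text, span.label_) for span in doc.spans["ruler"] ] )   # [('blue bananas', 'FRUIT')]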
Serializability
Pipelines and config
A pipeline is a sequence of existing components.
Components are as separated as they can be, to minimize coupling, allow combination, and allow sharing of some data.
A pipeline usually won't break if you change the order or remove items.
Much of that can be overstated, though. The tokenizer isn't optional at all, and a lot of things do rely on annotation or data from a previous item, not least because various smarter parts depend on tok2vec (context-sensitive embeddings).
And if you add training to a model, it's no longer just config. It's also the data.
Config stuff
In particular models are represented by a configuration.
As long as you are perfectly happy doing a spacy.load on a model someone else made, and not training, you can pretend the config stuff doesn't exist.
But once you start changing things, you need to understand how the config stuff works, and probably also why spacy also puts some focus on this to start with.
Spacy tries to make it so that anything you could instantiate from code with keywords can also be specified in a configuration dict / file (leans on factories).
And it's not just numbers you hand in - it's also instantiated classes that need to be passed along to others.
To do all that, they made a powerful config system.
The config system looks awkward and opaque and like a magical DSL at first, but once you understand what it is trying to do (keep the whole thing programmable, keep it easily serialized, have some type checking, avoid having to write a lot of boilerplate in code, and potentially even keep it readable) and the amount of instantiation and indirection that needs to go on for that, it is actually fairly elegant at that.
If you look at a model, its config is a collection of all parameters.
To make it a little less scary, inspect a model's config file - it's large, but:
- it's a file called config.cfg in the model directory
- the format is INI/ConfigParser-like, but extended with things like variable interpolation; the reader also understands references to functions and such(verify), understands 'call this function to get your actual parameters', and some other additions that make this a more composable thing[21]
- As data it's a nested dict (e.g. english_md.config); in the file, that nesting just means dot-separated header names.
At any time, only part of that is used.
- At runtime it may look at the [nlp] and [components.*] blocks,
- while training it will look at the [initialize] and [training] blocks[22], which also helps reproducibility (ideally the actual training command doesn't need to set any parameters).
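A small sketch of poking at a loaded model's config (en_core_web_sm assumed here; the exact keys vary per model):
import spacy
nlp = spacy.load("en_core_web_sm")

print( nlp.config["nlp"]["pipeline"] )                # the component order
print( list(nlp.config["components"].keys()) )        # one block per component
print( nlp.config["components"]["ner"]["factory"] )   # which factory builds that component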
A pipeline is part of the Language (nlp) object
Models often default to a set of components like
- tok2vec, tagger, parser, senter, attribute_ruler, lemmatizer, and ner
- (to inspect a loaded model, see e.g. nlp.pipeline, nlp.pipe_names, or nlp.config['nlp']['pipeline'])
Not loading, or not using
You can
- exclude components
- while loading: exclude= keyword
- after loading: remove_pipe(name)
- disable components
- while loading: disable= keyword (in v3 these are loaded but disabled)
- while loading: enable= keyword (if you know the specific set you need - the rest are still loaded but disabled(verify))
- after loading: disable_pipe(name)
- nlp.disabled lists the names of currently disabled components
Also note that select_pipes() (previously disable_pipes()) is a context manager that has enable= and disable= and may be the briefest way to disable things temporarily (example below).
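Filling in that example with a sketch (the component names are those of the English models mentioned above):
# temporarily run with only some components enabled
with nlp.select_pipes(enable=["tok2vec", "tagger", "attribute_ruler", "lemmatizer"]):
    doc = nlp("Just tagging and lemmatizing here.")
    print( [tok.lemma_ for tok in doc] )
    print( nlp.disabled )      # the components switched off inside this block

print( nlp.pipe_names )        # outside the with-block everything is back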
There's
- enable_pipe() - to re-enable a loaded-but-disabled component
- has_pipe() - to test whether a component is present, by name
- get_pipe() - to get the component object, by name
- replace_pipe()
- rename_pipe()
See also:
When augmenting or training, do you save parts or the whole thing?
Trainable components
trainable components include
- tagger
- morphologizer
- trainable_lemmatizer
- parser
- ner
- textcat
- spancat
- can do overlapping spans
General spacy training notes
Training in spacy 2 versus training in spacy 3
Training NER
textcat / TextCategorizer
In spacy v3:
- textcat - mutually exclusive labels, predicts is one of them
- textcat_multilabel - prediction is zero or more among the set of labels
(more separated now; in v2 this was implied by parameters)
For some tasks, both might work. Say, binary classification can be done with
- textcat and two labels (you care which one is selected), or
- textcat_multilabel with one label (you care whether it is selected or not)
If you have relatively distinct things to detect, then you probably want to do that in entirely distinct components.
Using TextCategorizer
A TextCategorizer takes a Doc as input, and produces a score for each potential label class.
Training TextCategorizer
To make things interesting, there were some structural changes made between spacy v2 and v3.
- We will be using v3's preferred method (which uses some command line tools),
- there is also the v2 way, which is more code-based (and currently the variant more likely seen in internet tutorials)
- and you can still do that in v3 (but with some changes that you won't find in said v2 tutorials)
Training textcat amounts to (see the sketch after this list):
- add textcat component to a pipeline
- add valid labels to it
- (train the model)
- (save the model)
- use the trained model
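A sketch of the first two of those steps in code (the labels are made up; in v3 the actual training would then typically be done with spacy train and a config):
import spacy

nlp = spacy.blank("en")
textcat = nlp.add_pipe("textcat")      # or "textcat_multilabel"
textcat.add_label("POSITIVE")
textcat.add_label("NEGATIVE")
print( nlp.pipe_names, textcat.labels )

# once trained, per-label scores show up in doc.cats, e.g. {'POSITIVE': 0.93, 'NEGATIVE': 0.07}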
spancat / SpanCategorizer
Non-core useful stuff
Language detection
Not part of spacy itself, but e.g. spacy-fastlang is a relatively thin wrapper around fasttext.
Consider:
import spacy, spacy_fastlang
nlp = spacy.blank("xx")
nlp.add_pipe( "language_detector")
for example in ("I am cheese", "Ik ben een kaas", "Je suis un fromage", "Ich bin Käse","Ek is kaas", "saya keju", "Olen juusto"):
doc = nlp(example)
print( "%s (certainty: %.2f) for %r"%( doc._.language, doc._.language_score, example) )
en (certainty: 0.73) for 'I am cheese'
nl (certainty: 0.53) for 'Ik ben kaas'
fr (certainty: 1.00) for 'Je suis fromage'
de (certainty: 1.00) for 'Ich bin Käse'
af (certainty: 0.66) for 'Ek is kaas'
id (certainty: 0.63) for 'saya keju'
fi (certainty: 0.91) for 'Olen juusto'
More on sentences
The thing splitting into sentences can work in one of a few ways, including:
- a classical rule-based Sentencizer
- which makes some classic mistakes
- sentence boundaries set by the dependency parser
- in fact done as part of the dependency parse itself
- tends to make fewer mistakes
- a trained senter (SentenceRecognizer) component that only does sentence splitting
If you mostly use premade models, you get whatever it uses.
If that happens to be one of the simpler variants that makes rather dumb mistakes, you might want to change that.
If you want to separate that job out, do just sentence splitting, or avoid relying on a specific large model,
consider something like xx_sent_ud_sm, a 4MB multi-language senter-only model.
import spacy
nlp = spacy.load("xx_sent_ud_sm")
txt = """C'est n'est pas une pipe. A capital after Mr. Abbreviation might throw things off. (Also) weird sentence starts could.
As can ellipses... as you can imagine. As could "Things we quote?". As they are embedded sentences and what comes after is unknown."""
for sent in nlp(txt).sents:
print(sent)
C'est n'est pas une pipe.
A capital after Mr. Abbreviation might throw things off.
(Also) weird sentence starts could.
As can ellipses... as you can imagine.
As could "Things we quote?" ...as they are embedded sentences and what comes after is unknown.
(and that last one would be different with a comma involved)
Technical practicalities
On speed (and GPU)
On memory
Warnings and errors
UserWarning: [W095] Model 'model_name' (3.4.0) was trained with spaCy v3.4 and may not be 100% compatible with the current version (3.5.0).
(or other versions)
For downloaded models, the mentioned command
python -m spacy validate
will give the commands to fetch updates for those.
GPU is not accessible. Was the library installed correctly?
If you installed before realizing you need to think about GPU dependencies, then you probably copy-pasted something like:
pip install spacy
...when you now want it to pull in the right (python) libraries for the CUDA version you have (it seems that these days you can use cuda-autodetect - though that won't always do the right thing if you have multiple versions installed)
- figure out the CUDA version you have. On linux, e.g. nvidia-smi will tell you
- run something like the following (this one is for CUDA 11.0)
pip install -U spacy[cuda110]
or, these days, the autodetecting variant:
pip install -U spacy[cuda-autodetect]
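A quick sketch to check whether spacy actually sees the GPU after that:
import spacy
print( spacy.prefer_gpu() )    # True if a usable GPU was found, otherwise falls back to CPU
# spacy.require_gpu()          # same idea, but raises if there is no usable GPU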
See also pages like https://spacy.io/usage