Questions and Answers
Wow, how great!! Thanks Duygu Altinok for doing this 🙂
My first question: What can you perform with spaCy but cannot with other NLP libraries (nltk, gensim, textblob, …)? Which features are unique?
Thanks for the question!
First of all, we want to offer users complete pipelines for each supported language. In this aspect, NLTK and TextBlob are similar, but Gensim is different.
We include the following components (see the sketch after this list):
- Sentence segmenter
- Tokenizer
- Lemmatizer
- Morphologizer
- NER
- Dependency parser
- POS tagger
- Rule-based matcher
- Entity linker
- Vectors & semantic similarity
- Textcat
- Spancat
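You can see which of these components a given pretrained pipeline ships with by printing its component names; a minimal sketch, assuming the small English model en_core_web_sm is installed:
```
import spacy

# load a pretrained English pipeline (assumes en_core_web_sm is installed)
nlp = spacy.load("en_core_web_sm")

# list the components the pipeline runs, in order
print(nlp.pipe_names)
# e.g. ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
```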
The problem with NLTK is that it's not really suitable for industry-level usage. Some pretrained models are ancient, but the main problem is the speed. I'd say it's good for academic use, but not really suitable for production-level NLP.
Coming to unique components that you cannot find anywhere else, I'd say:
- Morphologizer: This is a trainable component that calculates morphological features by looking at the word form. One example: hermosa -> singular, feminine (see the sketch after this list).
- Rule-based matchers are definitely unique and very useful components for extracting information based on patterns.
- Tok2vec: This component is really unique; it allows the dependency parser, named entity recognizer and POS tagger to share a common neural network. This layer also generates word vectors for OOV (out-of-vocabulary) words, hence feature calculations for OOV words don't fail 😉 That's a disadvantage in many libraries: misspelled words and words that are not in the model's vocab sometimes fail and sometimes work with statistical models. We wanted to handle OOVs in a proper way.
- Entity linker: This component disambiguates textual mentions (tagged as named entities) to unique identifiers, grounding the named entities in the real world via a KnowledgeBase.
- Spancat: This is a unique and very useful component too; it can classify word spans. I'll post an example project link.
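Here's that Morphologizer sketch: a minimal example of reading the morphological features off each token (assuming en_core_web_sm is installed; the exact feature sets depend on the model):
```
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("She was reading the papers")

# token.morph holds the analysis as Universal Dependencies morphological features
for token in doc:
    print(token.text, token.morph)
# e.g. "She" -> Case=Nom|Gender=Fem|Number=Sing|Person=3|PronType=Prs
```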
So each component is customizable with your own training data, easy to use and really blazing fast.
A final remark: we also have transformer-based pipelines. I consider that pretty unique as well 🙂
Oh yes! Rule-based matchers are awesome 🙂
I really like playing with the Matcher too; I enjoyed writing the Matcher chapter quite a lot. Also, at Explosion we'll soon publish some videos, and I plan to make a Matcher video 🙂
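To give a flavor, a minimal Matcher sketch; the pattern and the model name are just illustrative assumptions:
```
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

# match any form of "buy", an optional determiner, then a noun
pattern = [{"LEMMA": "buy"}, {"POS": "DET", "OP": "?"}, {"POS": "NOUN"}]
matcher.add("BUY_NOUN", [pattern])

doc = nlp("I bought a car and she buys tickets")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)
# bought a car
# buys tickets
```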
What tasks can be done with spaCy more efficiently than with a transformer?
Thanks for the question! :woman-raising-hand:
Here, the paradigms are a bit different. Transformers provide a state-of-the-art way of calculating context-dependent word vectors and sentence representations. Hence, if you feed your sentence to a transformer, you get a word vector for each token and a dense representation of your sentence.
If you want to do text classification, POS tagging or other downstream tasks with a Transformer (see the sketch after this list):
- You need to find an annotated corpus
- You need to train the Transformer by adding more layers on top (which should suit your task: seq2seq for NER or POS tagging, or just a softmax for text classification)
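Here's the sketch: the raw-transformer route using the Hugging Face transformers library (my pick purely for illustration). You get contextual vectors, but the task head and its training are still on you:
```
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("I went there fast", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# one contextual vector per subword token; a task head still has to be
# added and trained on an annotated corpus before this is useful
print(outputs.last_hidden_state.shape)  # e.g. torch.Size([1, 6, 768])
```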
spaCy's paradigm is different: we want to provide users with pretrained pipelines that are usable immediately. This way you don't worry about the training phase and can directly start creating NLP applications. Here's an example:
```
>>> import spacy
>>> nlp = spacy.load("en_core_web_md")  # load our pretrained pipeline
>>> doc = nlp("I went there fast")
>>> for token in doc:
...     print((token, token.pos_))
...
(I, 'PRON')
(went, 'VERB')
(there, 'ADV')
(fast, 'ADV')
```
Then do your stuff with the POS tags.
Our pipelines are based either on word vectors or on Transformers, so we use Transformers for our downstream tasks too 😉
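Loading a transformer-based pipeline looks exactly like loading a word-vector one; a minimal sketch, assuming en_core_web_trf and the spacy-transformers package are installed:
```
import spacy

# transformer-based English pipeline (needs the spacy-transformers package)
nlp = spacy.load("en_core_web_trf")
doc = nlp("I went there fast")
print([(token.text, token.pos_) for token in doc])
```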
Hi Duygu Altinok,
Does this book cover putting spaCy into production?
Moin, thanks for the question. There's no dedicated section in my book. However, we aim for our pipelines to be easily usable, hence integrating spaCy and pretrained models into your project is not difficult at all. You can just add 2 lines to your project's requirements.txt and you're ready to go 🙂 Here's an example of what you can add:
spacy>=3.0.0
https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.0/en_core_web_sm-2.2.0.tar.gz#egg=en_core_web_sm-2.2.0
This is quite pain-free. Usually with pretrained models:
- DevOps folks need to download the model and store it somewhere (usually AWS S3)
- Then put some S3 checkout lines in the project scripts to download the model
- and more work
Compared to that, the spaCy way is really painless 😁
Thank you Duygu Altinok for the answer 🙂
Follow-up question: are there any aspects of spaCy pipelines which make them better than other framework pipelines (sklearn pipelines, transformers pipelines etc.)?
Hi Duygu, for students like me and people in academia, would you recommend spaCy over other alternatives such as NLTK? Thanks! 🙂
Yes 🙂
Please refer to this previous answer: https://datatalks-club.slack.com/archives/C01H403LKG8/p1639390846197000?thread_ts=1639386552.194400&cid=C01H403LKG8
Oh I didn’t see that answer, thanks!
Hey, Duygu Altinok. Thanks for doing this!
I have applied for many AI roles in the past year and read many NLP job descriptions. I can recall spaCy being mentioned only a few times. Why do you think the library is hardly mentioned when so many NLP applications are based on it?
Many thanks for this interesting question! :woman-raising-hand:
This is a point that is really curious to me, too. I know a few companies where the whole application depends on spaCy, but there's still no mention of it in the job ad. Even if I do see spaCy in the ad, it's something quite generic, like this:
- familiarity with NLP libraries such as gensim, nltk, spacy
I think the reason is copy-paste mania among HR. At the company I worked for three jobs ago, HR prepared the job ads and then showed them to us for corrections. Here's one example of what he wrote down:
- terabytes of data (not true at all)
- state of art deep learning models (no DL, just some logistic regression)
- experience with spark (not used at all)
So I asked him how he prepared this job ad, and I found out how most HR folks prepare job ads: look for some NLP job ads online and make a mix-and-match of those. This process converges to a very generic NLP job ad at the end 🧠
Seems that technical people should get more involved in the hiring process!
Is this book best for beginners, intermediate practitioners, or almost experts? Would college students interested in AI find this book useful?
I included explanations of concepts along with examples, so this is a book I 100% recommend for beginners and intermediate-level colleagues. Each chapter starts with explanations, includes lots of code and ends with example applications.
For expert colleagues, I think the hands-on chapters are where they'll find the most value; especially the chapter on building a chatbot should be of interest to all NLP lovers. The rest of the book of course has teaching material as well, so I leave this decision to NLP experts themselves 🙂
I have spent an inordinate amount of my life explaining to people how to get information from text. Do you recommend starting with the easier bag of words, tf-idf models, or jump right into more complicated embeddings?
I started really from scratch, because when I started (around 13 years ago) there were only tf-idf and bag of words 🙂 When word vectors came along, I 100% understood the problem they wanted to solve and the innovation they introduced. After that, transformers came and it was the same: I appreciated the solutions that word vectors couldn't offer.
I definitely saw the benefit of building up. When I'm vectorizing text, I always know:
- the upsides and downsides of my approach
- the shape of the vectors I generate, and the space they'll take
- what sort of information my vectors encode
- why my approach would work
- for which cases my approach wouldn't work, and what I can do about it
So I definitely recommend starting from scratch; some tf-idf computations help students warm up to vector computations anyway (see the sketch below). However, young people are quite impatient and want to build cool stuff ASAP, and I'm not 100% sure they have the patience to practice from scratch 🙂 I'd say if this is a college class, I'd go ahead and teach it. If it's a professional course for junior colleagues, I'd go over the basics in 1 hour with some examples and then assign some reading material.
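Here's that warm-up sketch: a tf-idf computation in a few lines, using scikit-learn (my pick of library purely for illustration):
```
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the cat sat on the mat", "the dog sat on the log"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

# the shape of the vectors and what each dimension encodes are both visible
print(X.shape)                             # (2, vocabulary size)
print(vectorizer.get_feature_names_out())  # the vocabulary behind the columns
```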
Hi Duygu Altinok, does the book cover training/fine-tuning a Transformer as part of a spaCy pipeline? If not, what resources would you recommend on the topic? I was trying to integrate spaCy NER with a Transformer a few months back and it is not a really obvious task.
OK, I covered the code for how to fine-tune some components, including NER and textcat.
If you want to fine-tune a component, it's not so difficult; one just needs to provide some examples with labels (see the sketch below). For your case, you can download a transformer-based model and then fine-tune NER. You can see an example of how to provide examples in my book's code: https://github.com/PacktPublishing/Mastering-spaCy/blob/main/Chapter07/train_on_cord19.py
If you want to fine-tune the transformer itself, it's not an easy task because you need huge amounts of data. Transformers are giants; even with a huge corpus I noticed I only fine-tuned maybe the last 2 layers. If you want to go down this road, then you need a spaCy project: https://github.com/explosion/projects/tree/v3/pipelines/ner_wikiner
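Here's the sketch for the component route: feeding labeled examples to an existing NER component via the spaCy 3.x training API. The toy sentence and entity offsets are made up for illustration:
```
import spacy
from spacy.training import Example

nlp = spacy.load("en_core_web_sm")  # or a transformer-based model like en_core_web_trf

# toy training data: (text, {"entities": [(start_char, end_char, label)]})
train_data = [
    ("Acme Corp hired Jane Doe", {"entities": [(0, 9, "ORG"), (16, 24, "PERSON")]}),
]

optimizer = nlp.resume_training()  # keep the existing weights, update them
for text, annotations in train_data:
    doc = nlp.make_doc(text)
    example = Example.from_dict(doc, annotations)
    nlp.update([example], sgd=optimizer)
```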
Duygu Altinok
I see that spaCy supports distributed training via Ray. Are the text-preprocessing & inference pipelines also distributed?
No, not supported due to Cython reasons.
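Within one machine you can still parallelize inference with nlp.pipe, which batches documents and can spread work over local processes; a minimal sketch (local multiprocessing, not Ray):
```
import spacy

nlp = spacy.load("en_core_web_sm")
texts = ["First document.", "Second document."] * 100

# batch the texts and spread work over two local processes
for doc in nlp.pipe(texts, batch_size=50, n_process=2):
    pass  # do your per-document work here
```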
Duygu Altinok Is it possible to perform stemming using spaCy? If yes, how? If not, why? I know there is lemmatization but I couldn’t find stemming
Thanks for the question!
No, we don’t have a stemming component.
You can hunt down the root of a word either by lemmatization or by stemming. Lemmas are more useful in general, so we prefer using lemmas.
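To see the difference, a minimal sketch contrasting spaCy lemmas with an NLTK Porter stemmer (NLTK appears here purely for the comparison):
```
import spacy
from nltk.stem import PorterStemmer  # stemmer used only for contrast

nlp = spacy.load("en_core_web_sm")
stemmer = PorterStemmer()

for word in ["studies", "running", "was"]:
    lemma = nlp(word)[0].lemma_
    print(word, "| lemma:", lemma, "| stem:", stemmer.stem(word))
# stems are chopped strings ("studi"), lemmas are real words ("study")
```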
Duygu Altinok this question may be out of context.
I want to build an “Arabic to Turkish” dictionary.
I surmise that I need to use a stemmer for it.
Would you suggest a tool for arabic word stemming?
Hey ho!
For this task you need a lemmatizer; a stemmer won't be much help.
I know two good Arabic lemmatizers, Farasa https://farasa.qcri.org/lemmatization/ and Madamira https://camel.abudhabi.nyu.edu/madamira/
If you need further help you can always open an issue at spaCy discussions :woman-raising-hand:
Thank you very much🙂
Hi Duygu Altinok and thanks for joining us for the Q&A this week!
I’ve been using spaCy a bit recently to detect sentences in legal text. I was really impressed with the documentation and how easy it was to get started.
For now, I’m particularly interested in the more advanced chapters, as I can see myself needing to tweak the model to achieve better results and haven’t gotten into that yet.
Would you say the learning curve for the advanced chapters is as quick as getting started or does it get steeper as you progress? What other domains / resources would be useful to speed up mastery of the advanced chapters?
I'd say no, you need to spend more energy on the advanced chapters 🙂 The advanced chapters use information from the previous chapters, so one needs to really melt everything they've learnt so far into one pot to digest the content. Still, putting the knowledge together and seeing how the code ends up is fun 😉
Hi Duygu Altinok thanks for being here and answering our questions. I would like to ask:
Thanks for the questions!
- What project would you recommend to do if you want to get started with spaCy?
Universe has a number of projects: https://spacy.io/universe
All of the resources are great. I can recommend negspaCy, EpiTator and the Rule-based Matcher Explorer for newcomers 😉
- Are there drawbacks of using spaCy in comparison to other tools?
I don’t know any 🙂
- What is new in spaCy 3.0? Do you cover the new features in your book?
That's a long list 🙂 I can recommend Ines's video (https://www.youtube.com/watch?v=BWhh3r6W-qE) and the what's-new page: https://spacy.io/usage/v3
My book is spaCy 3.x compatible. I focused on how to create applications with spaCy rather than on spaCy internals.
thank you Duygu Altinok
Hello Duygu Altinok, appreciate your taking the time this week to answer questions!
For someone getting started in NLP, would learning it using spaCy be appropriate? Or is too much performed automatically by the library/framework, so that one would miss out on some basics?
Thanks for the question!
I’d say it’s a good start because one needs to understand the linguistic concepts to grasp the spaCy concepts.
If I was a starter I’d do the following:
- Get started with Keras
- Read Jurafsky’s chapters 1-7
- Get started with spaCy
- Read Jurafsky chapters 12-14
- Become more accustomed to working with Keras, work on some easy/mid-level text classification projects
- Advance with spaCy
- Do some seq2seq models with Keras
- Read the rest of Jurafsky
More advanced NLP colleagues (including me 🙂) design and use different types of seq2seq architectures for different tasks. At the very end, while coding you should always have a clear mind: almost all text-based models are based on a linguistic concept. Without grasping the linguistic concepts, it's easy to get lost. This is why I suggest learning spaCy: while using the library you get a chance to work with different concepts of linguistics. Hope it works 🙂
Forgot to give the Jurafsky book link: https://web.stanford.edu/~jurafsky/slp3/
Thanks Duygu Altinok !
Another question… how robust are the pretrained pipelines? In what scenarios would one need to (re)train their own model, as opposed to just using the pretrained ones?
The pretrained pipelines are quite good and usable (at least for German, English, French, Spanish and Portuguese; I tried those myself 😁).
For some projects you definitely need your own NER, but for others you don't, if your design includes general entities such as location, person, time-date and money entities (see the sketch below). Here's some example code for parsing restaurant reservation utterances (extracting entities and intent): https://github.com/PacktPublishing/Mastering-spaCy/tree/main/Chapter10
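Here's the sketch for the general-entities case: what the pretrained NER gives you out of the box (the utterance is made up):
```
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Book a table for two at Piazza in Berlin at 7pm on Friday")

# the pretrained NER already covers persons, places, dates, times, money...
for ent in doc.ents:
    print(ent.text, ent.label_)
# e.g. Berlin GPE / 7pm TIME / Friday DATE
```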