DataTalks.Club

Mastering spaCy

by Duygu Altinok

The book of the week from 13 Dec 2021 to 17 Dec 2021

Build end-to-end industrial-strength NLP models using advanced morphological and syntactic features in spaCy to create real-world applications with ease.

spaCy is an industrial-grade, efficient NLP Python library. It offers various pre-trained models and ready-to-use features. Mastering spaCy provides you with end-to-end coverage of spaCy’s features and real-world applications.

By the end of this book, you’ll be able to confidently use spaCy, including its linguistic features, word vectors, and classifiers, to create your own NLP apps.

Questions and Answers

Krzysztof Ograbek

Wow, how great!! Thanks Duygu Altinok for doing this 🙂

Krzysztof Ograbek

My first question: What can you do with spaCy that you cannot do with other NLP libraries (NLTK, gensim, TextBlob, …)? Which features are unique?

Duygu Altinok

Thanks for the question!
First of all, we want to offer users complete pipelines for each supported language. In this respect, NLTK and TextBlob are similar, but gensim is different.
We include the following components:

  • Sentence segmenter
  • Tokenizer
  • Lemmatizer
  • Morphologizer
  • NER
  • dependency parser
  • POS tagger
  • Rule based matcher
  • Entity linker
  • Vectors & semantic similarity
  • Textcat
  • Spancat

Duygu Altinok

The problem with NLTK is that it’s not really suitable for industry-level usage. Some pretrained models are ancient, but the main problem is speed. I’d say it’s good for academic use, but not really suitable for production-level NLP.
As for unique components that you cannot find anywhere else, I’d say:

  • Morphologizer: This is a trainable component that predicts morphological features by looking at the word. One example: hermosa -> singular, feminine
  • Rule-based matchers are definitely unique and very useful components for extracting information based on patterns.
  • Tok2vec: This component is really unique; it allows the dependency parser, named entity recognizer and POS tagger to share a common neural network. This layer also generates word vectors for OOV (out-of-vocabulary) words, so feature calculations for OOV words don’t fail 😉 That’s a disadvantage in many libraries: misspelled words and words that are not in the model’s vocab sometimes fail and sometimes work with the statistical models. We wanted to handle OOVs in a proper way.
  • Entity linker: This component disambiguates textual mentions (tagged as named entities) to unique identifiers, grounding the named entities in the real world via a KnowledgeBase.
  • Spancat: This is a unique and very useful component too; it can classify word spans. I’ll post an example project link.

Duygu Altinok

So each component can be customized with your own training data, is easy to use and is really blazing fast.
A final remark is that we also have transformer-based pipelines. I consider that pretty unique as well 🙂

Krzysztof Ograbek

Oh yes! Rule-based matchers are awesome 🙂

Duygu Altinok

I really like playing with Matcher too, I enjoyed writing the Matcher chapter quite a lot. Also as Explosion we’ll soon publish some videos and I plan to make a Matcher video 🙂
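
For readers who have not seen the Matcher before, here is a minimal sketch of a token-pattern match. It assumes spaCy 3.x is installed and uses a blank English pipeline, so no pretrained model is needed; the pattern name `GREETING`, the pattern itself and the sample sentence are invented for illustration.

```python
import spacy
from spacy.matcher import Matcher

# A blank pipeline only tokenizes, which is enough for patterns on token text.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Hypothetical pattern: "good" followed by "morning" or "evening", any casing.
pattern = [{"LOWER": "good"}, {"LOWER": {"IN": ["morning", "evening"]}}]
matcher.add("GREETING", [pattern])

doc = nlp("Good morning, I want to book a table. Good evening!")
# Each match is a (match_id, start_token, end_token) triple.
spans = [doc[start:end].text for match_id, start, end in matcher(doc)]
print(spans)
```

Patterns that rely on attributes such as `POS` or `LEMMA` would additionally need a pipeline that predicts them, for example one loaded with `spacy.load`.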

Doink

What tasks can be done with spaCy more efficiently than with a transformer?

Duygu Altinok

Thanks for the question 🙋‍♀️
Here, the paradigms are a bit different. Transformers provide a state-of-the-art way of calculating context-dependent word vectors and sentence representations. Hence, if you feed your sentence to a transformer, you get a word vector for each token and a dense representation of your sentence.
If you want to do text classification, POS tagging or other downstream tasks with a Transformer:

  • You need to find an annotated corpus
  • You need to train the Transformer by adding more layers on top (which should suit your task: seq2seq for NER or POS tagging, or just softmax for text classification).

spaCy’s paradigm is different: we want to provide users with pretrained pipelines that are usable immediately. This way you don’t have to worry about the training phase and can directly start creating NLP applications. Here’s an example:
```
>>> import spacy
>>> nlp = spacy.load("en_core_web_md")  # Load our pretrained model
>>> doc = nlp("I went there fast")
>>> for token in doc:
...     token, token.pos_
...
(I, 'PRON')
(went, 'VERB')
(there, 'ADV')
(fast, 'ADV')
```

Then do your stuff with the POS tags.

Duygu Altinok

Our pipelines are either based on word vectors or on Transformers. So we use Transformers for our downstream tasks too 😉

WingCode

Hi Duygu Altinok,
Does this book cover putting spacy into production?

Duygu Altinok

Moins, thanks for the question. There’s no dedicated section in my book. However, we aim for our pipelines to be easily usable, so integrating spaCy and pretrained models into your project is not difficult at all. You can just add two lines to your project’s requirements.txt and you’re ready to go 🙂 Here’s an example of what you can add:
spacy>=3.0.0
https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.0/en_core_web_sm-2.2.0.tar.gz#egg=en_core_web_sm-2.2.0

Duygu Altinok

This is quite pain-free. Usually, with pretrained models:

  • DevOps needs to download the model and store it somewhere (usually on AWS)
  • Then add some S3 download lines to the project scripts to fetch the model
  • and more work

Compared to that, the spaCy way is really painless 😁

WingCode

Thank you Duygu Altinok for the answer 🙂
Follow-up question: are there any aspects of spaCy pipelines that make them better than other frameworks’ pipelines (sklearn Pipeline, transformers pipeline, etc.)?

Álvaro Budría

Hi Duygu, for students like me and people in academia, would you recommend spaCy over other alternatives such as Nltk? Thanks! 🙂

Álvaro Budría

Oh I didn’t see that answer, thanks!

Matthew Emerick

Hey, Duygu Altinok. Thanks for doing this!
I have applied for many AI roles in the past year and read many NLP descriptions. I can barely recall just a few times when spaCy is mentioned. Why do you think the library is hardly mentioned when so many NLP applications are based on it?

Duygu Altinok

Many thanks for this interesting question 🙋‍♀️
This is something I’m really curious about, too. I know a few companies where the whole application depends on spaCy, but there’s still no mention of it in the job ad. Even when I see spaCy in an ad, it’s something quite generic like this:

  • familiarity with NLP libraries such as gensim, nltk, spacy
Duygu Altinok

I think the reason is copy-paste mania among HR. At a company I worked for three jobs ago, HR would prepare the job ads and then show them to us for corrections. Here’s one example of what he wrote down:

  • terabytes of data (not true at all)
  • state-of-the-art deep learning models (no DL, just some logistic regression)
  • experience with Spark (not used at all)

So I asked him how he had prepared this job ad, and I found out how most HR people prepare job ads: they look for some NLP job ads online and mix and match them. In the end, this process converges to a very generic NLP job ad 🧠

Álvaro Budría

Seems that technical people should get more involved in the hiring process!

Matthew Emerick

Is this book best for beginners, intermediate practitioners, or almost experts? Would college students interested in AI find this book useful?

Duygu Altinok

I included explanations of concepts along with examples, so this is a book I 100% recommend for beginners and intermediate-level colleagues. Each chapter starts with explanations, includes lots of code and ends with example applications.
For expert colleagues, I think the hands-on chapters will be the most useful. The chapter on building a chatbot, especially, would be of interest to all NLP lovers. The rest of the book of course has teaching material as well, so I leave this decision to the NLP experts themselves 🙂

CJ

I have spent an inordinate amount of my life explaining to people how to get information from text. Do you recommend starting with the easier bag of words, tf-idf models, or jump right into more complicated embeddings?

Duygu Altinok

I started really from scratch, because when I started (around 13 years ago) there was only tf-idf and bag of words 🙂 When word vectors came, I 100% understood the problem they wanted to solve and the innovation they introduced. Then transformers came and it was the same: I appreciated the solutions that word vectors couldn’t offer.
I definitely saw the benefit of building up. When I’m vectorizing text, I always know:

  • the good sides and downsides of my approach
  • the shape of the vectors I generate, and the size they’ll take
  • what sort of information my vectors encode
  • why my approach would work
  • for which cases my approach wouldn’t work and what I can do about it

So I definitely recommend starting from scratch; some tf-idf computations also help students warm up to vector computations anyway. However, young people are quite impatient and want to build cool stuff asap. I’m not 100% sure they have the patience to practice from scratch 🙂 If this is a college class, I’d go ahead and teach it. If it’s a professional course for junior colleagues, I’d go over the basics in 1 hr with some examples and then assign some reading material.
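
As a warm-up of the kind described above, here is a minimal tf-idf sketch using only the Python standard library. The function name `tfidf`, the toy corpus, whitespace tokenization and the plain `log(N / df)` weighting are assumptions made for illustration; libraries typically use smoothed variants.

```python
import math
from collections import Counter

def tfidf(corpus):
    """Return one {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in corpus]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Term frequency times inverse document frequency.
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]
vectors = tfidf(corpus)
for vector in vectors:
    print(vector)
```

Printing the vectors makes two of the points in the list visible: each vector has at most one entry per vocabulary term, and terms shared by several documents get lower weights than terms found in only one.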
Dmitriy Shvadskiy

Hi Duygu Altinok, does the book cover training/fine-tuning a Transformer as part of a spaCy pipeline? If not, what resources would you recommend on the topic? I was trying to integrate spaCy NER with a Transformer a few months back and it is not really an obvious task.

Duygu Altinok

OK, I covered the code for fine-tuning some components, including NER and textcat.
If you want to fine-tune a component, it’s not so difficult; one just needs to provide some examples with labels. For your case, you can download a transformer-based model and then fine-tune NER. You can see how to provide examples in my book’s code: https://github.com/PacktPublishing/Mastering-spaCy/blob/main/Chapter07/train_on_cord19.py
If you want to fine-tune the transformer itself, it’s not an easy task because you need huge amounts of data. Transformers are giants; even with a huge corpus, I noticed I only fine-tuned maybe the last 2 layers. If you want to go down this road, then you need a spaCy project: https://github.com/explosion/projects/tree/v3/pipelines/ner_wikiner
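
A rough sketch of what "providing examples with labels" can look like in spaCy 3.x follows. It is not the code from the book: it trains an NER component in a blank English pipeline so that it runs without a downloaded model, and the three sentences, the `GPE` label and the helper `to_example` are made up for illustration.

```python
import random

import spacy
from spacy.training import Example

# Hypothetical toy data: (text, mention, label). Real projects need far more.
RAW_DATA = [
    ("I need a flight to Berlin", "Berlin", "GPE"),
    ("Book a table in Paris", "Paris", "GPE"),
    ("She moved to Madrid last year", "Madrid", "GPE"),
]

def to_example(nlp, text, mention, label):
    """Turn one labelled mention into a spaCy Example with character offsets."""
    start = text.index(mention)
    annotations = {"entities": [(start, start + len(mention), label)]}
    return Example.from_dict(nlp.make_doc(text), annotations)

# A blank pipeline keeps the sketch self-contained. To fine-tune an existing
# pipeline you would load it with spacy.load(...) and call
# nlp.resume_training() instead of nlp.initialize().
nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
for _, _, label in RAW_DATA:
    ner.add_label(label)

examples = [to_example(nlp, *row) for row in RAW_DATA]
optimizer = nlp.initialize(lambda: examples)

random.seed(0)
losses = {}
for epoch in range(10):
    random.shuffle(examples)
    losses = {}  # Report the loss of the latest epoch only.
    nlp.update(examples, sgd=optimizer, losses=losses)

print(losses)
```

With so little data the resulting model is not useful; the point is only the shape of the annotations and of the update loop.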

WingCode

Duygu Altinok
I see that spaCy supports distributed training via Ray. Are the text-preprocessing and inference pipelines also distributed?

Duygu Altinok

No, not supported due to Cython reasons.

Krzysztof Ograbek

Duygu Altinok Is it possible to perform stemming using spaCy? If yes, how? If not, why? I know there is lemmatization but I couldn’t find stemming

Duygu Altinok

Thanks for the question!
No, we don’t have a stemming component.
You can hunt down the root of a word either by lemmatization or stemming. Lemmas are more useful in general, so we prefer using lemmas.
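
To make the difference concrete, here is a toy sketch in plain Python. The suffix list and the lemma table are invented for this illustration; this is not how spaCy's lemmatizer or any real stemmer is implemented.

```python
# Toy illustration only: a crude suffix-stripping "stemmer" versus a
# dictionary-based "lemmatizer". Both tables are made up for this sketch.
SUFFIXES = ("ing", "ies", "es", "ed", "s")
LEMMAS = {"went": "go", "studies": "study", "better": "good", "mice": "mouse"}

def toy_stem(word):
    """Chop off the first matching suffix; the result may not be a real word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_lemma(word):
    """Look the word up in a table of dictionary forms."""
    return LEMMAS.get(word, word)

for word in ["studies", "went", "mice"]:
    print(word, toy_stem(word), toy_lemma(word))
```

The stemmer can only cut characters off, so it cannot map "went" to "go", while the lemma is always a dictionary form. In spaCy the lemma of a token is available as `token.lemma_` once a pipeline with a lemmatizer has been loaded.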

Evren Unal

Duygu Altinok this question may be out of context.
I want to build an “arabic to turkish” dictionary.
I surmise that i need to use a stemmer for it.
Would you suggest a tool for arabic word stemming?

Duygu Altinok

Hey ho!
For this task, you need a lemmatizer; a stemmer won’t be of much help.
I know two good Arabic lemmatizers, Farasa https://farasa.qcri.org/lemmatization/ and Madamira https://camel.abudhabi.nyu.edu/madamira/
If you need further help, you can always open an issue at spaCy discussions 🙋‍♀️

Evren Unal

Thank you very much🙂

Noa Tamir

Hi Duygu Altinok and thanks for joining us for the Q&A this week!
I’ve been using spaCy a bit recently to detect sentences in legal text. I was really impressed with the documentation and how easy it was to get started.
For now, I’m particularly interested in the more advanced chapters, as I can see myself needing to tweak the model to achieve better results and haven’t gotten into that yet.
Would you say the learning curve for the advanced chapters is as quick as getting started or does it get steeper as you progress? What other domains / resources would be useful to speed up mastery of the advanced chapters?

Duygu Altinok

I’d say no, you need to spend more energy on the advanced chapters 🙂 The advanced chapters use information from previous chapters, so one really needs to melt everything learnt so far into one pot to digest the content. Still, putting the knowledge together and seeing how the code ends up is fun 😉

Tim Becker

Hi Duygu Altinok thanks for being here and answering our questions. I would like to ask:

Duygu Altinok

Thanks for the questions!

Tim Becker
  • What project would you recommend to do if you want to get started with spaCy?
Duygu Altinok

Universe has a number of projects: https://spacy.io/universe
All of the resources are great. I can recommend negspaCy, EpiTator and Rule-based Matcher Explorer for newcomers 😉

Tim Becker
  • Are there drawbacks of using spaCy in comparison to other tools?
Duygu Altinok

I don’t know any 🙂

Tim Becker
  • What is new in spaCy 3.0? Do you cover the new features in your book?
Duygu Altinok

That’s a long list 🙂 I can recommend Ines’s video (https://www.youtube.com/watch?v=BWhh3r6W-qE) and this great page: Rule-based Matcher Explorer
My book is spaCy 3.x compatible. I focused on how to create applications with spaCy rather than on spaCy internals.

Tim Becker

thank you Duygu Altinok

Allan

Hello Duygu Altinok, appreciate your taking the time this week to answer questions!
For someone getting started in NLP, would learning it using Spacy be appropriate? Or is there too much performed automatically by the library/framework that one would miss out on some basics?

Duygu Altinok

Thanks for the question!

Duygu Altinok

I’d say it’s a good start, because one needs to understand the linguistic concepts to grasp the spaCy concepts.
If I were starting out, I’d do the following:

  • Get started with Keras
  • Read Jurafsky’s chapters 1-7
  • Get started with spaCy
  • Read Jurafsky chapters 12-14
  • Become more accustomed to working with Keras, work on some easy/mid-level text classification projects
  • Advance with spaCy
  • Do some seq2seq models with Keras
  • Read rest of Jurafsky

More advanced NLP colleagues (including me 🙂) design and use different types of seq2seq architectures for different tasks. In the end, you should always have a clear mind while coding: almost all text-based models are based on a linguistic concept. Without grasping the linguistic concepts, it’s easy to get lost. This is why I recommend learning spaCy; while using the library you get a chance to work with different linguistic concepts. Hope it works 🙂

Duygu Altinok

Forgot to give the link to the Jurafsky book: https://web.stanford.edu/~jurafsky/slp3/

Allan

Thanks Duygu Altinok !

Allan

Another question… how robust are the pre-trained pipelines? In what scenarios would one need to (re)train their own model as opposed to just using the pre-trained ones.

Duygu Altinok

Pretrained pipelines are quite good and usable (at least for German, English, French, Spanish and Portuguese; I tried those myself 😁)
For some projects you definitely need your own NER, but for others you don’t, if your design only includes general entities such as location, person, time/date and money. Here’s some example code for parsing restaurant reservation utterances (extracting entities and intent): https://github.com/PacktPublishing/Mastering-spaCy/tree/main/Chapter10

To take part in the book of the week event:

  • Register in our Slack
  • Join the #book-of-the-week channel
  • Ask as many questions as you'd like
  • The book authors answer questions from Monday till Thursday
  • On Friday, the authors decide who wins free copies of their book

To see other books, check the book of the week page.
