

Natural Language Processing with Transformers

by Leandro von Werra, Lewis Tunstall, Thomas Wolf

The book of the week from 25 Apr 2022 to 29 Apr 2022

Since their introduction in 2017, transformers have quickly become the dominant architecture for achieving state-of-the-art results on a variety of natural language processing tasks. If you’re a data scientist or coder, this practical book shows you how to train and scale these large models using Hugging Face Transformers, a Python-based deep learning library.

Transformers have been used to write realistic news stories, improve Google Search queries, and even create chatbots that tell corny jokes. In this guide, authors Lewis Tunstall, Leandro von Werra, and Thomas Wolf, among the creators of Hugging Face Transformers, use a hands-on approach to teach you how transformers work and how to integrate them in your applications. You’ll quickly learn a variety of tasks they can help you solve.

  • Build, debug, and optimize transformer models for core NLP tasks, such as text classification, named entity recognition, and question answering
  • Learn how transformers can be used for cross-lingual transfer learning
  • Apply transformers in real-world scenarios where labeled data is scarce
  • Make transformer models efficient for deployment using techniques such as distillation, pruning, and quantization
  • Train transformers from scratch and learn how to scale to multiple GPUs and distributed environments

Questions and Answers

leandro von werra

Hi everyone, looking forward to answering all of your questions!

Lewis Tunstall

G’day everyone, it’s great to be here 🤗 ! Looking forward to your questions too 🙂

Igor Dal Bo

Hi guys, thanks for taking the time to answer our questions 🙂

  • what is the knowledge level needed for this book?
  • which role do transformers play in topic modelling?
Lewis Tunstall

Hi Igor, thanks for your questions!

  1. The book assumes some familiarity with deep learning and PyTorch, so we recommend reading it after working through e.g. the course / book.
  2. Are you referring to unsupervised topic modelling, e.g. cluster my texts into a set of topics? For this, there’s a neat library called BERTopic that gets quite good results in my experience. The alternative is to use sentence-transformers to embed your documents and then apply clustering on the embeddings; this too gives quite good results.
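The embed-then-cluster idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the API of BERTopic or sentence-transformers: the `embeddings` list is hypothetical toy data standing in for vectors an embedding model would produce, and `kmeans` is a small hand-written helper.

```python
import math

def kmeans(points, k, iters=20):
    """Tiny k-means: returns one cluster index per point."""
    # Deterministic init for this sketch: the first k points are the centroids.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels

# Hypothetical 2-d "document embeddings" (real ones have hundreds of dimensions).
embeddings = [(0.9, 0.1), (0.1, 0.9), (0.8, 0.2), (0.2, 0.8)]
labels = kmeans(embeddings, k=2)
```

Documents that share a label would then be read together to decide what topic the cluster represents.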
Igor Dal Bo

thanks Lewis Tunstall for your answers 🙂

  • I have had that course in mind for a long while, and your suggestion adds another point in its favour 😄
  • that is exactly what I wanted to know, thanks a lot for pointing me to the right resources. So far I have been working on a supervised approach, but defining a topic dictionary manually requires a lot of time and knowledge.
Gonzalo Ancochea Blanco

Hi!! Thanks Leandro, Lewis and Thomas for sharing some of your time!
Through your QA chapter I became very interested in Haystack. Retrieval+extraction/generation is very relevant for our team’s needs and I’ve seen they have more nodes but we also need flexibility to add any other NLP components to the pipeline (e.g. NER, topic modelling, sentiment…)
Do you have any experience with building custom Haystack nodes and using the framework to implement full pipelines, and if so what is your take on it? Any alternatives? Thanks again 😄

Lewis Tunstall

Hi Gonzalo, thanks for your question!
I believe that haystack did a major overhaul of their Pipeline abstraction in v1 (which unfortunately came out after our book was in production), and my understanding is that it now supports several NLP tasks like summarization (i.e. summarize the retrieved docs):
For a completely custom pipeline, I’d suggest joining their Slack group:
They’re really responsive and helpful for these kinds of questions!

Gonzalo Ancochea Blanco

Thanks a ton Lewis Tunstall I will check out the source code of Pipeline and definitely join the Slack group!

Alexander Seifert

So awesome, good to have you guys here! My first question would be: Being both NLP practitioner and book publisher I know that (1) you often can’t include everything you want into a book, and (2) the field of NLP moves faster than the printing presses. Were there things you had to cut that you would have liked to include, and what advances in the field that came about since you handed off the manuscript make you excited to have included in the next edition of the book if there was one?

leandro von werra

When we started working on the book multi-modality was not really a thing yet. Since then CV, audio, tabular applications etc. were also conquered by transformers. We were able to give a brief overview of what exists in the last chapter but obviously it would have been nice to include them in more detail and show how to train them. Maybe v2 of the book needs to be renamed to Machine Learning with Transformers 🙂
It helped a lot to work closely with the maintainers of transformers and datasets to plan a bit around upcoming features or features that would soon be deprecated. E.g. the streaming API used in the Training transformers from scratch chapter was finalized at the same time as we wrote the chapter.

Alexander Seifert

I heard that you wrote the entire manuscript inside Jupyter notebooks. How was your experience with that toolchain for writing a whole book, and what did O’Reilly think about that? I imagine it’s still somewhat unusual.

Lewis Tunstall

We used the fastdoc library to handle all the conversion from notebooks to Asciidoc (which is O’Reilly’s preferred format).
Overall we found the experience was rather smooth, and we were lucky that O’Reilly already had experience from the fastai book 🙂
We heard from the production editors that they’re getting more submissions in this format, so I think notebooks might become the norm for these kinds of publications - it really makes writing a breeze when everything is in one place!

leandro von werra

One point to add here is that the O’Reilly team was happy to keep the notebook -> asciidoc conversion pipeline in place until the very last reviewing process, which kept the notebooks in sync with the formatting/copy-editing changes.
So it was really nice of the production team to adapt to this new workflow. As a nice side effect the whole book is nicely git version controlled 🙂

Max Payne

Hi, great project and book indeed,
I have a question regarding explainability, which is gaining a lot of attention. With such a large model/training data, is it possible for these APIs to ‘explain’ their decision?

Lewis Tunstall

Hi Max, thanks for the question! As far as I know, explainability of transformers is still an open research problem and one that we didn’t have space to cover in the book.
One of the best examples I know of is Jay Alammar’s ecco library, which uses various visualisation techniques to understand which factors lead to a specific prediction. It doesn’t give you an explanation “for free”, since you have to do some work to interpret these results, but I think it might be interesting to you.

Max Payne

Thanks a lot

Max Payne

Also, regarding explainability, don’t you think that the paper ‘Pretrained Transformers as Universal Computation Engines’ sort of makes it even harder, since it claims something like there is something special in the architecture that makes it generalizable and hence making it more difficult to justify the outputs?

Alexander Seifert

I am working on a human-in-the-loop NER system, where I’d like to use the prediction scores as a proxy for correctness. As far as I understand, this is called “uncertainty estimation” and it is not something that transformer models are providing out-of-the-box. Is there content on this problem in the book, and if not, are there any good resources you would recommend?

leandro von werra

So if by prediction scores you mean the model’s output probabilities, then you can actually get them fairly easily from the transformer models: for each token the model will return the logits, and by applying a SoftMax function you’ll get each class’s probability.
In the book we show how to get the logits from the model for NER so the only thing that’s missing to get probabilities is applying SoftMax. Hope that helps!
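A minimal sketch of that last step, assuming the logits for a single token are already available as a plain list. The class names and logit values are hypothetical, and the code is not taken from the book:

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one token over three made-up NER classes.
class_names = ["O", "B-PER", "B-LOC"]
token_logits = [2.0, 0.5, -1.0]

probs = softmax(token_logits)
predicted = class_names[probs.index(max(probs))]
```

The largest logit always maps to the largest probability, so the predicted class is unchanged; only the scale of the scores changes.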

Alexander Seifert

the problem is that the softmax will indeed squish the logits into a probability distribution (formally), but this distribution is unfortunately poorly calibrated to the correctness likelihood:


Thanks for this wonderful book! I am looking forward to reading it. Writing a book is a very time consuming task so I have two general questions: 1) how long did it take you to do it? 2) what was the most challenging section of the book to write?

leandro von werra
  1. It took us roughly 1.5 years to write the book. Since we did it in our spare time this means that we mostly worked on it on weekends 🙂
  2. For me working on the few labels and pretraining from scratch chapters was the most challenging. The few labels chapter required a lot of experimentation and research to first figure out what works well. The pretraining chapter was one of my first contacts with distributed training and in addition we used a lot of features of the Hugging Face ecosystem that were brand new.
    Maybe the others have other chapters they found challenging 🙂

Hi, I am a beginner in this field. I heard from the above comment that you use Jupyter notebooks to write your code. Can I follow your code step by step if I use Colab?

leandro von werra

Yes you can! See the table with Colab links in the book’s repository:


The NLP task I am interested in is text summarization. Is there anything in the book related to this topic?

leandro von werra

You could look at chapter 6 where we show how to train a transformer model on text summarization. All the book’s code is freely available in the repository (but without the text from the book, so no explanations).

Amruta Ranade

Hi leandro von werra, the concept of transformers is new to me and am learning about their architectures and functions currently. I was interested in knowing how transfer learning can be adopted with the transformers and are there any specific applications that you have covered in this book?

leandro von werra

Hi Amruta Ranade, yes transfer learning can be applied to transformers and is in fact the driving force behind their success. Except for the last chapter where we train a model from scratch all chapters show how to apply transfer learning to tune a pretrained model on a specific task such as classification, NER, QA, or summarization.

Amruta Ranade

Thank you for the information! Sounds interesting to me. I am definitely going to buy your book! 😊

Sitao Zhang

Congrats HF team!!
This is Sitao from the J&J team, really glad to see that you guys published this wonderful book and that it was chosen as the book of the week. 👏🎉
From just a glance at the book description, I’m curious about:

  1. What’s the major difference between this book and the HF documentation online?
  2. Since I have been leveraging transformers for a while, which part/parts of the book do you most recommend for people who have some experience?
  3. Not sure whether HF BigScience will be covered here? I know BigScience is super recent, maybe similar concepts will be discussed in the book?
    Thank you guys!👍
leandro von werra

Hi Sitao, thanks for being here!

  1. What we try in the book is to be much more practical than the documentation, which is why each chapter aims to solve a realistic use-case. So it involves more steps like preparing your data or doing some error analysis of your model.
  2. In general the later chapters in the book are more advanced (e.g. making transformers efficient for prod, few labels or pretraining from scratch). Almost every chapter also tackles a different task (classification, NER, QA, summarization etc.). So you could also pick a task that you haven’t encountered before.
  3. Unfortunately, we don’t go into detail of BigScience. The very last chapter looks a bit ahead especially at the scaling trend of transformers and multimodal transformers.
Lalit Pagaria

Thanks team for conducting this QnA.
Have following query -
HF is a leader in the practical deployment of transformer models and is spearheading collaborative research initiatives like BigScience. What practical pre-processing pipelines are a must before giving input to transformers in a production setup?

leandro von werra

So I think this can be divided into two categories:

  • data cleaning: before you train your model you want to make sure that the data is as clean as possible. for BigScience we spent a lot of time cleaning and filtering the data. this includes: removing duplicates, deleting html code from websites and so on. whatever you do to your training data here you should also do to the inputs in production. always clean your data 🙂
  • tokenization: besides splitting the string into tokens, there are a few steps that the tokenizer takes care of, such as unicode normalization or lowercasing all strings. as long as you use the same tokenizer in prod as you used during training you should be fine.
    hope that answers your question! 🤗
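The cleaning steps named above can be sketched with the standard library alone. This is an illustrative assumption about what such a pipeline might look like, not the BigScience pipeline; `clean_text` and `deduplicate` are hypothetical helper names, and the point is that the same `clean_text` would be called on training data and on production inputs.

```python
import html
import re
import unicodedata

TAG_RE = re.compile(r"<[^>]+>")

def clean_text(text):
    """One cleaning function, shared by training and production code paths."""
    text = html.unescape(text)                   # "&amp;" -> "&"
    text = TAG_RE.sub(" ", text)                 # drop HTML tags
    text = unicodedata.normalize("NFKC", text)   # unicode normalization
    return " ".join(text.split())                # collapse whitespace

def deduplicate(texts):
    """Remove exact duplicates after cleaning, keeping first occurrences."""
    seen, unique = set(), []
    for t in map(clean_text, texts):
        if t not in seen:
            seen.add(t)
            unique.append(t)
    return unique

raw = ["<p>Hello &amp; welcome</p>", "Hello   & welcome", "<b>Bye</b>"]
cleaned = deduplicate(raw)
```

Deduplicating after cleaning catches documents that differ only in markup or whitespace, as the first two toy inputs do.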
Lalit Pagaria

Yes it answers partially. But I was looking for a deeper answer. 🙂
More of practical example

leandro von werra

I guess the deeper answer depends on your use-case 🙂 To work as well in production as on your local test set the most important thing is to make sure the data preprocessing steps are the same and the test set reflects the production data.
The second aspect is the performance on the test set: data cleaning is usually very important here but again which particular steps are necessary depends on your use-case and dataset.

Lalit Pagaria

In a low-resource setup, is deploying a single multi-head transformer better suited, or serially connecting multiple task-specific transformers? For example if we need translation + classification + summarization on the same given input.

leandro von werra

I experimented in the past with creating a single model for three classification tasks and thus creating a model with three heads. the motivation was more to have just one large model in production instead of three and we didn’t see a significant performance improvement. Also this was the same task type just three different classification criteria.
A model that works well on a wide range of tasks at the same time is T5. Maybe you can tune it for the tasks you are trying to cover.
Alternatively one thing that can work well in a low-resource setting is to use models that are already trained on e.g. a summarization task instead of the raw pretrained models. We do this in the QA chapter: we use a model that has been trained on SQuAD and train it a bit more on the Amazon dataset, which gives better results.
Which (if any) of these suggestions works depends a lot on what your actual use-case looks like. My general advice is to build a good evaluation set that captures your use-case and lets you easily evaluate your models. Start by creating a few simple baselines and then iterate quickly on different ideas. Without the evaluation pipeline the iterations will be slow.

Lalit Pagaria

Thank you leandro von werra
Yes trial and error is the way forward


Huggingface has done a tremendous job in making transformer based models accessible to a lot of people. As a result, trying these SOTA models is no longer difficult, which has enabled organisations of all scales to make use of them.

  1. How do you feel about the widespread applicability of these models? Are there any cool use-cases in maybe some obscure domain where transformers are working their magic and which you’d like to share?
  2. The popularity and ease of use of Huggingface means even folks without a DS background are able to use the power of BERT. Most of the time this leads to models that are built without taking bias and other components of responsible AI into consideration. What do you think can be done to tackle this (if it needs to be tackled)?
    Thanks for the book! Definitely need to read this.
leandro von werra

Hi A, thanks for being here!

  1. I am personally very excited about ML in science. I think any tool that enables faster iteration in science will have a huge impact. One of my favourite applications here is AlphaFold from DeepMind. Predicting how proteins fold is a critical task for making progress in chemistry and biology, and until AlphaFold the only way to get good predictions was through very expensive and long experiments that could take weeks or months. With AlphaFold the same researchers can now iterate through 100s of molecules in a day, which changes the game completely.
  2. I would even argue that most biased models that make headlines these days were deployed by DS folks 🙂 I am not an expert here and maybe Thomas Wolf has more to share. My two cents are that thinking about and investigating the bias and impact of a deployed model should become a core task of the DS lifecycle. This requires documenting exactly what the model was trained on and how, and rigorously testing how it performs in different situations. There is a lot of work going on in BigScience for example where people work on showing how a large language model can be built and released responsibly.
Mischa Ungermann

Hi guys, thanks for taking the time for this Q&A!
I was working with mostly CV topics for the last years and now I am curious to get an insight into this world of NLP everyone is talking about. For this I was wondering what resources you would recommend as an entry point to the field? I could imagine a certain book will be high on the list, and probably also the Hugging Face course, but maybe you know of more gems?

leandro von werra

Indeed our book is NLP beginner friendly and the main requirements are the core ML concepts (what’s a train test split, how to build and train a neural network, standard metrics etc). So if you are familiar with CV that won’t be an issue 🙂
I know a lot of people got into NLP with Chris Manning’s excellent lecture series: Also the popular fastai course covers some NLP:

leandro von werra

So many great questions so far! Thank you and keep them coming! 🤗

Max Payne

It’s the first time I have seen a book (predominantly targeting a specific API) and also addressing the topic of ‘Text Generation’ (i.e. beam search, greedy decoding etc.). How does HuggingFace deal with these methods - I mean do we as a user have a choice in selecting the method? or the model has a default setting? (Sorry if this question is answered in the book)

leandro von werra

Yes, we go into that in more detail, but to give you the short summary here:
Every model in the transformers library that is able to do text generation has a .generate() method. This method is loaded with options to do all sorts of generation strategies: greedy, beam search, sampling (with top-k or nucleus), repetition penalties etc. You can also define your own stopping criteria or processors to be applied.
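To make the strategies named above concrete, here is a toy sketch of greedy decoding, top-k sampling and nucleus (top-p) sampling over a single next-token distribution. It deliberately does not call the transformers library: the token probabilities are hypothetical and the helper functions are illustrative, not the library's implementation.

```python
import random

def greedy(probs):
    """Pick the single most likely token."""
    return max(probs, key=probs.get)

def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}

def nucleus_filter(probs, top_p):
    """Keep the smallest set of top tokens whose mass reaches top_p."""
    kept, mass = [], 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, p))
        mass += p
        if mass >= top_p:
            break
    return {tok: p / mass for tok, p in kept}

def sample(probs, rng):
    """Draw one token according to the given probabilities."""
    tokens, weights = zip(*probs.items())
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical next-token distribution.
next_token_probs = {"cat": 0.5, "dog": 0.3, "car": 0.15, "the": 0.05}
rng = random.Random(0)

choice_greedy = greedy(next_token_probs)
choice_top_k = sample(top_k_filter(next_token_probs, k=2), rng)
choice_nucleus = sample(nucleus_filter(next_token_probs, top_p=0.9), rng)
```

Greedy decoding is deterministic, while the two sampling variants trade some likelihood for diversity by cutting off the unlikely tail before drawing a token.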
see here
Patrick von Platen wrote a great article about that function and how to use it:

Max Payne

Awesome. Thank you!

Marcello La Rocca

Hi! Congrats on the book, it looks awesome!
Question: what would you say is the main advantage of using Huggingface’s transformers vs say GPT-3? What could be a disadvantage?

leandro von werra

There are a number of advantages:

  • you can host the models on your own infrastructure - for many companies sending data to an outside API can be complicated, especially if the data is sensitive
  • you have a lot more flexibility to train the models and iterate through different training setups. you can also customize your architecture to your use-case
  • depending on your setup hosting your own models can be much cheaper, especially if you have heavy workloads. Nils Reimers wrote a thread about this:
Marcello La Rocca

Thanks Leandro! Indeed, that totally makes sense.
Also thanks for the link! 🙏

Marcello La Rocca

2nd question (partially related, but also more about the book content): what’s the hardest challenge in scaling transformers?

leandro von werra

oh there are many, hard to pinpoint a single hardest 🙃 just to list a few:

  • engineering: training such models requires a large distributed infrastructure with 100-1000s of GPUs. you need some resilience when some of them crash and you need to make sure they are optimally used. also at that scale you can be haunted by many numerical instabilities. in BigScience we are training such a model in the open and you can read the chronicles of the experiments:
  • dataset: training these beasts requires a lot of data (~TBs of text). at the same time you need to make sure that the data is clean, which gets harder at that scale. you can’t just look at it
  • release: how can one responsibly release such models. there is a lot of work going on about for example licensing.
    maybe Thomas Wolf has some more insights here :)
Marcello La Rocca

Ah, interesting! How can you deal with the numerical instabilities? I imagine they also propagate and get larger and larger 🤔
I’m looking forward to reading more in your book 🙂

Marcello La Rocca

Also what do you think about “sparsity-aware inference engines”?

Marcello La Rocca

3rd one: Do you think that transformers will be possibly applied beyond text/vision/speech? Maybe reasoning?

leandro von werra

Currently it seems that transformers are penetrating almost every field in ML. on the topic of reasoning you might find this recent paper by Google interesting:

Marcello La Rocca

Wow, that’s impressive!!! Thanks a lot for the link
58.1% on GSM8K is impressive (though also a reminder of long path ahead!)

Marcello La Rocca

(thanks a lot!!!)

Evren Unal

Thank you for participating in this event.
I would like to build a language translator some time in the future.
Q1) Can I use transformer technology to build it?
Q2) if so, did you give enough information in your book to build it?

leandro von werra
  1. definitely. there are for example ~1000 translation models already on the huggingface hub! see: you can either use them out of the box or further tune them on your specific data!
  2. we don’t go into details on translation specifically, but we show summarization in more detail. conceptually the task is very similar (one input text and an output text) and we also show and explain BLEU which is often used in translation.
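For readers unfamiliar with BLEU, a simplified version can be sketched as below. This is an assumption-laden toy, not the book's code or a reference implementation: it handles one sentence with a single reference, uses only 1-grams and 2-grams (standard BLEU goes up to 4-grams and is usually computed over a corpus), and applies no smoothing. `simple_bleu` is a hypothetical helper name.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(candidate, reference, max_n=2):
    """Simplified sentence-level BLEU with a single reference."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clipping: an n-gram is credited at most as often as it occurs in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: punish candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    # Geometric mean of the n-gram precisions, scaled by the penalty.
    return bp * math.exp(sum(log_precisions) / max_n)

perfect = simple_bleu("the cat sat on the mat", "the cat sat on the mat")
partial = simple_bleu("the cat sat", "the cat sat on the mat")
```

The second candidate has perfect n-gram precision but is only half as long as the reference, so the brevity penalty alone pulls its score well below 1.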
Evren Unal

thank you very much 👍

armin zirak

Hi all!
Thanks for the great book!
Q: If I want to use transformers for numeric data instead of text, how can I do that? Do you think feeding the numbers as text does the job in the best way or is there any specific solution?

Lewis Tunstall

Hi Armin, thanks for your question! There do exist transformers like SAINT for numerical features like the ones you would find in tabular data.
However, it seems that gradient boosting remains the state-of-the-art and it might be a while before transfer learning is truly cracked for tabular data (see plot).
There’s a very nice survey of these models here:


Hi, thanks for taking the time to answer questions here! Looking forward to the book!
Does the book address issues related to deploying and using transformers in production settings?

Lewis Tunstall

Hi Allan, yes Chapter 8 covers various techniques related to optimisation including:

  • Knowledge distillation
  • Quantization
  • Graph optimization with ONNX Runtime
  • Pruning (although at the time of writing it wasn’t easy to prune transformers effectively)
    The end result is a figure like this which shows how you can reduce the latency of a BERT model by ~7x 🙂
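As an illustration of the idea behind quantization from the list above, here is a minimal sketch of symmetric int8 quantization of a few weights. It is a toy under stated assumptions, not what ONNX Runtime or PyTorch do internally: real toolkits choose scales per tensor or per channel and may calibrate on data. The weight values and helper names are hypothetical.

```python
def quantize_int8(values):
    """Symmetric int8 quantization: returns (ints, scale)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0  # avoid a zero scale
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Map the integers back to approximate float values."""
    return [x * scale for x in q]

# Hypothetical float weights.
weights = [-1.0, -0.25, 0.0, 0.5, 1.0]

q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight is now stored in one byte instead of four, at the cost of a rounding error of at most about half a quantization step (`scale / 2`).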


Shantanu Ladhwe

Hey, thanks for the amazing book. I am still reading it, and really liked the explanation of transformers there.
Question: How different would Google’s next AI architecture, Pathways, be from the Transformer architecture?

leandro von werra

Hi Shantanu Ladhwe, not an expert on Pathways but as far as I understand it is mainly a vision towards multi-tasking, multi-modality, and maybe a switch to sparsity (see Jeff Dean’s article here). This vision could be (and will likely be) realized with a Transformer architecture. There are already multi-tasking (e.g. T0pp), multi-modality (e.g. CLIP, Perceiver), and sparse (e.g. SwitchTransformer) transformer models out there.
The Pathways Language Model (PaLM) uses a classical transformer decoder architecture and the main innovation is around infrastructure to train the model efficiently on 6000 TPUs.

Max Payne

I wish I could use a Transformer here to generate questions (and hence increase my chances of getting the book) :-)

To take part in the book of the week event:

  • Register in our Slack
  • Join the #book-of-the-week channel
  • Ask as many questions as you'd like
  • The book authors answer questions from Monday till Thursday
  • On Friday, the authors decide who wins free copies of their book

To see other books, check the book of the week page.
