
Transformers for Natural Language Processing

by Denis Rothman

The book of the week from 19 Apr 2021 to 23 Apr 2021

The transformer architecture has proved to be revolutionary, outperforming the classical RNN and CNN models in use today. With an apply-as-you-learn approach, Transformers for Natural Language Processing investigates in depth the use of deep learning for machine translation, speech-to-text, text-to-speech, language modeling, question answering, and many more NLP domains with transformers.

The book takes you through NLP with Python and examines various eminent models and datasets within the transformer architecture created by pioneers such as Google, Facebook, Microsoft, OpenAI, and Hugging Face.

Questions and Answers

Krzysztof Ograbek

Denis Rothman thank you for doing this!! Maybe I’ll start with this question: Do transformers outperform RNNs in any kind of NLP tasks? What makes transformers so great?

Rodney Silva

Denis Rothman Hi! How should a beginner implement transformers? Using Hugging Face, Google Trax, MS Azure, or implementing from scratch to keep control of maintenance?

Mert Bozkır

Denis Rothman What do you think about the future of Natural Language Processing? Where will transformers fit in that future?

Lalit Pagaria

Denis Rothman Aren't transformers constrained by compute resources? What is the way forward for startups with fewer resources?

Vladimir Finkelshtein

When do you think we will see transformers for non-NLP tasks? There seem to be papers and attempts with transformers for vision, but no breakthrough, or at least no mainstream adoption of it.

Denis Rothman

I’m live so you can ask any question you wish.

Denis Rothman

Krzysztof Ograbek There are to reasons transformers outperform RNNs:

Denis Rothman

Krzysztof Ograbek I mean there are “two” 🙂 main reasons: 1. Optimal transport: RNNs carry all of the information from word (or token) to word and pile it up in a big backpack. 2. The architecture of RNNs is obsolete, with variable-sized layers; transformers have fixed-size, industrialized layers. I explain this in Chapter 1 of my book, and also in this video:

Denis Rothman

Krzysztof Ograbek This is the link to the video https://youtu.be/tQTpCvZ1-0w

Denis Rothman

Krzysztof Ograbek Proof? Transformers have wiped RNNs off the top ranks of the SuperGLUE leaderboard:

Denis Rothman

Krzysztof Ograbek Here is the link to SuperGLUE: https://super.gluebenchmark.com/leaderboard/
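To make the contrast above concrete, here is a rough NumPy sketch (toy shapes, random weights, nothing from the book) of an RNN dragging a hidden state from token to token versus attention relating all tokens at once with fixed-size matrix operations:

```python
# Rough sketch: an RNN carries a hidden state from token to token (the "backpack"),
# while attention relates all tokens in one shot with fixed-size matrix operations.
# Toy shapes, random weights, no training.
import numpy as np

tokens = np.random.randn(5, 8)             # 5 tokens, embedding size 8 (toy values)

# RNN-style: sequential, each step depends on the previous hidden state.
W_h, W_x = np.random.randn(8, 8), np.random.randn(8, 8)
h = np.zeros(8)
for x in tokens:                           # cannot be parallelized across positions
    h = np.tanh(W_h @ h + W_x @ x)

# Attention-style: every token attends to every other token in one matrix product.
scores = tokens @ tokens.T / np.sqrt(8)    # all pairwise similarities at once
scores -= scores.max(axis=-1, keepdims=True)
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
context = weights @ tokens                 # each row mixes information from all tokens
```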

Denis Rothman

Rodney Silva A beginner should begin by understanding the original Transformer model explained in Chapter 1 of my book. Even with a piece of paper and a pencil! Or a spreadsheet!

Denis Rothman

Rodney Silva Implementing it in a real-life project is something else entirely. If it's a low-level, non-strategic project that isn't really necessary and therefore has only a small budget, you can implement almost anything you find, since the impact of the solution will be limited! If you are implementing transformers in a critical project, that's something else…

Denis Rothman

Rodney Silva The first question is: what is the goal of the project? You have to really understand it. Then go hunting for the right data with standard software approaches. This can take anywhere from one month to six months, a year, maybe even two! In the meantime, you need to give your customer or employer a reason to keep you on the project so it stays alive! If you understand the project, you can begin with a useful user interface that lets the users visualize and understand their data and run short but productive tasks. …

Denis Rothman

Rodney Silva If you can get your project into cruise mode as described above, then you have plenty of time to build a transformer prototype with the model best suited to the project in terms of the SLA (Service Level Agreement) with the customer. An SLA is standard IT practice that is well documented and contractual.

Denis Rothman

Mert Bozkır Your question: What do you think about the future of Natural Language Processing? Where will transformers fit in that future? My answer: I see transformer-driven NLP (until a new model comes along) on cloud platforms such as AWS, Google Cloud, or IBM Cloud, or another giant such as Microsoft Azure. Why? They have reliable, scalable servers! I doubt anybody can compete without that kind of power to back up the sales. Smaller solutions will survive but will not be critical unless they are partners of the tech giants. NLP will spread out to every area of human activity.

Denis Rothman

Vladimir Finkelshtein Your question: When do you think we will see transformers for non-NLP tasks, including vision? My answer: First of all, let's take vision out of the picture, because there are more legends than actual reality here. Computer Vision (CV) relies heavily on massive non-AI algorithms, and CNNs do a pretty good job. Transformers are kicking in for image captioning. OK, that's it for vision, because vision is in good shape right now: there are a lot of solid AI and non-AI tools. Now, outside NLP, you have transformer-driven recommenders! They can be applied to e-commerce behavior prediction and can also be used in manufacturing. See my video here: https://www.youtube.com/watch?v=tQTpCvZ1-0w&list=PL9uLp9IOO56Gv0YEnRWdrIZOPKwRBHnJe&index=10&t=1s

Vladimir Finkelshtein

Denis Rothman Thanks, awesome examples. If I understand correctly, the tasks are just naturally reframed as NLP tasks (e.g. in recommendation systems: sentences = list of products that one likes, recommendation = fill the mask).
As for the optimal transport solution, it wasn’t quite clear how the rewards are used. Do you just pick a threshold, for which you keep a sequence of actions if its reward is above the threshold, and throw away bad sequences? Also, I didn’t quite understand how initial and final distributions are encoded, but I guess there are more details in the book…
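As a rough illustration of the fill-the-mask framing mentioned above, here is a minimal sketch using the Hugging Face fill-mask pipeline on plain English. A real recommender would be trained on sequences of product IDs rather than words, so the sentence below is only a stand-in:

```python
# Minimal sketch of "recommendation = fill the mask": ask a masked language model to
# complete a sequence. An off-the-shelf English model is used here purely for
# illustration; a real recommender would be trained on product-ID sequences.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

sequence = "the customer bought a laptop, a mouse and a [MASK]."
for prediction in fill_mask(sequence, top_k=3):
    print(prediction["token_str"], round(prediction["score"], 3))
```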

Vladimir Finkelshtein

As for vision, for example this paper (https://arxiv.org/abs/2010.11929) claims that transformers achieve almost SOTA or SOTA performance with significantly reduced training resources. So perhaps scalability can play a role in adopting transformers there.
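For what it is worth, the ViT checkpoints from that paper are available off the shelf through Hugging Face; a minimal sketch, assuming the publicly released google/vit-base-patch16-224 checkpoint and a local image file:

```python
# Minimal sketch: running a Vision Transformer (ViT) checkpoint from the paper above
# through the Hugging Face image-classification pipeline. "cat.jpg" is a placeholder
# for any local image file.
from transformers import pipeline

classifier = pipeline("image-classification", model="google/vit-base-patch16-224")
print(classifier("cat.jpg")[:3])           # top predictions for the image
```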

Denis Rothman

Vladimir Finkelshtein Now, let's take your question a bit further for non-NLP transformers. I see powerful solutions being kept secret by tech giants. For example, in terms of recommenders, China has access to far more data, in volume and in information, than anybody else in the world. They are advancing toward a digital society and are many years ahead of the West. For example, they have a national digital currency that is expanding with information (like VISA with its tens of thousands of transactions per minute, but China is more like 300,000 transactions per minute). You can easily imagine the patterns you can draw from that with sequences of customer behavior.

Denis Rothman

Lalit Pagaria Your question: Aren't transformers constrained by compute resources? What is the way forward for startups with fewer resources? My answer: Let's go beyond the hype and into the real topic! If you use a nice, simple, robust model BUT have excellent datasets, you can even beat GPT-3, which used a supercomputer!!! Proof? The Pattern-Exploiting Training (PET) method I describe in the introduction to Chapter 6 of my book: an average-sized transformer model trained on an optimized, tiny dataset that beat GPT-3 on the SuperGLUE leaderboard!!! More proof? Look at line 13 of the SuperGLUE leaderboard today and you will see PET (Timo Schick) with a modest, accessible computer; then look down at line 16. What do you see today? GPT-3 is 3 ranks behind! Conclusion: humans with imagination can beat super-rich humans and machines any time!
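The core idea behind PET can be sketched in a few lines: reformulate a classification example as a cloze (fill-in-the-blank) pattern and let a masked language model score a small set of verbalizer words. The pattern, verbalizers, and checkpoint below are illustrative choices, not PET's exact setup:

```python
# Rough illustration of PET's cloze reformulation: turn a classification task into a
# fill-in-the-blank pattern and score a handful of "verbalizer" words with a masked LM.
# Pattern, verbalizers, and checkpoint are illustrative, not PET's exact configuration.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

review = "The plot was thin but the acting saved the film."
pattern = f"{review} All in all, it was [MASK]."

# Restrict predictions to the label words for positive / negative sentiment.
for pred in fill_mask(pattern, targets=["great", "terrible"]):
    print(pred["token_str"], round(pred["score"], 4))
```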

Lalit Pagaria

Thank you Denis Rothman for clarification 🙏

Denis Rothman

You are more than welcome! 😊👍🙏

Rodney Silva

Denis Rothman Do I need to build a Transformer from scratch to fully understand it?

Denis Rothman

Rodney Silva No. But you do need to understand the architecture of the Original Transformer described in Chapter 1. You can have a look at the chapter and ask me questions this week in our thread.

Vladimir Finkelshtein

What is your position on positional encodings? They seem like an afterthought and it is not so clear when to use them. I saw suggestions in the forums to try to train models with and without them, because a priori it is not clear in which type of problems they are helpful.

Denis Rothman

Vladimir Finkelshtein Positional encoding is a fundamental part of the architecture of the original Transformer. RNNs carried a separate, cumbersome positional vector.
Transformers ADD the positional encoding to the word (token) embedding. This is a powerful feature: the word's “meaning” takes its position into account, which is critical for grammatical and semantic structures.
That being said, the DeBERTa model trains positional encoding separately.
In any case, positional encoding is a very effective tool.
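A minimal NumPy sketch of the sinusoidal positional encoding from the original Transformer, added directly to the token embeddings as described above (the embedding values are random placeholders):

```python
# Minimal sketch of sinusoidal positional encoding (original Transformer), ADDED to the
# token embeddings rather than carried as a separate vector.
import numpy as np

def positional_encoding(seq_len, d_model):
    positions = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    dims = np.arange(d_model)[None, :]                      # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                   # even dimensions use sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])                   # odd dimensions use cosine
    return pe

embeddings = np.random.randn(6, 512)                        # 6 tokens, d_model = 512 (placeholders)
encoded = embeddings + positional_encoding(6, 512)          # position folded into the embedding
```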

Vladimir Finkelshtein

Are there other metrics outside of the GLUE benchmarks on which NLP models are compared? For example, if I compared regression with decision trees, I could compare sensitivity to outliers, tendency to overfit, reduction of performance due to imbalanced data, robustness, explainability (I guess this one is model agnostic now) etc.
Could we compare which NLP model is more susceptible to biases, because they are overrepresented in the data?

Denis Rothman

Vladimir Finkelshtein Interesting question.
First, some context. SuperGLUE was largely created because the recent arrival of transformers blew the GLUE human baselines out of the sky! SuperGLUE was the solution. Now transformers are again exceeding the human baselines of SuperGLUE.
If you want to use decision trees or any other ML/DL algorithm, just download the SuperGLUE datasets, train any algorithm you wish to explore on them, and see what happens (a quick sketch of loading one task follows below). You might even end up on the leaderboard. Who knows? 🤔
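For reference, the SuperGLUE tasks can be pulled down with the Hugging Face datasets library; a minimal sketch, with the BoolQ task chosen arbitrarily:

```python
# Minimal sketch: downloading one SuperGLUE task (BoolQ, chosen arbitrarily) with the
# Hugging Face `datasets` library, then handing the text to whatever algorithm you like.
# Depending on your `datasets` version you may need trust_remote_code=True.
from datasets import load_dataset

boolq = load_dataset("super_glue", "boolq")       # train / validation / test splits
example = boolq["train"][0]
print(example["question"])
print(example["passage"][:200])
print(example["label"])                           # 0 = false, 1 = true
```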

Zhi Men Lau

Hi Denis Rothman, what is your take on the interpretability of transformers? There is a lot of buzz around explainable ML and interpretability; how do transformers address these issues compared to other technologies?

Denis Rothman

Zhi Men Lau The best Explainable (Interpretable) AI methods are MODEL AGNOSTIC. This means they use the input data to explain very precisely why a result was obtained.
I explain this in my book Hands-On Explainable Artificial Intelligence (XAI).
For example, take an input such as “the coach (bus or sports coach?) was unhappy”…
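One way to make the model-agnostic point concrete is a simple perturbation test: remove one input token at a time and watch how the model's output moves, without ever looking inside the model. A rough sketch (the sentiment pipeline is just a stand-in for any black-box classifier):

```python
# Rough sketch of a model-agnostic explanation: drop one token at a time and see how
# the classifier's output changes. Only inputs and outputs are used, so it works for
# any model. The sentiment pipeline here is a stand-in for any black-box classifier.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
sentence = "the coach was unhappy"
baseline = classifier(sentence)[0]                # e.g. {'label': 'NEGATIVE', 'score': ...}

words = sentence.split()
for i, word in enumerate(words):
    perturbed = " ".join(words[:i] + words[i + 1:])
    result = classifier(perturbed)[0]
    print(f"without {word!r}: {result['label']} ({result['score']:.3f}) "
          f"vs baseline {baseline['label']} ({baseline['score']:.3f})")
```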

Denis Rothman

If you have any questions please feel free to ask me.

Zhi Men Lau

Thanks Denis, didn’t know about your other book on XAI. Going to check it out.

Rodney Silva

Denis Rothman Can you guess what the evolution of transformers will be in the next few years? Less complexity, more efficiency, less training time, etc.?

Denis Rothman

Rodney Silva Transformers will expand to every single field involving sequences including recommenders.
Then a new model, one day, will take over. Mastering transformers will make the transition to the future much faster for developers and designers.

Krzysztof Ograbek

Denis Rothman Thank you for yesterday’s answers. Correct me if I’m wrong, but I believe that training transformers doesn’t require any knowledge about the language itself. I mean to train a transformer we just need a pre-trained word embedding, feed it to the transformer and wait for the results. So the question here. Nowadays, what is the point of learning things like: Linguistic syntax, dependency parser, Part-of-Speech, Named Entities, etc? How does this “language knowledge” make one a better NLP Specialist?

Denis Rothman

Krzysztof Ograbek Good question! In real-life implementations the models take weeks if not months to train. Mastering the linguistics of, for example, the SuperGLUE tasks and translations will save you an incredible amount of time! Otherwise, understanding errors and correcting them can take exponentially longer.
😉

RH

Denis Rothman: If someone is starting their journey into NLP, what is the pathway you would recommend them to take? What knowledge/skills makes a strong industry ready NLP practitioner?

Denis Rothman

I recommend two axes:

  1. Mastering basic calculus, matrix and vector multiplications, and good basic statistics.
  2. Linguistics: knowing what a lexical field is, what semantics is about, and phonology (intonation, for example).

Then everything in NLP becomes easy to understand!

Doink

Denis Rothman Is learning about bag-of-words, TF-IDF, word2vec, RNNs and LSTMs a waste of time? How much do we need to know in NLP? It's an ever-growing field.

Denis Rothman

Nothing you learn is a waste of time. Learn everything you can and spend all the time you have.
You will then quickly discover two fantastic things about yourself:
1. The more you learn, the faster you learn something new. It's exponential!
2. The more you know, the faster you will implement NLP projects. On top of that, you will be able to explain your work much better to others in a team. 👍

Doink

what is the proof of 1. ?

VK

Denis Rothman What type of research was implemented for writing this book?

Denis Rothman

1. First, my background helped: mathematics and linguistics. You can find out more in this article.
2. My previous research, corporate experience and writing books. You can find out more here.
3. Reading hundreds of pages of papers on transformers and exploring all of the source code available, in 24/7 mode. It was quite a fantastic experience! 😎😜😀

VK

Thank you Denis Rothman

Akshat

Denis Rothman Hi Denis, can these transformer models be used in predicting physical quantities in car crash simulations or other numerical simulations?

Denis Rothman

Let's be careful here! Transformers are excellent at predicting sequences. They can be used to predict whether a car crash is probable.
However, for precise numerical series, I would rely more on standard math.

Akshat

Thanks Denis Rothman! 🙂

Rodney Silva

Denis Rothman What’s the most common difficulty for people trying to learn the architecture of Transformers?

Denis Rothman

The most common thing I have seen is a lack of patience and underestimating the work it takes to understand the Original Transformer as described in Chapter 1 of my book. If someone takes the time to understand this chapter, then the rest becomes easier if not easy with some basic knowledge of Linguistics that really helps.

Lalit Pagaria

Denis Rothman this is not a question but an appreciation message. I just received your book and am still in the first chapter. So far so good. Really nice book. 🙏

Denis Rothman

Thank you very much for your encouraging message! Please ask me any questions you may have. 😊🙏

Lalit Pagaria

Thank you Denis 🙏

Denis Rothman

You're more than welcome. 😊

Rodney Silva

Denis Rothman How much time did it take to write this book, from the research till the final version?

Denis Rothman

The time it took requires context:
1. I have been doing AI research on a daily basis for the past 35+ years, from expert systems to ML/DL. In that same period I implemented my research on key corporate sites.
For more, you can peek into my LinkedIn profile starting here.
2. I studied linguistics in college and taught computer science at the same time. I've been studying cognitive science all my life.
3. I'm used to writing papers and books on AI.
That being clarified 😊, I did the research (papers and source code), tested and wrote code, and wrote the book in 10 weeks.
One last point: I'm a workaholic when I write. I'm in 24/7 mode, can get up at any hour of the night because something just popped up, and develop at the dinner table or anywhere.
As long as the book isn't finished, I just keep pounding on the laptop! 😊

Denis Rothman

P. S. Here is the link mentioned in the message :
My 1982 Word2vector-Word Piece model patent led to an AI NLP cognitive chatbot and a Turing-APS model used in major corporations to this day!
https://www.linkedin.com/pulse/did-you-miss-ai-parsing-train-denis-rothman

Alexey Grigorev

10 weeks! Wow! That’s super impressive!

Krzysztof Ograbek

Alexey Grigorev, agreed. Denis Rothman, this is just amazing. Your background also

Denis Rothman

Thanks. It’s hard work. 😊

Krzysztof Ograbek

Denis Rothman I don’t know how to properly phrase my question, but I hope you’ll know what I mean. How to take advantage of speaking multiple languages? How to use it in NLP field?

Denis Rothman

Mastering more than one language is naturally an asset in NLP. First, it develops better knowledge of language structures and meaning.
When testing translation models, it's a good asset.
However, we cannot master all of the languages we will face, so there is a limit to that asset!

Krzysztof Ograbek

Denis Rothman Could you recommend some good resources for learning linguistics that can be useful in NLP?

Denis Rothman

A good place to start with linguistics is a good elementary or high school book of grammar!
Then an easy book on semantics and phonology /phonetics.
In short, the easier the better to build a solid approach.
Then once that is done, look for a nice book or course you like. 😊

Krzysztof Ograbek

Denis Rothman, thank you again. I found this tutorial on YT; it has 100 videos. Would you be so kind as to skim through the titles and say whether the content makes sense for beginners? I have already learned plenty from it. I just wonder if the direction is right 🙂

Denis Rothman

I see that these videos were designed by Stanford professors. If you find them interesting, then of course you can follow them.
Just keep transformers in mind as well.

Rodney Silva

Denis Rothman What was your reaction the first time you realised how transformers worked? Do you think that you could have invented Transformers?

Denis Rothman

When I first explored the original Transformer model, I was in awe. The small Google team working on this new model was experimenting with all sorts of ways to solve the limitations of RNNs.
It was low-level trial and error. Then they decided to drop the 30+ year old concept of recurrence and replace it with attention!
They also industrialized the layers, which are all the same size, split the layers for parallel processing, and went on like that for everything in the model.
They surprised themselves by outranking everybody on the NLP leaderboards.
I admired this little team within a large corporation. 😊

Rodney Silva

And how about the second question? Denis Rothman

Denis Rothman

Right. Could I have invented transformers?
No. Why?
I don't think of NLP only in terms of statistics. I think transformers are a fantastic evolution but will also, like RNNs, become prehistory as IoT sensors develop and concepts are added.
In my book AI by Example, I teach a CNN how to learn concepts in Chapter 10 (CRL). In Chapter 6, on translations, I introduce symbols into the Google Translate API:
link
Now why do I think this?
I often use the simple word “hello”.
There are billions of interpretations of that word.
Let me give you some.

  1. Person A is sitting in an office working. Person B comes in and says “hello” in a neutral tone.
     Context: B has never said hello to A in 10 years, even when in the same elevator.
     Interpretation by A: Am I going to be fired? What's going on? Etc.
  2. B walks in and mumbles a gloomy hello.
     Context: B is always chirpy, smiling and happy.
     Interpretation by A: Did I do something wrong? Is something wrong with B? What's going on?
  3. B comes in, gets very, very close to A and says “helloooo” like a hungry wolf.
     Interpretation by A: This happens every day. Oops! Is this some kind of harassment? What should I do?
  4. …and so on, to infinity!

Now add more words, situations, body language, cultural habits and emotions!
Humans are very complex machines faced with an infinity of complex situations.
Statistics can help. But more is to come!

Rodney Silva

Denis Rothman What do you think is the most brilliant thing in Transformers: multi head attention or sinusoidal positional encoding?

Denis Rothman

Using attention with simple matmul operations was baffling. Then adding cos/sin positional encoding to the vectors/matrices instead of adding more vectors was brilliant. I enjoy trigonometry, so I liked that.
Both ideas are beautiful. Some recent models train positional encoding separately (DeBERTa, for example).
Everything is brilliant. It's like building a new engine in your garage at home. That's how they did it.
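The “simple matmul operations” can be checked on paper or in a few lines of NumPy; a minimal sketch of single-head scaled dot-product attention with toy shapes and random values:

```python
# Minimal sketch of scaled dot-product attention: nothing but matrix multiplications,
# a scaling factor, and a softmax. Shapes and values are arbitrary toy numbers.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

d_k = 4                                    # key/query dimension (toy value)
Q = np.random.randn(3, d_k)                # 3 query vectors
K = np.random.randn(3, d_k)                # 3 key vectors
V = np.random.randn(3, d_k)                # 3 value vectors

scores = Q @ K.T / np.sqrt(d_k)            # similarity of every query to every key
weights = softmax(scores)                  # attention weights, each row sums to 1
attention_output = weights @ V             # weighted mix of the value vectors
print(attention_output.shape)              # (3, 4)
```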

Lalit Pagaria

Denis Rothman what are faster ways to test which encoding would work best for a task, instead of going through the whole train-test life cycle? Is there at least a fast, intuitive way to try first in order to get quick feedback?

Denis Rothman

The fastest way to choose a model for production is to have learned transformers from A to Z in general, and during that process to select the one you like best.
Once that is done, there are many ways to train and fine-tune models. Have a look at the PET approach, for example, at the beginning of Chapter 6 of the book.
PET is Pattern-Exploiting Training, which processes the data BEFORE training a model. It is like a good teacher who prepares a course well instead of throwing raw information at the students!

To take part in the book of the week event:

  • Register in our Slack
  • Join the #book-of-the-week channel
  • Ask as many questions as you'd like
  • The book authors answer questions from Monday till Thursday
  • On Friday, the authors decide who wins free copies of their book

To see other books, check the book of the week page.

Subscribe to our weekly newsletter and join our Slack.
We'll keep you informed about our events, articles, courses, and everything else happening in the Club.

