Questions and Answers
Hello, and thank you Savas Yıldırım and Meysam Asgari-chenaghlu for doing this.
How long until you see transformers replaced with a new architecture or technique?
Hi Matthew Emerick
Two really tough questions 🙂. Thanks a lot. I think we will continue with transformer architectures for 2-3 years. However, alternatives are being developed for many of their sub-parts. For example, the attention layer is the one that creates the most complexity, and sparsification is being employed to ease this memory and computational burden. It will sound like an advertisement, but we discuss these clearly in Chapter 8, The Efficient Transformer (pruning, quantization, etc.).
In some studies the tokenization part is completely removed (see ByT5), while in others the attention part is removed or changed (FNet: Mixing Tokens with Fourier Transforms). It is said that only the feed-forward neural net will remain in the final stage. But it is too early to say whether these approaches will be successful and used in industry at scale.
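To make the complexity point concrete: full self-attention scores every token against every other token, so its cost grows quadratically with input length, while windowed (sparse) variants grow roughly linearly. A toy operation count in plain Python, purely illustrative (no real model involved):

```python
def attention_score_count(seq_len: int) -> int:
    # Full self-attention compares every query position with every
    # key position, so the score matrix has seq_len**2 entries.
    return seq_len * seq_len

def sparse_window_score_count(seq_len: int, window: int) -> int:
    # Many "efficient transformer" variants (Longformer-style local
    # attention, for example) only score a fixed window around each
    # token, which makes the cost roughly linear in sequence length.
    count = 0
    for i in range(seq_len):
        lo = max(0, i - window)
        hi = min(seq_len, i + window + 1)
        count += hi - lo
    return count

# Doubling the input length quadruples full attention,
# but only roughly doubles windowed attention.
print(attention_score_count(512), attention_score_count(1024))
print(sparse_window_score_count(512, 64), sparse_window_score_count(1024, 64))
```

The window size 64 here is arbitrary; real models pick it per task, and usually add a few global tokens on top of the local window.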
I think I answered both questions in one shot :)
Do you foresee a method to reduce the costs of using transformers?
Very excited to see this book come through! Savas Yıldırım and Meysam Asgari-chenaghlu, I think it’s fun to explore everything transformers are doing so well at and the novel ways they are being applied. I’m always curious about the boundary conditions of various approaches. What are some things that transformers are not great at yet and alternative methods are recommended?
Thanks David Cox for the questions.
Let me try to say the first answer that comes to my mind. Maybe Meysam Asgari-chenaghlu will add something.
Attention can only work with fixed-length text sequences, which is a really important limitation, whereas traditional RNN-like models can cope with variable lengths and are still in use. Texts therefore need to be split so that they fit the Transformer's input size. Likewise, document segmentation is a second issue: if a sentence is split in the middle, we suddenly lose a significant amount of context. Even though Transformer-XL-like models can handle this, we still face such problems in the field.
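The usual workaround for the splitting problem is a sliding window with overlap, so a sentence cut at a boundary still appears whole in the next window. A toy sketch in plain Python (the function name, word-level "tokens", and the numbers are illustrative; real code would use a subword tokenizer, e.g. with `return_overflowing_tokens` in the transformers library):

```python
def chunk_tokens(tokens, max_len=512, stride=128):
    """Split a long token sequence into overlapping windows.

    The overlap (stride) keeps some context around each cut, so a
    sentence split at a window boundary is not lost entirely.
    """
    if len(tokens) <= max_len:
        return [tokens]
    chunks = []
    step = max_len - stride  # how far each window advances
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # the last window already reached the end
    return chunks

# Toy example: 1000 word-level "tokens".
tokens = [f"tok{i}" for i in range(1000)]
windows = chunk_tokens(tokens, max_len=512, stride=128)
```

Each window then gets its own forward pass, and the per-window predictions are merged (e.g. averaged) afterwards.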
Thank you Savas Yıldırım!! I really appreciate this answer.
Hello Savas Yıldırım, Meysam Asgari-chenaghlu. What level of proficiency in ML / Engg is expected from the reader here?
What are some of the most creative uses of Transformers you’ve seen?
Hi xnot, thank you for the questions. Our expectation is that the reader has at least some experience in machine learning & AI and its culture, and a good programming background.
It has many effective aspects. What surprises me the most, and comes to my mind first, is that it successfully processes two sentences at the same time, which is hardly handled with traditional approaches, and can even encode cross-sentence anaphoric (reference) relations thanks to contextual word embeddings, which in turn solves many tough NLP problems with stunning quality.
Hi Savas Yıldırım and Meysam Asgari-chenaghlu.
1) Transformer was a breakthrough. Then a Reformer is proposed. What is ahead?
2) Transformer was adopted for computer vision. Do you think it will have the same success as in NLP?
Hi Dr Abdulrahman Baqais. Thank you for your challenging questions.
1) There are some ideas to radically change the attention and tokenization mechanisms. I have seen many articles in this direction, but it will take time for them to reach industry. Nowadays we see different models in addition to Reformer, mostly called efficient transformers, to name a few: BigBird, Reformer, Performer, Linformer, Longformer, and so on. They mostly address the computation and memory efficiency of the architecture.
2) It has been mentioned in a few articles that the transformer surpasses other (CNN-based) architectures in image processing and signal processing, but again, this may change over time.
There are models published for audio processing and CV on the HuggingFace Hub and they seem to be very successful. Please check!
Hi Savas Yıldırım and Meysam Asgari-chenaghlu. I have a few questions for you guys 🙂
- Which library is used in the book, PyTorch or TensorFlow?
- Do you think learning about transformers alone is enough? Don't we also need knowledge about vectorizers and embeddings? I see people who skip linguistics and machine learning fundamentals and jump directly to advanced topics such as applying or studying SOTA models.
Hi State Of The Art
Thanks for the question
- Transformer models discussed in the book can be based on either framework, thanks to the HuggingFace transformers library. In addition to these libraries, many other important ones have been utilized whenever needed, such as sentence-transformers, flair, BertViz, etc.
- It’s not enough 🙂 We are proceeding by putting the basic building blocks on top of each other. To develop a good model, it is necessary to know all the building blocks.
We need to be familiar (or even more than familiar) with deep learning approaches from a single perceptron up to GPT-3 (175B parameters). We cannot produce new models or build effective SOTA models without understanding the transformation that AI has gone through.
Still, a lot of work can be done without knowing them in detail. However, you can hit a wall on creative and challenging problems.
Meanwhile, we have become able to solve hard problems that used to be called unsolvable with 2-3 lines of Python code.
Thank you for the amazing reply. 🙂
Hi Savas Yıldırım, I am a total NLP and transformer newb. I hope you do not mind my simple questions.
Hi Tim Becker thank you for your interest.
1) To put it simply, this architecture effectively represents a sentence using contextual word embeddings in a feed-forward neural net, and it also allows transfer learning, letting us solve many NLP problems even with a very small labeled dataset.
2) In transformers, we use the self-attention mechanism when representing a word or a sentence. In NLP, both in traditional and deep learning frameworks, we use the neighboring words around a word to represent it. The attention mechanism, however, selectively weights some surrounding words rather than using all neighbors equally. We repeat this for each word and finally create contextual word embeddings and, eventually, a sentence embedding.
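The "weighted mix of neighbors" idea can be shown in a deliberately simplified sketch of dot-product self-attention. This toy version omits the learned query/key/value projections, scaling, and multiple heads that a real transformer has:

```python
import math

def softmax(xs):
    # Numerically stable softmax: weights are positive and sum to 1.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(embeddings):
    """Contextualize each word vector as a weighted sum of all vectors.

    Weights come from dot-product similarity plus softmax, so each
    word 'selectively' attends to its neighbors instead of averaging
    them uniformly.
    """
    contextual = []
    for q in embeddings:
        # Similarity of this word to every word (including itself).
        scores = [sum(a * b for a, b in zip(q, k)) for k in embeddings]
        weights = softmax(scores)
        # Weighted sum of all word vectors, dimension by dimension.
        mixed = [
            sum(w * v[d] for w, v in zip(weights, embeddings))
            for d in range(len(q))
        ]
        contextual.append(mixed)
    return contextual

# Three toy 2-d word vectors; the output has the same shape, but each
# vector now blends information from the whole "sentence".
vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = self_attention(vecs)
```

Because the weights sum to 1, every output vector is a convex combination of the inputs; the learned projections in a real model decide what "similar" means.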
3) To start simply, you can use the pipeline module as follows:
from transformers import pipeline
summarization = pipeline("summarization")
If you want to get your hands dirty and have a labeled sentiment dataset in any domain, you can repeat our fine-tuning code on GitHub with your own project.
- Could you explain the concept of what a transformer actually is? 😅
- What is the attention mechanism?
- Do you have a toy project in mind that would be good to get started with transformers?
Hi Savas Yıldırım,
Could you please elaborate on your sentence in Chapter 1 (Transformers section), where you were referring to tokenization schemes?
universal text-compression scheme to prevent unseen tokens on the input side
I mean what does tokenization have to do with compression here?
Hi Max Payne,
Actually, it means that most tokenization schemes (whitespace-based and related ones) cannot deal with the out-of-vocabulary (OoV) problem, whereas BPE and related schemes can. The main idea behind using subwords or byte pairs comes from text compression: finding the most common byte pairs in huge data and then mapping them to smaller representations, such as a single byte, makes the compression more effective. The same idea is used in transformer tokenizers to make tokenization more robust. I have seen other articles that use variations of this method.
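One merge step of the BPE idea can be sketched in a few lines. This toy version (corpus and frequencies follow the style of the Sennrich et al. example; real BPE implementations handle symbol boundaries and end-of-word markers more carefully) counts adjacent symbol pairs and merges the most frequent one:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across a corpus of segmented words."""
    pairs = Counter()
    for word, freq in words.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of the pair with a single merged symbol.

    Note: naive string replace is fine here because all symbols are
    single characters; a robust implementation must respect symbol
    boundaries once multi-character symbols appear.
    """
    a, b = pair
    return {w.replace(f"{a} {b}", f"{a}{b}"): f for w, f in words.items()}

# Toy corpus: words pre-split into characters, with frequencies.
corpus = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
pair = most_frequent_pair(corpus)  # the most frequent adjacent pair
corpus = merge_pair(corpus, pair)  # one BPE merge step
```

Repeating this loop a fixed number of times yields the merge table; at tokenization time, unseen words are decomposed into these learned subwords instead of becoming OoV tokens.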
Philip Gage, "A New Algorithm for Data Compression", Dr. Dobb's Journal.
Sennrich, R., Haddow, B., & Birch, A. (2015). Neural Machine Translation of Rare Words with Subword Units. arXiv preprint arXiv:1508.07909.
Thanks. I was going through the book and found it a great read.
Please correct me if I am wrong.
In, say, a CNN, when we use transfer learning, like VGG16 on a cat-dog problem, we generally remove and then add (or modify) the last layer AND FREEZE ALL THE PREVIOUS LAYERS. But while training transformers, in the notebooks I came across, I haven't seen people freeze all but the last layers; in fact, they just train all the layers. Is there any wisdom behind this or is it technically incorrect (since the HF model has already been pre-trained on a corpus)? What would you recommend?
Hi Max Payne, transformers are trained in an end-to-end fashion, contrary to CNN architectures. We mostly add a thin task-specific layer on top of the Transformer, and freezing would deteriorate the system's performance. You can simply try it by setting bert_model.trainable=False, or by freezing some specific layers (0..12). However, please keep in mind that each layer has its own characteristics: some layers (mostly the later ones) encode semantic information, while others (mostly the initial layers) encode syntactic information. For example, some are experts in reference relations (he -> John). Therefore, you can decide which layers to freeze according to the downstream task. Once again, freezing is not used much in regular transformer model training; it is worth discussing why.
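A minimal sketch of the partial-freezing idea, using plain torch.nn layers as a hypothetical stand-in for a pretrained encoder (with a real HuggingFace model you would iterate over its actual layer list, e.g. model.bert.encoder.layer, instead):

```python
from torch import nn

# Hypothetical stand-in for a pretrained encoder: a stack of 4 layers
# plus a thin task-specific head, mimicking the usual transformer setup.
encoder = nn.ModuleList([nn.Linear(8, 8) for _ in range(4)])
head = nn.Linear(8, 2)

# Freeze only the first two layers (in BERT lore, early layers lean
# syntactic, later ones semantic), leaving the rest trainable.
for layer in encoder[:2]:
    for param in layer.parameters():
        param.requires_grad = False

# The optimizer should then only receive the still-trainable parameters.
frozen = [p for layer in encoder for p in layer.parameters()
          if not p.requires_grad]
trainable = [p for m in [*encoder, head] for p in m.parameters()
             if p.requires_grad]
```

Gradients simply stop flowing into the frozen parameters; everything else trains end-to-end as usual.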
Hi Savas Yıldırım and Meysam Asgari-chenaghlu! Thanks for your replies to the questions. I also have one.
As far as I know, today all state-of-the-art language models based on transformers (like GPT, Megatron, etc.) try to improve their performance by increasing the number of parameters and the computational power that goes with it. Do you think this trend will continue? Or have other ways already been developed?
Hi Timur Kamaliev, thank you for your question. The contribution of such large models in terms of accuracy is around 2% at most. However, it is important for us that models be lighter and more accurate; that is vital especially for using them on edge devices and other sorts of limited hardware. Therefore, models such as efficient transformers focus more on the speed and memory efficiency of the model.
Another important problem in Transformers is the input size (e.g. 512 tokens). Increasing it is more vital than increasing the parameters at the moment. So what we are looking for is the ability to work with longer inputs and to produce lighter but still successful models. The current race, frankly, sounds like a parameter show to me.