Questions and Answers
Hi, my question for the authors is: since there are many tools for NLP, how can we understand in depth how those tools work? Is that also covered in the blueprints?
Also, in the case of different spellings of the same words, and the use of characters such as à, è, ü, ö, and others: is there a tool or method to compare such words and categorize them as the same, without first translating the words into one language and then analysing?
Hi Asmita, our book focuses on best practices for working with the tools that different NLP libraries provide. It still gives some theoretical background on the concepts used, though.
And yes, there is a blueprint for the treatment of different spellings. Just have a look at the notebook for Chapter 4 on GitHub and search for “character normalization”: https://github.com/blueprints-for-text-analytics-python/blueprints-text/blob/master/ch04/Data_Preparation.ipynb
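For anyone curious before opening the notebook: one common approach to this kind of character normalization is to strip diacritics via Unicode decomposition. A minimal sketch using only Python's standard `unicodedata` module (the book's notebook may use a different method or library):

```python
import unicodedata

def normalize_chars(text: str) -> str:
    """Strip diacritics so variants like 'à' and 'a' compare equal."""
    # NFKD decomposition splits each character into its base letter
    # plus separate combining marks (e.g. 'é' -> 'e' + U+0301)
    decomposed = unicodedata.normalize("NFKD", text)
    # Keep the base letters, drop the combining marks
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(normalize_chars("résumé"))  # -> resume
print(normalize_chars("über"))    # -> uber
```

One caveat with this simple approach: it maps 'ü' to 'u', whereas German transliteration conventions would prefer 'ue', so language-specific rules may still be needed.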
Does your book deal with personally identifiable information? In either case, any thoughts on this field, and tips if possible?
Hi Jens Albrecht, Sidharth Ramachandran, and Christian Winkler, thanks for writing the book! Could you tell us which fields you see the most applications of NLP in?
Good morning everybody!
Max Payne: We don’t explicitly address data privacy and anonymization. That could be part of the data acquisition process, but it is very complicated and depends on legislation in different countries. Pseudonymization might be a good idea to enable aggregation later. We have quite a few commercial projects where aggregation in the first step solves these issues - but that of course depends on the use case. Apart from data privacy, copyright law in different countries might also be something worth considering.
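To illustrate the pseudonymization idea mentioned above (a hypothetical sketch, not from the book): replace direct identifiers with a keyed hash, so records can still be aggregated per person without storing names. The key name and management here are made up for the example:

```python
import hashlib
import hmac

# Hypothetical secret key -- in practice, store and rotate this securely
SECRET_KEY = b"rotate-and-store-me-securely"

def pseudonymize(identifier: str) -> str:
    """Deterministically map an identifier to an opaque token.

    The same input always yields the same token, so aggregation
    (e.g. counting records per user) still works, but the original
    name cannot be recovered without the key.
    """
    return hmac.new(SECRET_KEY, identifier.encode("utf-8"),
                    hashlib.sha256).hexdigest()[:16]

records = [("alice", "text A"), ("bob", "text B"), ("alice", "text C")]
tokens = [(pseudonymize(user), text) for user, text in records]
# alice's two records share one token, but the name 'alice' itself is gone
```

Note that whether this counts as sufficient anonymization is exactly the kind of question that, as Christian says, depends on legislation in each country.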
Quynh Le: Very interesting question, this probably depends on the field you are most interested in. If you concentrate on enterprise data, uncovering the structure of large document archives might be exciting. In the social sciences, you can avoid performing surveys by using NLP and user-generated data. Apart from all these applications, the recent achievements in contextualization using transfer learning are really impressive.
Hi Christian Winkler Jens Albrecht Sidharth Ramachandran, What is your opinion on automating NLP processes? Will automating them lead to some percentage of loss when compared to a human validating the process step by step?
Hi Christian Winkler Jens Albrecht Sidharth Ramachandran, thank you for writing this book.
Recently I have been working on some text analytics projects and all of it is very interesting.
The motivation for my question is that we usually see only one side of the problem we are facing: either the negative side or only the positive side. The biggest example is safety analysis, where we only look at accidents that have happened rather than researching the positive aspects of safety, such as campaigns and awareness drives; we never dig deeper into those aspects.
My question is: if I want to layer the findings of my analysis over time, how can I track how a specific topic, say climate, has changed over a period of time in terms of semantics, popularity, understanding, etc., covering both the negatives AND the positives?
Here I am using climate, but we could think of this problem across a very wide spectrum: say, a product, software, a book, an organization, etc.
Hence, is there a way that we could work around something like this?
Thank you in advance.
Thanks for your question, Nikhil Shrestha. You could use semantic embeddings like word2vec to support information retrieval. Another option would be to use something more full-fledged like txtai.
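As a toy illustration of the embedding idea: similar words end up with similar vectors, so cosine similarity can group or retrieve them. The tiny 3-dimensional vectors below are invented for the example; real word2vec embeddings (e.g. loaded via gensim) have hundreds of dimensions:

```python
import math

# Hypothetical hand-made "embeddings" -- real word2vec vectors
# would be learned from a corpus and have 100+ dimensions
embeddings = {
    "climate": [0.9, 0.1, 0.0],
    "weather": [0.8, 0.2, 0.1],
    "finance": [0.0, 0.9, 0.3],
}

def cosine(a, b):
    """Cosine similarity: 1.0 for identical directions, ~0 for unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def most_similar(word):
    """Return the other vocabulary word closest to `word`."""
    query = embeddings[word]
    others = [(w, cosine(query, v)) for w, v in embeddings.items() if w != word]
    return max(others, key=lambda pair: pair[1])[0]

print(most_similar("climate"))  # -> weather
```

For the over-time analysis in the question, one option is to train or apply such embeddings per time slice and compare how a topic's nearest neighbours shift between slices.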
Good question, Asmita! I think it depends on the kind of task you are automating. Nowadays, language models based on transfer learning can achieve human performance (some are even superior to humans). However, irony and sarcasm are still difficult to interpret. In the unsupervised or semi-supervised regime, there is no real loss; it just makes interpretation easier. That’s also what we see in our daily projects - we are using AI more as a tool to supercharge humans.
Thank you Christian Winkler for the swift reply to the question; I will check the resources and update you accordingly. 🙂
Hope that helps, Nikhil Shrestha, and that I haven’t misunderstood your question.
Yes Christian Winkler, it pointed me in a direction which will indeed help.
Thank you for that.
You totally understood the question 🙏🙂
We would be happy to answer further questions. Feel free to also ask questions about NLP in general - it does not have to be directly related to our book.
Any suggestion for a book about self-driving cars? 😁
I think the #books channel would be a better channel for this
But if you already know a book about self-driving cars and would like to invite the authors to this channel, please let me know =)
Of course Alex, thanks for helping 🙏
Hello Christian Winkler Jens Albrecht Sidharth Ramachandran! 👋 Thank you for writing this book full of practical solutions. I just have to ask for your opinion on transformers in NLP (BERT, RoBERTa, GPT-3): advantages and disadvantages. Did you use them?
Thanks for the question Maja. We do make use of the BERT model in one of the chapters of the book. In fact, I’m also seeing transformer-based models used heavily in industry. They work surprisingly well without much preprocessing. They are, however, not as good across languages, but here too there are variants that the community releases online that are helpful. The question of bias in these models must be tackled, though, and it depends on each use case how critical this is.
Thank you so much Sidharth Ramachandran for the answer. I’ll look into it!
Fine-tuning BERT and RoBERTa models often leads to quite similar results. A good starting point is the Hugging Face page with all the models: https://huggingface.co/models
GPT-3 is a different story as it is a commercial model which is mainly used for generating text. Alternatively, you could take a look at GPT-J from EleutherAI.
These models are normally too large to train them on your own or even fine-tune them.
And they are still growing. Last week, NVIDIA and Microsoft announced the “Megatron-Turing Natural Language Generation model” (MT-NLG) with a whopping 530 billion parameters.
Thank you Christian Winkler so much for everything! I’m going to the NVIDIA conference, will start reading your book when I get it, and I have started with Hugging Face.
Thank you Jens Albrecht, Sidharth Ramachandran, and Christian Winkler for your time!
Thank you Jens Albrecht Sidharth Ramachandran Christian Winkler for sharing the knowledge and clearing our doubts.
Thank you so much Jens Albrecht, Sidharth Ramachandran and Christian Winkler for all your guidance and fast replies to our questions!
Thank you Jens Albrecht, Sidharth Ramachandran, Christian Winkler for coming to share about NLP!
Good morning Jonathan Rioux, very interesting book! I have a few beginners questions.
From another beginner: these were really nice to read, so thanks for asking, Tim, and thanks for answering, Jonathan!
- When should I start using Spark? I mean, how large should my dataset be? Does it make sense to start using Spark if my dataset still fits into memory, but I expect the size to increase?
This is an excellent question 🙂 I don’t have a straight answer, but let me share with you the heuristics that I use when deciding for myself.
- PySpark is getting much faster for single-node jobs, so you might be able to have acceptable performance with Spark on a single node right off the bat! See the following link for a discussion about this: https://databricks.com/blog/2021/10/19/introducing-apache-spark-3-2.html
- Koalas was introduced in Spark 3 and merged into pyspark.pandas as of Spark 3.2. Now more than ever, you can convert pandas code to PySpark with a lot less fuss. 🙂
- For memory allocation, I try to have a cluster with enough memory to “store” my data and have enough room for computation. Data grows quite fast, and if you have a feeling that the data source will grow (for instance, historical data), I find it easier to start with PySpark, knowing it’ll scale.
If you need a fast and loose rule for processing data (not counting ML applications), I would say that if you can’t get a single machine with 3-5x the RAM your data sits on, you probably want to reach for Spark, just for comfort.
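Jonathan's rule of thumb, written out as a tiny (hypothetical) helper function, just to make the arithmetic concrete:

```python
def recommend_spark(data_size_gb: float, machine_ram_gb: float,
                    headroom: float = 3.0) -> bool:
    """Follow the 'fast and loose' heuristic: reach for Spark when you
    can't get a single machine with ~3-5x the data size in RAM.

    `headroom` is the multiplier (3.0 is the conservative end of 3-5x).
    """
    return machine_ram_gb < headroom * data_size_gb

# 100 GB of data, biggest available machine has 512 GB RAM -> single node is fine
print(recommend_spark(100, 512))  # -> False
# 100 GB of data, only 128 GB RAM available -> reach for Spark
print(recommend_spark(100, 128))  # -> True
```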
- How much worse is the performance with PySpark if it is used on a small dataset?
I think I replied to this under your previous question 🙂 . The “Spark single-node performance tax” shrunk dramatically with Spark 3.0 and even more with Spark 3.2.
In practice, I find that with very small datasets (a few hundred rows) you can see much worse performance depending on the operations; that being said, it’s often a difference of 0.29 sec vs 0.85 sec, which I am not too concerned about.
- What are the advantages of using Databricks?
Databricks has a lot going for it!
- Databricks provides proprietary performance improvements over open-source Spark so your jobs may run faster with no changes. I am especially excited about Photon (https://databricks.com/product/photon) which takes your Spark data transformation code through a new query engine.
- The notebook experience out of the box is quite good (and I am saying this from the perspective of a person who doesn’t really like notebooks). I like being able to create ad-hoc charts from a result data frame and explore my data right from the same interface.
- Databricks Connect (https://docs.databricks.com/dev-tools/databricks-connect.html ) is the simplest way (to me) to connect my IDE to a remote cluster with the minimum amount of fuss. It can be a little capricious, but when writing the book, I’ve used much worse hacks to connect to a remote REPL with Spark enabled…
- Databricks provides additional capabilities (Delta Lake for data warehousing, MLflow for ML model/experiment management, etc.) which play well with the overall ecosystem.
- The ecosystem is quite consistent across all three major cloud providers (AWS, Azure, GCP), which helps if you’re moving around. :)
- If we want to train ML models using pySpark, does the model have to support distributed training?
Spark’s ML models all work out of the box. They are all listed here: https://spark.apache.org/docs/latest/api/python/reference/pyspark.ml.html
Some algorithms naturally lend themselves better to distributed computing and perform much better (in runtime) than others. Random Forest, for instance, distributes super well; gradient boosting a little less so.
On top of that, you can also use user-defined functions (UDFs) to run single-node models in a distributed fashion (each model runs on a single node here). This allows for parallelizing hyper-parameter selection. I am considering writing an article/doing a video on the topic, as it is quite fun to do!
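A sketch of the idea behind that last point, using a Spark-free stand-in: here Python's concurrent.futures fans the independent trials out across local workers, much as a Spark UDF would fan them out across executors. The scoring function is a made-up placeholder for fitting a real single-node model:

```python
from concurrent.futures import ThreadPoolExecutor

def fit_and_score(max_depth):
    """Hypothetical stand-in for training a single-node model with one
    hyper-parameter setting and returning its validation score."""
    score = 1.0 - abs(max_depth - 6) * 0.05  # toy score that peaks at depth 6
    return max_depth, score

candidate_depths = [2, 4, 6, 8, 10]

# Each trial is independent, so they can run in parallel -- on Spark,
# each candidate would be one row handed to a UDF on some executor
with ThreadPoolExecutor() as pool:
    results = list(pool.map(fit_and_score, candidate_depths))

best_depth, best_score = max(results, key=lambda r: r[1])
print(best_depth)  # -> 6
```

The key property that makes this work, locally or on Spark, is that the trials share no state, so the order and placement of their execution doesn't matter.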
thank you 🙂