Questions and Answers
Good morning Jonathan Rioux, very interesting book! I have a few beginners questions.
from another beginner, these were really nice to read so thanks for asking, Tim, and thanks for answering, Jonathan!
- When should I start using Spark? I mean how large should my dataset be? Does it make sense to start using Spark, if my dataset still fit into memory, but I expect the size to increase?
This is an excellent question 🙂 I don’t have a straight answer, but let me share with you the heuristics that I use when deciding for myself.
- PySpark is getting much faster for single-node jobs, so you might be able to have acceptable performance with Spark on a single node right off the bat! See the following link about a discussion about this. https://databricks.com/blog/2021/10/19/introducing-apache-spark-3-2.html
- Koalas was introduced in Spark 3 and merged into
pyspark.pandasas of Spark 3.2. Now more than ever, you can convert Pandas code to PySpark with a lot less fuss. 🙂
- For memory allocation, I try to have a cluster with enough memory to “store” my data and have enough room for computation. Data grows quite fast, and if you have a feel that the data source will grow (for instance, historical data), I find it easier to start with PySpark, knowing it’ll scale.
If you need a fast an loose rule for processing data (not counting ML applications), I would say that if you can’t get a single machine with 3-5x the RAM your data sits on, you probably want to reach for Spark, just for comfort.
- How much worse is the performance with pySpark if it used on a small dataset.
I think I replied on your previous question 🙂 . The “Spark single-node performance tax” shrunk dramatically since Spark 3.0 and even more since Spark 3.2.
In practice, I find that with very small data sets (a handful of hundred of rows) you will have much worse performance depending on the operations: that being said, it’s often a difference of 0.29 sec vs 0.85 sec which I am not too concerned about.
- What are the advantages of using databriks?
Databricks has many things for itself!
- Databricks provides proprietary performance improvements over open-source Spark so your jobs may run faster with no changes. I am especially excited about Photon (https://databricks.com/product/photon) which takes your Spark data transformation code through a new query engine.
- The notebook experience out of the box is quite good (and I am saying this from the perspective of a person who doesn’t really like notebooks). I like being able to create ad-hoc charts from a result data frame and explore my data right from the same interface.
- Databricks connect (https://docs.databricks.com/dev-tools/databricks-connect.html ) is the simplest (to me) way to connect my IDE on a remote cluster with the minimum amount of fuss. It can be a little capricious, but when writing the book, I’ve used much worse hacks to connect to a remote REPL with Spark enabled…
- Databricks provides additional capabilities (Delta Lake for data warehousing, MLFlow of ML model/experiments management, etc.) which play well with the overall ecosystem.
- The ecosystem is quite consistent around all three major cloud providers (AWS, Azure, GCP), which help if you’re moving around. :)
- If we want to train ML models using pySpark, does the model have to support distributed training?
Spark’s ML model collection all work out of the box. They are all listed here: https://spark.apache.org/docs/latest/api/python/reference/pyspark.ml.html.
Some algorithms naturally lend themselves better to distributed computing and perform (runtime) much better than other. Random Forest for instance distributes super well, GradientBoosting a little less so.
On top of that, you can also use user-defined functions (UDF) to run single-node models in a distributed fashion (the model would run on a single node here). This allows for parallelizing hyper-parameter selection. I am considering to write an article/do a video on the topic as it is quite fun to do!
thank you 🙂
Thank you 🙂
Hello Jonathan, just the kind of book I am interested to read further on!
I have been using Python (pandas/sklearn) for small-medium data related tasks (analysis / ML) and it is the only programming language I have some amount of exposure to.
In order to get started and work further with PySpark, do I need to have some prerequisites in understanding the spark architecture in its native implementation for working with big data?
Also, what is the approach to be taken to work with the scala/java implementation in case the switch has to be made?
TL;DR: I wrote my book for someone like you! 🙂
You don’t have to use Java/Scala to use PySpark, but you are right in saying that the architecture and code approach can be different.
In Part 1 of my book, I discuss about how you can change your way on thinking about transformations vs. actions. Later, in Chapter 11, I discuss about narrow vs. wide transformations to help with understanding why the data needs to move from one node to another.
Everything that you’ve learned so far still carries to the big data world, fortunately 🙂 With my students and employees, the biggest “click” happens when they understand data locality (which data points need to be where in order for an operation to succeed) and they learn to read query plans (Chapter 11 as well) and structure code in a readable fashion.
If you really need to use Java, my now friend Jean-George Perrin wrote Spark In Action from the same publisher and with a more data engineering point of view. You’ll find that the Spark API looks the same regardless of the language.
Hi Jonathan, books seems great. Here are my questions:
- What’s the data type you struggle the most? And the one you have more fun with?
- What is the most common error/assumption people do when using PySpark and general ML pipelines?
- Best tip to make training more efficient?
Thank you for your time!
What’s the data type you struggle the most? And the one you have more fun with?
My favourite by far is multi-dimensional/hierarchical/document data (think JSON). I don’t know why, but having that data model within the data frame (highly performance nested data frames) is awesome. It sometimes also feels like a puzzle to extract and work with the data the best way and I like a challenge 😄.
Binary data in Spark can be a little bit of a pain (think audio, video, images, etc.). Most of the work revolves around treating everything like an independent unit (you could even use a RDD for this). I like doing ML on unstructured data, but prepping in a distributed fashion can sometimes feel a little bit of a pain.
What is the most common error/assumption people do when using PySpark and general ML pipelines?
Easy 🙂 Being too precious with their data pipelines/not planning and iterating. Most data scientists I encountered are very hesitant about training a model. I usually go straight with a simple pipeline that takes the features ready to go and use a simple model. This
- Serves me as an early benchmark.
- Removes some of the magic of the model and helps me destress about metric anxiety.
- Ensures I have a complete modelling pipeline working!
ML pipelines are super cool because you build the components independently of their application. It makes it easy to add/remove some as you go. Because of this, you are responsible to strike the right balance of planning and flexibility. I keep a notebook of what I want to do, what I expect as results, and I build my code as I go.
(Plus, if your model takes time, you can work on the next iteration while it fits!)
Best tip to make training more efficient?
Do you mean “speed up PySpark ML training”?
- Read the SparkUI when working with ML model. It’ll help see the order of operations and review if you are spilling a lot of data.
- Sometimes, you can get a whole lot more performant training by saving the data frame in Parquet on disk and reading it again. It’s like a super-cache that clears the previous operations.
- Use enough memory and compute. Remember that Spark uses memory for both storage of the data and compute (+ the rest that makes a computer work). Don’t skimp on memory, even if it means using a smaller cluster for data transformation and then going all ham on ML.
Hi Jonathan Rioux, I have been working on simple projects as of now. Will this book be a good start for learning about PySpark? Also, are there any specific hardware requirements for running?
I read your message about transfirmations vs actions, but is it useful for someone who is leaarning from scratch and has no idea abojt PySpark and its workings?
If you know some Python, then my book will (I hope!) be useful yes. 🙂 It is to get started with PySpark.
I recommend a computer with at least 8GB of RAM (if you work locally) or access to a cloud provider Spark (there are some free offering, such as Databricks community edition).
Transformations vs. Actions is a pretty important concept in understanding how Spark processes data. It is explained right off the bat in Chapter 1, so you’re not expected to know anything about it first.
That’s amazing! Thank you for answering my question!
Hi Jonathan Rioux , thank you for doing this Q&A
1# How do you estimate the resources needed for a spark job? Is there a tool out there which looks at the volume of data with the spark code to determine the number of executors, cpu & memory needed? Or is it some rule of thumb and repeated iteration to tune out the best config?
#1: As far as I am concerned, no such thing exist at the time…
Size/volume of the data is important (both disk space and memory, because of potential spills), but also the type of operations you plan on doing. If the data can be logically represented by small-ish (single or a few nodes), your Spark code might need less resources because there will be less merry-ing around of the data.
As you write your own programs, and use the Spark UI to review the resources used by the cluster, you’ll be able to adjust as needed. I remember getting a little frustrated at first because it can look like a guessing game, but you end up building your own mental model for data usage.
As an example, for time series (~14 TB of data) using ML modelling and heavy feature engineering, I used comfortably 25 machines x 60 GB of RAM. Some spill, but it was good enough for me.
2# In spark local mode vs cluster mode (YARN) what are the advantages in terms of my job performance? Other than:
a. Data locality. If my RDD partition and HDFS block is on same node hence no shuffle
b. Reliability. If my local executor is flaky, then my whole job performance/reliability suffers. YARN -> multiple machine hence flaky node can be mitigated.
c. Vertical scaling limit. If I need very high memory, CPU counts which the cloud provider doesn’t offer on single instance.
If I manually set num of partitions (via coalesce or repartition) in local mode I can make multiple tasks executing in parallel? Wouldn’t this be faster than cluster mode?
I think you hit the nail on the head. Spark in cluster mode (whether YARN or k8s, I think the other ones are not used much…), you gain the ability to scale beyond a single machine.
coalesce() is, to me, mostly useful when you want to limit the number of partitions (for instance, before writing to disk. For “short lived” data programs, I usually don’t use it much as Spark is pretty smart about reshuffling data in the best way possible. When using older version of Spark, it helped, but since Spark 3.X I’ve been pretty keen on relying on the default behaviour 🙂
The whole essence of Spark is its ability to scale horizontally. If you have a single beefy machine, you might yield better performance from another tool (although Spark single node is pretty speedy, see my previous answers).
3# How significant is the overhead between using Scala for spark versus Python for spark (since Python data structures have to be converted to Scala datatypes) ? Is it worthwhile to convert frequently run Pyspark in production to Scala spark for performance & better resource utilisation?
When using RDD and regular Python UDF, you’ll pay a pretty significant performance tax from serialization/deserialization (serde) from Spark’s data model to something Python can consume.
The data frame API, even in Python, maps to JVM instructions and performance is quite similar to Spark in Java/Scala.
Now, with Pandas UDF and Arrow serialization, you can use Python/Pandas code within PySpark with minimal overhead. Python will not always be as fast as Java/Scala (Pandas/NumPy owe much of their speed to highly optimized low-level code), but the gap is narrowing.
4# Do you think serverless Spark (https://cloud.google.com/solutions/spark) is going to be the next big thing for data analysis & ML similar to managed Kubernetes engine? Do most of the orgs in your experience, run private Spark clusters or use some cloud offering (DataStax, Databricks)
I find the expression “serverless Spark” quite hillarious 😄 because you always are concerned about the servers being used. Most serverless Spark offering rebrand the compute+memory units, which is counter productive as most Spark users understand RAM quantities pretty well.
I think that managed Spark is a pretty seducing option for data analysis and data science. Databricks is the dominating force here, but I’ve seen some great success/used other managed Spark product.
Looking at the link you provided, I think it’s a pretty seducing offering for the cloud provider because you give them the keys to scale for you. In practice, your IT organization might set pretty stiff limitations there. I also have not seen auto-provisioning that was so fast that I was convinced to move away from “provisioning the cluster myself”.
Jonathan Rioux Thank you for all the super detailed answers. 🙂
My pleasure 🙂 Thanks for the excellent questions!
As for memory usage, especially when I get to scale a cluster, I use this image to remind me of the cluster memory vs. the available memory. On top of this, remember that Spark will spill to disk (which is slower, but not catastrophic).
When scaling a cluster, I found that most people try to skimp on memory. If you plan on using TBs of data, it’s not silly to have a lot of RAM (depending on the operations/transformation/modelling), you won’t have fun if you total 500GB of RAM across your cluster.
Hi Jonathan Rioux
- Which is the best tool to use pyspark for creating eda and data visualization? I usually use zeppelin to do this.
- Is there any difference in terms of resources utilisation or speed between using scala-spark and pyspark?
#1 : Within databricks, you can create charts from data frames pretty easily. It uses plot.ly in the back-end. PySpark also has the
describe() methods you can use for diagnosing the data (at a high-level) within columns.
I haven’t used Zeppelin in a long time (and only for the notebook interface) but Jupyter (Spark open-source) or databricks notebook (databricks) should provide the same functionality.
#2 : With the data frame, PySpark and Spark are pretty similar, under the hood, PySpark will call the same JVM methods as Spark/Java. In some cases (for instance, when using the RDD or UDF), you will have differences (because you’re using plain Python/Pandas code and not the optimized Spark routines).
If you are more comfortable using Python, PySpark will serve you quite well!
- How efficient is it to use spark UDFs to perform data processing?
I distinguish two types of UDF
Python UDF are used with the
udf() function/decorator. They merry the data from Spark to Python and then use Python to process the record. Those UDF are usually slower because of the data serialization/deserialization and also because Python can be slower than Scala/Java.
Pandas UDF are used with the
pandas_udf() function/decorator and are much faster than Python UDF. They use Arrow for converting the data from Spark to Pandas. Furthermore, Pandas can be very fast when using vectorized instructions.
UDF are useful when you need to do something that is hard to do with the PySpark API. See them as a specialized tool that unlock additional capabilities, not as a replacement for Spark’s core API.
In terms of efficiency, I read articles claiming that Pandas UDF can sometime be more efficient than Spark core data transformation API (!). I have not seen this in practice, usually witnessing a small slowdown (going through a few examples I wrote for fun — don’t take this as an official benchmark! — I see anything from 0.9-10.2x. I still use them quite a bit for my own data analysis and modelling. :)
Additionally: I often use UDF / Pandas UDF for promoting local Python/Pandas code to PySpark. It allows to scale the code (although more slowly) without needing a re-write. Bonus for me since I can use the old “promoted function” as test when rewriting in “native” Spark. 🙂
I remember a few years ago Apache Zeppelin was getting some traction as a nice and easy way to do analytics with Spark. But I haven’t heard about it for a while. Is it still used? What do you think about it?
I’ve seen in a thread that you mentioned that Jupyter has the same functionality. So it means people prefer to use it more often than Zeppelin?
People in our org use both the ways (Zeppelin + Scala OSS Spark, Jupyter + PySpark) . Personally feel, Zeppelin is one of the best integrations out there if you want to use Scala + OSS Spark with in built viz, versioning, easy configuration of spark (mem, cpu, spark mode, reuse same interpreter across users, user access controls)
If you need such features from Spark + Jupyter, I think you need to manually configure so many things which is a pain.
I have not used Zeppelin in a very long time (~2 years) so my opinions are tainted a little.
Most cloud offering has Jupyter integration baked in which is ok. Databricks still has IMHO the best notebook integration. I think that Zeppelin/Livy had the option to work cross-languages for the longest time, but it’s not something I used extensively.
In the end, I think that comfort level and habit are quite important. There can be a little bit of tool fatigue at times 🙂 But WingCode is right that something configuration/configurability can be a pain.
Another drawback of zeppelin is it gets slow in rendering when creating visualisation
Hi Jonathan Rioux,
I had few more questions 🙂
#1 Is there any advantages using HDFS 3.X over HDFS 2.X with Spark?
I think that Spark uses Hadoop only for storage, so if Hadoop 3.0 is faster, all the better. In practice, I’ve used Spark mostly in a cloud setting (GCS, S3, Blob Storage, DBFS, BigQuery, etc.): it really boils down to where your data is and how you get Spark provisioned.
This might be a little out of my wheelhouse but I hope it answered your question!
Yes answered my question 🙂 Thank you Jonathan
#2 Do you recommend any other profiling tool for OOS spark jobs other than the information we get through spark Web UI ?
I’m not sure if it’s available standalone still anymore, but I was impressed by the work on the Spark UI the folks at data mechanics (https://www.datamechanics.co/) did. This is the only other product I’ve used and can comment on.
Update: https://www.datamechanics.co/delight ← there you go, it’s over there :)
This is a cool UI, unfortunately the dashboard server is not open-source. Thank you for the link Jonathan 🙂
#3 Do you think users of Spark need to have some basic understanding of the internals to use it better compared to other data processing tools like Pandas, Numpy?
I think that every library user should read the documentation/data model of what they use 😉
Spark has a pretty straightforward data model abstraction (the data frame is easy to understand, yet very flexible). For starters, understanding this is enough.
As you go along, it’ll be insightful to gain deeper knowledge of how and where the data is processed, partitions, skew, etc. as it’ll help profile and reason about your programs. It’s not necessary (IMHO) at the start, but becomes important if you want to really grok Spark.
Alexey Grigorev I am seeing some wonderful discussions happening here, wondering if these discussions are saved in a separate document for future reference?
This is how it looks for previous books - https://datatalks.club/books/20210517-grokking-deep-reinforcement-learning.html
https://datatalks.club/books/20211011-mastering-transformers.html here I didn’t see anything hence asked.
I’ll publish the answers eventually. It takes some time, that’s why I batch-process it =)
Oh okay got it thank you so much for this, is there a way we can contribute like in form of financial donations? You are really doing great work!!
Thanks for asking! Yes there’s a way to do it: https://github.com/sponsors/alexeygrigorev
Out of curiosity, is this automatable ?
I have a script that generates a yaml with messages. It automates like 90% of the work.
The part that’s not automated yet is
- Selecting the week
- Cleaning the responses a bit, like removing the announcements
- Putting the yaml to the website
now wondering if the rest of the 10% is doable with github actions 😂
They probably are, I’m sure it’s automatable. Maybe the second one is a bit tricky though
Second one good problem for zoomcamp 🙂
Anyway please for rest of the tasks (at least coding not ML) if you hands let me know
Hi Jonathan Rioux . I have a question about dependency management for PySpark. When we need a library for a PySpark Job, then we may 1) install the library on all nodes or 2) submit all required library with
--py-files option. I do not think that these options are realistic if the dependency is quite large. (e.g. install/update
spacy. It requires around 30 libraries.)
My question is: what is the best approach to submitting a PySpark Job requiring a large dependency?
(If we use Scala, we do not have such a problem, because we can pack all needed packages in a JAR file.)
This is such a good question!
--py-files IIRC does not work with libraries that have C/C++ code, such as
It boils down to which environment you are using. Are you on a cloud installation (databricks, EMR, Glue, HDInsights, Dataproc, etc.) ? If this is the case, you can specify and install dependencies at cluster creation time or runtime. This also has the advantage that you can “version” the dependencies of your cluster. I used dataproc (GCP) for a pretty significant project a few years ago and this is the route we took. Databricks has
dbutils where you can install libraries on the cluster with a simple command from the notebook.
(Databricks, on top of this, has runtime versions that has predictible libraries and versions. I am a big fan of those, just to avoid thinking about which one to use 😄 see for example spacy 3.1.2 on databricks runtine 10.0ml: https://docs.databricks.com/release-notes/runtime/10.0ml.html)
If you are “on-prem” or the cluster is not meant to be ephemeral, you need to be a little more careful with dependencies management. Again, this is product dependent (what do you use to manage your hardware provisioning) but I would assume that orchestration tools can help with this. I think that, in this case, you need to be more conservative with your dependencies to avoid some clash if multiple users are there.
This is one of the cases where I envy the JVM/jar world… 🙂
Another example: installing dependencies on GCP dataproc.
> I would assume that orchestration tools can help with this.
Thanks for your answer! I did not have this viewpoint.