
DataTalks.Club

Data Analysis with Python and PySpark

by Jonathan Rioux

The book of the week from 25 Oct 2021 to 29 Oct 2021

When it comes to data analytics, it pays to think big. PySpark blends the powerful Spark big data processing engine with the Python programming language to provide a data analysis platform that can scale up for nearly any task. Data Analysis with Python and PySpark is your guide to delivering successful Python-driven data projects.

Packed with relevant examples and essential techniques, this practical book teaches you to build lightning-fast pipelines for reporting, machine learning, and other data-centric tasks. No previous knowledge of Spark is required.

Questions and Answers

Tim Becker

Good morning Jonathan Rioux, very interesting book! I have a few beginner questions.

Mansi Parikh

From another beginner: these were really nice to read, so thanks for asking, Tim, and thanks for answering, Jonathan!

Tim Becker
  • When should I start using Spark? I mean, how large should my dataset be? Does it make sense to start using Spark if my dataset still fits into memory, but I expect its size to increase?
Jonathan Rioux

Hi Tim!
This is an excellent question 🙂 I don't have a straight answer, but let me share with you the heuristics that I use when deciding for myself.

  1. PySpark is getting much faster for single-node jobs, so you might be able to get acceptable performance with Spark on a single node right off the bat! See the following link for a discussion of this: https://databricks.com/blog/2021/10/19/introducing-apache-spark-3-2.html
  2. Koalas was merged into PySpark as pyspark.pandas as of Spark 3.2. Now more than ever, you can convert pandas code to PySpark with a lot less fuss (see the short sketch after this list). 🙂
Jonathan Rioux
  3. For memory allocation, I try to have a cluster with enough memory to "store" my data and still have enough room for computation. Data grows quite fast, and if you have a feeling that the data source will grow (for instance, historical data), I find it easier to start with PySpark, knowing it'll scale.
    If you need a fast and loose rule for processing data (not counting ML applications), I would say that if you can't get a single machine with 3-5x the RAM your data sits on, you probably want to reach for Spark, just for comfort.
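Here is a minimal sketch of what point 2 refers to: running pandas-style code on Spark through pyspark.pandas (Spark 3.2+). The file path and column names are made up for illustration.

    # Minimal sketch: pandas-style code running on Spark (requires Spark >= 3.2).
    # The file path and column names below are made up for illustration.
    import pyspark.pandas as ps

    psdf = ps.read_csv("/data/sales.csv")            # pandas-like API, Spark execution
    monthly = psdf.groupby("month")["amount"].sum()
    print(monthly.sort_index().head(12))

    sdf = psdf.to_spark()                            # drop down to the regular PySpark API if needed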
Tim Becker
  • How much worse is the performance of PySpark if it is used on a small dataset?
Jonathan Rioux

I think I replied to this in your previous question 🙂 The "Spark single-node performance tax" has shrunk dramatically since Spark 3.0, and even more since Spark 3.2.
https://databricks.com/blog/2021/10/19/introducing-apache-spark-3-2.html
In practice, I find that with very small data sets (a few hundred rows) you will have much worse performance depending on the operations. That being said, it's often a difference of 0.29 sec vs 0.85 sec, which I am not too concerned about.

Tim Becker
  • What are the advantages of using Databricks?
Jonathan Rioux

Databricks has a lot going for it!

  1. Databricks provides proprietary performance improvements over open-source Spark, so your jobs may run faster with no changes. I am especially excited about Photon (https://databricks.com/product/photon), which runs your Spark data transformation code through a new query engine.
  2. The notebook experience out of the box is quite good (and I am saying this from the perspective of a person who doesn't really like notebooks). I like being able to create ad-hoc charts from a result data frame and explore my data right from the same interface.
  3. Databricks Connect (https://docs.databricks.com/dev-tools/databricks-connect.html) is the simplest way (to me) to connect my IDE to a remote cluster with the minimum amount of fuss. It can be a little capricious, but when writing the book, I've used much worse hacks to connect to a remote REPL with Spark enabled...
  4. Databricks provides additional capabilities (Delta Lake for data warehousing, MLflow for ML model/experiment management, etc.) which play well with the overall ecosystem.
  5. The ecosystem is quite consistent across all three major cloud providers (AWS, Azure, GCP), which helps if you're moving around. :)
Tim Becker
  • If we want to train ML models using PySpark, does the model have to support distributed training?
Jonathan Rioux

Spark's ML model collection all works out of the box. They are all listed here: https://spark.apache.org/docs/latest/api/python/reference/pyspark.ml.html.
Some algorithms naturally lend themselves better to distributed computing and perform (runtime-wise) much better than others. Random Forest, for instance, distributes super well; GradientBoosting a little less so.
On top of that, you can also use user-defined functions (UDFs) to run single-node models in a distributed fashion (each model would run on a single node here). This allows for parallelizing hyper-parameter selection. I am considering writing an article/doing a video on the topic as it is quite fun to do!
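A rough sketch of that pattern (not the author's code): replicate the training set once per candidate hyper-parameter value and let a grouped pandas UDF train one single-node scikit-learn model per group. The column names and toy data are assumptions for illustration.

    # Rough sketch: parallel hyper-parameter search with a grouped pandas UDF.
    import pandas as pd
    from pyspark.sql import SparkSession
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import cross_val_score

    spark = SparkSession.builder.getOrCreate()

    # Toy training set, replicated once per candidate value of max_depth.
    train_pdf = pd.DataFrame({"x1": range(100), "x2": range(100, 200),
                              "y": [i % 2 for i in range(100)]})
    candidates = spark.createDataFrame([(d,) for d in (1, 3, 5, 7)], ["max_depth"])
    grid = candidates.crossJoin(spark.createDataFrame(train_pdf))

    def fit_one(pdf: pd.DataFrame) -> pd.DataFrame:
        depth = int(pdf["max_depth"].iloc[0])
        model = GradientBoostingClassifier(max_depth=depth)
        score = cross_val_score(model, pdf[["x1", "x2"]], pdf["y"], cv=3).mean()
        return pd.DataFrame({"max_depth": [depth], "cv_score": [score]})

    results = grid.groupBy("max_depth").applyInPandas(
        fit_one, schema="max_depth long, cv_score double")
    results.orderBy("cv_score", ascending=False).show()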

Tim Becker

thank you 🙂

Tim Becker

Thank you 🙂

adanai

Hello Jonathan, just the kind of book I am interested in reading further!
I have been using Python (pandas/sklearn) for small-to-medium data-related tasks (analysis/ML), and it is the only programming language I have some amount of exposure to.
In order to get started and work further with PySpark, do I need some prerequisite understanding of the Spark architecture and its native implementation for working with big data?
Also, what is the approach to take to work with the Scala/Java implementation in case the switch has to be made?

Jonathan Rioux

Hello!
TL;DR: I wrote my book for someone like you! 🙂
You don't have to use Java/Scala to use PySpark, but you are right in saying that the architecture and code approach can be different.
In Part 1 of my book, I discuss how you can change your way of thinking about transformations vs. actions. Later, in Chapter 11, I discuss narrow vs. wide transformations to help with understanding why the data needs to move from one node to another.
Everything that you've learned so far still carries over to the big data world, fortunately 🙂 With my students and employees, the biggest "click" happens when they understand data locality (which data points need to be where in order for an operation to succeed), learn to read query plans (Chapter 11 as well), and structure code in a readable fashion.
If you really need to use Java, my now friend Jean-Georges Perrin wrote Spark in Action for the same publisher, with a more data engineering point of view. You'll find that the Spark API looks the same regardless of the language.
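A tiny illustration of those two ideas (with made-up columns): transformations only describe work until an action runs, and wide transformations are the ones that force data to move between nodes, which shows up as an Exchange in the query plan.

    # Narrow vs. wide transformations, and where the shuffle appears in the plan.
    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(1_000_000).withColumn("bucket", F.col("id") % 10)

    narrow = df.withColumn("double_id", F.col("id") * 2)   # row-local, no data movement
    wide = narrow.groupBy("bucket").count()                # needs a shuffle (Exchange)

    wide.explain()   # prints the physical plan; no data is processed yet
    wide.show()      # action: triggers the actual computation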

JosƩ Bastos

Hi Jonathan, the book seems great. Here are my questions:

  • What's the data type you struggle with the most? And the one you have the most fun with?
  • What is the most common error/assumption people make when using PySpark and general ML pipelines?
  • Best tip to make training more efficient?
    Thank you for your time!
Jonathan Rioux

Hi JosƩ!
What's the data type you struggle with the most? And the one you have the most fun with?
My favourite by far is multi-dimensional/hierarchical/document data (think JSON). I don't know why, but having that data model within the data frame (highly performant nested data frames) is awesome. It sometimes also feels like a puzzle to extract and work with the data the best way, and I like a challenge 😄.
Binary data in Spark can be a little bit of a pain (think audio, video, images, etc.). Most of the work revolves around treating everything like an independent unit (you could even use an RDD for this). I like doing ML on unstructured data, but prepping it in a distributed fashion can sometimes feel like a bit of a pain.
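As a small, hedged sketch of that kind of document data (the JSON shape here is made up): nested structs and arrays live directly in the data frame, and you can reach into them with dot notation or explode arrays into rows.

    # Nested (document-style) data in a data frame; the JSON shape is made up.
    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.getOrCreate()
    orders = spark.read.json(spark.sparkContext.parallelize([
        '{"order_id": 1, "customer": {"name": "Ada", "country": "CA"},'
        ' "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}]}'
    ]))

    orders.printSchema()  # struct and array columns, kept nested in the frame

    flat = (orders
            .withColumn("item", F.explode("items"))           # one row per array element
            .select("order_id",
                    F.col("customer.name").alias("customer"),  # dot notation into structs
                    F.col("item.sku"), F.col("item.qty")))
    flat.show()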

Jonathan Rioux

What is the most common error/assumption people make when using PySpark and general ML pipelines?
Easy 🙂 Being too precious with their data pipelines/not planning and iterating. Most data scientists I've encountered are very hesitant about training a model. I usually go straight to a simple pipeline that takes the ready-to-go features and uses a simple model (see the sketch after this list). This:

  1. Serves me as an early benchmark.
  2. Removes some of the magic of the model and helps me destress about metric anxiety.
  3. Ensures I have a complete modelling pipeline working!
    ML pipelines are super cool because you build the components independently of their application. It makes it easy to add/remove some as you go. Because of this, you are responsible for striking the right balance between planning and flexibility. I keep a notebook of what I want to do and what I expect as results, and I build my code as I go.
    (Plus, if your model takes time, you can work on the next iteration while it fits!)
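A minimal sketch of that kind of early-benchmark pipeline (the data frame df and the column names are assumptions): assemble the ready-to-go features, fit a simple model, and get a first metric on the board.

    # Minimal baseline pipeline; `df` and the column names are assumed for illustration.
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator

    assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
    lr = LogisticRegression(featuresCol="features", labelCol="label")
    pipeline = Pipeline(stages=[assembler, lr])

    train, test = df.randomSplit([0.8, 0.2], seed=42)
    model = pipeline.fit(train)
    auc = BinaryClassificationEvaluator(labelCol="label").evaluate(model.transform(test))
    print(f"Baseline AUC: {auc:.3f}")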
Jonathan Rioux

Best tip to make training more efficient?
Do you mean "speed up PySpark ML training"?

  1. Read the Spark UI when working with ML models. It'll help you see the order of operations and check whether you are spilling a lot of data.
  2. Sometimes, you can get much faster training by saving the data frame to Parquet on disk and reading it again. It's like a super-cache that clears the previous operations (see the sketch below).
  3. Use enough memory and compute. Remember that Spark uses memory for both storage of the data and compute (+ the rest that makes a computer work). Don't skimp on memory, even if it means using a smaller cluster for data transformation and then going all ham on ML.
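A sketch of tip 2 (the frame prepared_df, the SparkSession spark, and the path are all placeholders): write the prepared frame to Parquet and read it back, so downstream ML stages start from a short query plan instead of the whole preparation lineage.

    # "Super-cache": persist to Parquet, then read it back with a clean lineage.
    prepared_df.write.mode("overwrite").parquet("/tmp/features.parquet")  # prepared_df: an existing frame
    features = spark.read.parquet("/tmp/features.parquet")

    features.explain()   # short plan: just a Parquet scan, the previous steps are gone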
Asmita

Hi Jonathan Rioux, I have been working on simple projects so far. Will this book be a good start for learning about PySpark? Also, are there any specific hardware requirements for running it?
I read your message about transformations vs actions, but is it useful for someone who is learning from scratch and has no idea about PySpark and its workings?

Jonathan Rioux

Hi Asmita!
If you know some Python, then my book will (I hope!) be useful, yes 🙂 It is meant to get you started with PySpark.
I recommend a computer with at least 8GB of RAM (if you work locally) or access to a cloud provider's Spark (there are some free offerings, such as Databricks Community Edition).
Transformations vs. actions is a pretty important concept for understanding how Spark processes data. It is explained right off the bat in Chapter 1, so you're not expected to know anything about it first.
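For a flavour of that concept (a tiny sketch, assuming an existing SparkSession named spark): transformations are lazy and only build up a plan, while actions trigger the actual computation.

    # Transformations are lazy; actions trigger the work.
    df = spark.createDataFrame([(1, 120.0), (2, 80.0), (3, 200.0)], ["customer_id", "amount"])

    filtered = df.where(df["amount"] > 100)               # transformation: nothing computed yet
    projected = filtered.select("customer_id", "amount")  # still just building the plan

    print(projected.count())                              # action: triggers the actual computation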

Asmita

That's amazing! Thank you for answering my question!

WingCode

Hi Jonathan Rioux , thank you for doing this Q&A
1# How do you estimate the resources needed for a Spark job? Is there a tool out there that looks at the volume of data and the Spark code to determine the number of executors, CPUs, and memory needed? Or is it some rule of thumb and repeated iteration to tune toward the best config?

Jonathan Rioux

Hi WingCode!
Great questions!
#1: As far as I am concerned, no such thing exists at the moment...
Size/volume of the data is important (both disk space and memory, because of potential spills), but so is the type of operations you plan on doing. If the data can be logically represented in small-ish chunks (a single node or a few nodes), your Spark code might need fewer resources because there will be less ferrying of the data around the cluster.
As you write your own programs and use the Spark UI to review the resources used by the cluster, you'll be able to adjust as needed. I remember getting a little frustrated at first because it can look like a guessing game, but you end up building your own mental model for data usage.
As an example, for time series (~14 TB of data) with ML modelling and heavy feature engineering, I comfortably used 25 machines x 60 GB of RAM. There was some spill, but it was good enough for me.

WingCode

2# In Spark local mode vs. cluster mode (YARN), what are the advantages in terms of job performance? Other than:
a. Data locality. If my RDD partition and HDFS block are on the same node, hence no shuffle.
b. Reliability. If my local executor is flaky, then my whole job's performance/reliability suffers. YARN -> multiple machines, hence a flaky node can be mitigated.
c. Vertical scaling limit. If I need very high memory or CPU counts which the cloud provider doesn't offer on a single instance.
If I manually set the number of partitions (via coalesce or repartition) in local mode, can I make multiple tasks execute in parallel? Wouldn't this be faster than cluster mode?

Jonathan Rioux

I think you hit the nail on the head. With Spark in cluster mode (whether YARN or k8s; I think the other ones are not used much...), you gain the ability to scale beyond a single machine.
coalesce() is, to me, mostly useful when you want to limit the number of partitions (for instance, before writing to disk; see the sketch below). For "short-lived" data programs, I usually don't use it much, as Spark is pretty smart about reshuffling data in the best way possible. When using older versions of Spark, it helped, but since Spark 3.x I've been pretty keen on relying on the default behaviour 🙂
The whole essence of Spark is its ability to scale horizontally. If you have a single beefy machine, you might get better performance from another tool (although Spark on a single node is pretty speedy, see my previous answers).
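A quick sketch of that coalesce() use case (the frame name and path are made up): limit the number of output files right before the write, without forcing a full shuffle.

    # Limit output files before writing; `result` is some existing data frame.
    result.coalesce(8).write.mode("overwrite").parquet("/output/daily_report")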

WingCode

3# How significant is the overhead of using Scala for Spark versus Python for Spark (since Python data structures have to be converted to Scala data types)? Is it worthwhile to convert frequently run PySpark jobs in production to Scala Spark for performance & better resource utilisation?

Jonathan Rioux

When using RDDs and regular Python UDFs, you'll pay a pretty significant serialization/deserialization (serde) tax going from Spark's data model to something Python can consume.
The data frame API, even in Python, maps to JVM instructions and performance is quite similar to Spark in Java/Scala.
Now, with Pandas UDF and Arrow serialization, you can use Python/Pandas code within PySpark with minimal overhead. Python will not always be as fast as Java/Scala (Pandas/NumPy owe much of their speed to highly optimized low-level code), but the gap is narrowing.

WingCode

4# Do you think serverless Spark (https://cloud.google.com/solutions/spark) is going to be the next big thing for data analysis & ML, similar to managed Kubernetes engines? In your experience, do most orgs run private Spark clusters or use some cloud offering (DataStax, Databricks)?

Jonathan Rioux

I find the expression "serverless Spark" quite hilarious 😄 because you are always concerned about the servers being used. Most serverless Spark offerings rebrand the compute+memory units, which is counterproductive, as most Spark users understand RAM quantities pretty well.
I think that managed Spark is a pretty appealing option for data analysis and data science. Databricks is the dominant force here, but I've seen great success with/used other managed Spark products.
Looking at the link you provided, I think it's a pretty appealing offering for the cloud provider, because you give them the keys to scale for you. In practice, your IT organization might set pretty stiff limitations there. I also have not seen auto-provisioning that was so fast that I was convinced to move away from "provisioning the cluster myself".

WingCode

Jonathan Rioux Thank you for all the super detailed answers. 🙂

Jonathan Rioux

My pleasure 🙂 Thanks for the excellent questions!

Jonathan Rioux

Hi everyone!
As for memory usage, especially when I get to scale a cluster, I use this image to remind me of the cluster memory vs. the available memory. On top of this, remember that Spark will spill to disk (which is slower, but not catastrophic).
When scaling a cluster, I've found that most people try to skimp on memory. If you plan on processing TBs of data, it's not silly to have a lot of RAM (depending on the operations/transformations/modelling); you won't have fun if you only total 500 GB of RAM across your cluster.
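As an illustration only (the values are arbitrary, not a recommendation): executor memory is usually requested explicitly when the session or cluster is provisioned, so that both storage and compute have room to breathe.

    # Arbitrary example values; tune to your data volume and operations.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("memory-sizing-example")
             .config("spark.executor.memory", "60g")
             .config("spark.executor.memoryOverhead", "6g")
             .getOrCreate())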

Lavanya M K

Hi Jonathan Rioux

  1. Which is the best tool to use with PySpark for EDA and data visualization? I usually use Zeppelin to do this.
  2. Is there any difference in terms of resource utilisation or speed between using Scala Spark and PySpark?
Jonathan Rioux

Hello Lavanya!
#1: Within Databricks, you can create charts from data frames pretty easily. It uses Plotly on the back-end. PySpark also has the summary() and describe() methods you can use for diagnosing the data (at a high level) within columns.
I haven't used Zeppelin in a long time (and only for the notebook interface), but Jupyter (open-source Spark) or Databricks notebooks (Databricks) should provide the same functionality.
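The two diagnostic methods mentioned above, in a quick hedged example (assuming an existing data frame df with made-up numeric columns amount and qty):

    df.describe("amount", "qty").show()   # count, mean, stddev, min, max
    df.summary().show()                   # adds the 25%, 50%, and 75% quartiles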

Jonathan Rioux

#2: With the data frame API, PySpark and Spark are pretty similar: under the hood, PySpark calls the same JVM methods as Spark/Java. In some cases (for instance, when using the RDD or UDFs), you will see differences (because you're using plain Python/Pandas code and not the optimized Spark routines).
If you are more comfortable using Python, PySpark will serve you quite well!

Lavanya M K
  1. How efficient is it to use Spark UDFs to perform data processing?
Jonathan Rioux

I distinguish two types of UDFs:
Python UDFs are used with the udf() function/decorator. They ferry the data from Spark to Python and then use Python to process the records. Those UDFs are usually slower because of the data serialization/deserialization and also because Python can be slower than Scala/Java.
Pandas UDFs are used with the pandas_udf() function/decorator and are much faster than Python UDFs. They use Arrow to convert the data from Spark to Pandas. Furthermore, Pandas can be very fast when using vectorized instructions.
UDFs are useful when you need to do something that is hard to do with the PySpark API. See them as a specialized tool that unlocks additional capabilities, not as a replacement for Spark's core API.
In terms of efficiency, I have read articles claiming that Pandas UDFs can sometimes be more efficient than the Spark core data transformation API (!). I have not seen this in practice, usually witnessing a small slowdown (going through a few examples I wrote for fun, and don't take this as an official benchmark, I see anything from 0.9-10.2x). I still use them quite a bit for my own data analysis and modelling. :)
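A small hedged comparison of the two flavours described above (toy logic and toy data, not a benchmark):

    # Python UDF vs. Pandas UDF on the same toy computation.
    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf, pandas_udf
    from pyspark.sql.types import DoubleType

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(1_000_000).selectExpr("id", "id * 0.5 AS x")

    @udf(DoubleType())
    def plus_tax_python(x):                 # Python UDF: called row by row, heavier serde
        return x * 1.15

    @pandas_udf(DoubleType())
    def plus_tax_pandas(x: pd.Series) -> pd.Series:   # Pandas UDF: vectorized over Arrow batches
        return x * 1.15

    df.select(plus_tax_python("x"), plus_tax_pandas("x")).show(5)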

Jonathan Rioux

Additionally: I often use UDFs/Pandas UDFs for promoting local Python/Pandas code to PySpark. It allows you to scale the code (although more slowly) without needing a rewrite. Bonus for me, since I can use the old "promoted function" as a test when rewriting in "native" Spark. 🙂

Alexey Grigorev

I remember a few years ago Apache Zeppelin was getting some traction as a nice and easy way to do analytics with Spark. But I haven't heard about it for a while. Is it still used? What do you think about it?

Alexey Grigorev

I've seen in a thread that you mentioned that Jupyter has the same functionality. So does that mean people prefer to use it more often than Zeppelin?

WingCode

People in our org use both (Zeppelin + Scala OSS Spark, Jupyter + PySpark). Personally, I feel Zeppelin is one of the best integrations out there if you want to use Scala + OSS Spark, with built-in viz, versioning, and easy configuration of Spark (memory, CPU, Spark mode, reuse of the same interpreter across users, user access controls).
If you need such features from Spark + Jupyter, I think you need to manually configure so many things, which is a pain.

Jonathan Rioux

I have not used Zeppelin in a very long time (~2 years), so my opinion is a little dated.
Most cloud offerings have Jupyter integration baked in, which is OK. Databricks still has, IMHO, the best notebook integration. I think that Zeppelin/Livy had the option to work across languages for the longest time, but it's not something I used extensively.

Jonathan Rioux

In the end, I think that comfort level and habit are quite important. There can be a little bit of tool fatigue at times 🙂 But WingCode is right that sometimes configuration/configurability can be a pain.

Lavanya M K

Another drawback of Zeppelin is that it gets slow at rendering when creating visualisations.

WingCode

Hi Jonathan Rioux,
I had a few more questions 🙂
#1 Are there any advantages to using HDFS 3.X over HDFS 2.X with Spark?

Jonathan Rioux

Good question!
I think that Spark uses Hadoop only for storage, so if Hadoop 3.0 is faster, all the better. In practice, I've used Spark mostly in a cloud setting (GCS, S3, Blob Storage, DBFS, BigQuery, etc.): it really boils down to where your data is and how you get Spark provisioned.
This might be a little out of my wheelhouse but I hope it answered your question!

WingCode

Yes, that answered my question 🙂 Thank you Jonathan

WingCode

#2 Do you recommend any other profiling tool for OSS Spark jobs other than the information we get through the Spark Web UI?

Jonathan Rioux

I'm not sure if it's still available standalone, but I was impressed by the work the folks at Data Mechanics (https://www.datamechanics.co/) did on the Spark UI. This is the only other product I've used and can comment on.
Update: https://www.datamechanics.co/delight ← there you go, it's over there :)

WingCode

This is a cool UI; unfortunately, the dashboard server is not open-source. Thank you for the link, Jonathan 🙂

WingCode

#3 Do you think users of Spark need to have some basic understanding of the internals to use it better, compared to other data processing tools like Pandas and NumPy?

Jonathan Rioux

I think that every library user should read the documentation/data model of what they use 😉
Spark has a pretty straightforward data model abstraction (the data frame is easy to understand, yet very flexible). For starters, understanding this is enough.
As you go along, it'll be insightful to gain deeper knowledge of how and where the data is processed (partitions, skew, etc.), as it'll help you profile and reason about your programs. It's not necessary (IMHO) at the start, but it becomes important if you want to really grok Spark.

Doink

Alexey Grigorev I am seeing some wonderful discussions happening here, and I'm wondering whether these discussions are saved in a separate document for future reference?

Alexey Grigorev

They are!
This is how it looks for previous books - https://datatalks.club/books/20210517-grokking-deep-reinforcement-learning.html

Doink

https://datatalks.club/books/20211011-mastering-transformers.html Here I didn't see anything, hence I asked.

Alexey Grigorev

I'll publish the answers eventually. It takes some time, that's why I batch-process it =)

Doink

Oh okay, got it. Thank you so much for this. Is there a way we can contribute, like in the form of financial donations? You are really doing great work!!

Alexey Grigorev

Thanks for asking! Yes, there's a way to do it: https://github.com/sponsors/alexeygrigorev

Wendy Mak

Out of curiosity, is this automatable?

Alexey Grigorev

I have a script that generates a yaml with messages. It automates like 90% of the work.
The part that's not automated yet is:

  • Selecting the week
  • Cleaning the responses a bit, like removing the announcements
  • Putting the yaml on the website
Wendy Mak

Now wondering if the remaining 10% is doable with GitHub Actions 😂

Alexey Grigorev

They probably are; I'm sure it's automatable. Maybe the second one is a bit tricky, though.

Lalit Pagaria

The second one is a good problem for Zoomcamp 🙂
Anyway, for the rest of the tasks (at least the coding, not ML), if you need hands, let me know.

Hironori Sakai

Hi Jonathan Rioux. I have a question about dependency management for PySpark. When we need a library for a PySpark job, we may 1) install the library on all nodes or 2) submit all required libraries with the --py-files option. I do not think these options are realistic if the dependency is quite large (e.g. installing/updating spacy, which requires around 30 libraries).
My question is: what is the best approach to submitting a PySpark Job requiring a large dependency?
(If we use Scala, we do not have such a problem, because we can pack all needed packages in a JAR file.)

Jonathan Rioux

This is such a good question! --py-files, IIRC, does not work with libraries that have C/C++ code, such as spacy.
It boils down to which environment you are using. Are you on a cloud installation (Databricks, EMR, Glue, HDInsight, Dataproc, etc.)? If this is the case, you can specify and install dependencies at cluster creation time or runtime. This also has the advantage that you can "version" the dependencies of your cluster. I used Dataproc (GCP) for a pretty significant project a few years ago, and this is the route we took. Databricks has dbutils, where you can install libraries on the cluster with a simple command from the notebook.
(Databricks, on top of this, has runtime versions that have predictable libraries and versions. I am a big fan of those, just to avoid thinking about which one to use 😄 See, for example, spacy 3.1.2 on Databricks Runtime 10.0 ML: https://docs.databricks.com/release-notes/runtime/10.0ml.html)
If you are "on-prem" or the cluster is not meant to be ephemeral, you need to be a little more careful with dependency management. Again, this is product-dependent (what do you use to manage your hardware provisioning?), but I would assume that orchestration tools can help with this. I think that, in this case, you need to be more conservative with your dependencies to avoid clashes if multiple users are there.
This is one of the cases where I envy the JVM/jar world... 🙂
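One more option worth mentioning (a sketch of the pattern from Spark's Python packaging guide, not something from the answer above): pack a conda environment and ship it with the job via spark.archives, so executors get the full dependency tree. This assumes you ran something like conda pack -o pyspark_env.tar.gz beforehand; the file name is a placeholder.

    # Ship a packed conda environment alongside the job (Spark 3.1+).
    import os
    from pyspark.sql import SparkSession

    os.environ["PYSPARK_PYTHON"] = "./environment/bin/python"

    spark = (SparkSession.builder
             .config("spark.archives",                 # spark.yarn.dist.archives on YARN
                     "pyspark_env.tar.gz#environment")
             .getOrCreate())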

Jonathan Rioux

Another example: installing dependencies on GCP dataproc.
https://cloud.google.com/dataproc/docs/tutorials/python-configuration#image_version_20

Hironori Sakai

> I would assume that orchestration tools can help with this.
Thanks for your answer! I did not have this viewpoint.

To take part in the book of the week event:

  • Register in our Slack
  • Join the #book-of-the-week channel
  • Ask as many questions as you'd like
  • The book authors answer questions from Monday till Thursday
  • On Friday, the authors decide who wins free copies of their book

To see other books, check the book of the week page.


