English | MP4 | AVC 1280×720 | AAC 44.1 kHz 2ch | 10h 31m | 1.58 GB
Think big about your data! PySpark brings the powerful Spark big data processing engine to the Python ecosystem, letting you seamlessly scale up your data tasks and create lightning-fast pipelines.
In Data Analysis with Python and PySpark you will learn how to:
- Manage your data as it scales across multiple machines
- Scale up your data programs with full confidence
- Read and write data to and from a variety of sources and formats
- Deal with messy data with PySpark’s data manipulation functionality
- Discover new data sets and perform exploratory data analysis
- Build automated data pipelines that transform, summarize, and get insights from data
- Troubleshoot common PySpark errors
- Create reliable long-running jobs
Data Analysis with Python and PySpark is your guide to delivering successful Python-driven data projects. Packed with relevant examples and essential techniques, this practical book teaches you to build pipelines for reporting, machine learning, and other data-centric tasks. Quick exercises in every chapter help you practice what you’ve learned and start putting PySpark to work in your own data systems right away. No previous knowledge of Spark is required.
The Spark data processing engine is an amazing analytics factory: raw data comes in, insight comes out. PySpark wraps Spark’s core engine with a Python-based API. It helps simplify Spark’s steep learning curve and makes this powerful tool available to anyone working in the Python data ecosystem.
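To see how little ceremony that API requires, here is a minimal sketch of a first PySpark program in the spirit of the book's opening chapters: a word count over a plain-text file. The file path is a placeholder, and a local Spark installation (`pip install pyspark`) is assumed.

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

# A SparkSession is the entry point to the PySpark data frame API.
spark = SparkSession.builder.appName("word_count").getOrCreate()

results = (
    spark.read.text("./data/book.txt")  # placeholder path: one line per row
    .select(F.split(F.col("value"), " ").alias("line"))   # line -> array of words
    .select(F.explode(F.col("line")).alias("word"))       # one word per row
    .select(F.lower(F.col("word")).alias("word"))         # normalize case
    .groupBy("word")
    .count()
    .orderBy(F.desc("count"))
)

results.show(10)  # ten most frequent words
```

The same chained data frame style runs unchanged whether Spark is on a laptop or a cluster, which is the scaling story the book builds on.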
Data Analysis with Python and PySpark helps you solve the daily challenges of data science with PySpark. You’ll learn how to scale your processing capabilities across multiple machines while ingesting data from any source—whether that’s Hadoop clusters, cloud data storage, or local data files. Once you’ve covered the fundamentals, you’ll explore the full versatility of PySpark by building machine learning pipelines, and blending Python, pandas, and PySpark code.
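As a taste of that blending, the sketch below reads a delimited file and applies a vectorized pandas UDF, one of the techniques the book covers. The CSV path and the temp_f column are illustrative assumptions, and pandas UDFs additionally require pyarrow to be installed.

```python
import pandas as pd
import pyspark.sql.functions as F
import pyspark.sql.types as T
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Reading delimited data; JSON or Parquet sources use spark.read.json /
# spark.read.parquet the same way. The path is a placeholder.
df = spark.read.csv("./data/temperatures.csv", header=True, inferSchema=True)

@F.pandas_udf(T.DoubleType())
def f_to_c(degrees: pd.Series) -> pd.Series:
    # Plain pandas code, executed in vectorized batches on the Spark workers.
    return (degrees - 32) * 5.0 / 9.0

# "temp_f" is an assumed column name for this example.
df = df.withColumn("temp_c", f_to_c(F.col("temp_f")))
df.show(5)
```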
What’s inside
- Organizing your PySpark code
- Managing your data, no matter the size
- Scaling up your data programs with full confidence
- Troubleshooting common data pipeline problems
- Creating reliable long-running jobs
Table of Contents
Part 1: Get acquainted: First steps in PySpark
1. Introduction
- Your very own factory: How PySpark works
- What will you learn in this book?
- What do I need to get started?
- Summary
2. Your first data program in PySpark
- Mapping our program
- Ingest and explore: Setting the stage for data transformation
- Simple column transformations: Moving from a sentence to a list of words
- Filtering rows
- Summary
3. Submitting and scaling your first PySpark program
- Ordering the results on the screen using orderBy
- Writing data from a data frame
- Putting it all together: Counting
- Using spark-submit to launch your program in batch mode
- What didn't happen in this chapter
- Scaling up our word frequency program
- Summary
4. Analyzing tabular data with pyspark.sql
- PySpark for analyzing and processing tabular data
- Reading and assessing delimited data in PySpark
- The basics of data manipulation: Selecting, dropping, renaming, ordering, diagnosing
- Summary
5. Data frame gymnastics: Joining and grouping
- Summarizing the data via groupby and GroupedData
- Taking care of null values: Drop and fill
- What was our question again? Our end-to-end program
- Summary
Part 2: Get proficient: Translate your ideas into code
6. Multidimensional data frames: Using PySpark with JSON data
- Breaking the second dimension with complex data types
- The struct: Nesting columns within columns
- Building and using the data frame schema
- Putting it all together: Reducing duplicate data with complex data types
- Summary
7. Bilingual PySpark: Blending Python and SQL code
- Preparing a data frame for SQL
- SQL and PySpark
- Using SQL-like syntax within data frame methods
- Simplifying our code: Blending SQL and Python
- Conclusion
- Summary
8. Extending PySpark with Python: RDD and UDFs
- Using Python to extend PySpark via UDFs
- Summary
9. Big data is just a lot of small data: Using pandas UDFs
- UDFs on grouped data: Aggregate and apply
- What to use, when
- Summary
10. Your data under a different lens: Window functions
- Beyond summarizing: Using ranking and analytical functions
- Flex those windows! Using row and range boundaries
- Going full circle: Using UDFs within windows
- Look in the window: The main steps to a successful window function
- Summary
11. Faster PySpark: Understanding Spark's query planning
- Thinking about performance: Operations and memory
- Summary
Part 3: Get confident: Using machine learning with PySpark
12. Setting the stage: Preparing features for machine learning
- Feature creation and refinement
- Feature preparation with transformers and estimators
- Summary
13. Robust machine learning with ML Pipelines
- Building a (complete) machine learning pipeline
- Evaluating and optimizing our model
- Getting the biggest drivers from our model: Extracting the coefficients
- Summary
14. Building custom ML transformers and estimators
- Creating your own estimator
- Using our transformer and estimator in an ML pipeline
- Summary
Appendix: Some useful Python concepts
- Packing and unpacking arguments (*args and **kwargs)
- Python closures and the PySpark transform() method
- Python decorators: Wrapping a function to change its behavior
- Python's typing and mypy/pyright