English | MP4 | AVC 1280×720 | AAC 48 kHz 2ch | 1h 54m | 265 MB
Data integrity is the new focal point of the data science revolution. Now that everybody is on board with the role of data in people’s lives and businesses, it’s fair to ask, “Can you prove that your data is accurate?” In this course, you can learn how to identify and address many of the data integrity issues facing modern data scientists, using R and the tidyverse. Discover how to handle missing values and duplicated data. Find out how to convert data between different units and tackle poorly formatted text. Plus, learn how to detect outliers, address structural issues, and identify red flags that indicate potential data quality issues.
Where possible, instructor Mike Chapple shows how to correct the issues using R, but the same principles can be applied to any statistical programming language.
Topics include:
- Missing data
- Duplicate rows and values
- Converting data
- Formatting data
- Working with tidy data
- Tidying data sets
- Dealing with suspicious data
Table of Contents
Introduction
1 Data is messy
2 What you need to know
Missing Data
3 Types of missing data
4 Missing values
5 Missing rows
6 Aggregations and missing values
Duplicated Data
7 Duplicated rows and values
8 Aggregations in the data set
Formatting Data
9 Converting dates
10 Unit conversions
11 Numbers stored as text
12 Text improperly converted to numbers
13 Inconsistent spellings
Outliers
14 Screening for outliers
15 Handling outliers
16 Outliers use case
17 Outliers in subgroups
18 Detecting illogical values
Tidy Data
19 What is tidy data
20 Variables, observations, and values
21 Common data problems
22 Wide vs. long data sets
23 Making wide data sets long
24 Making long data sets wide
Red Flags
25 Suspicious values
26 Suspicious multiples
Conclusion
27 What’s next