EDA and Visualization in R
Exploratory Data Analysis refers to the critical process of performing initial investigations on data to discover patterns, spot anomalies, test hypotheses and check assumptions with the help of summary statistics and graphical representations.
I started this project to learn R Markdown, the tidyverse packages, and ggplot2 for exploring data and building visualizations. I used built-in datasets throughout: starwars and mpg ship with tidyverse packages (dplyr and ggplot2), while BOD and CO2 come from base R's datasets package. The rendered R Markdown document is reproduced below.
EDA and Visualisation in R
Meshach Aderele
2022-10-28
R Markdown
Working with the starwars dataset in tidyverse
glimpse(starwars)
## Rows: 87
## Columns: 14
## $ name <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Or…
## $ height <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 2…
## $ mass <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.…
## $ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brown", N…
## $ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "light", "…
## $ eye_color <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "blue",…
## $ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0, …
## $ sex <chr> "male", "none", "none", "male", "female", "male", "female",…
## $ gender <chr> "masculine", "masculine", "masculine", "masculine", "femini…
## $ homeworld <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", "T…
## $ species <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Huma…
## $ films <list> <"The Empire Strikes Back", "Revenge of the Sith", "Return…
## $ vehicles <list> <"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>, "Imp…
## $ starships <list> <"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced x1",…
head(starwars, 5)
## # A tibble: 5 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke Sky… 172 77 blond fair blue 19 male mascu…
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
## 3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
## 4 Darth Va… 202 136 none white yellow 41.9 male mascu…
## 5 Leia Org… 150 49 brown light brown 19 fema… femin…
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
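The glimpse() output above already shows NA values in several columns. As a minimal sketch of how those gaps could be counted per column (assuming dplyr is installed; the name `na_counts` is only illustrative), the list columns are dropped first because a single NA count is not meaningful for them:

```r
library(dplyr)

# Count missing values in every non-list column of starwars.
na_counts <- starwars %>%
  select(!where(is.list)) %>%                        # drop films, vehicles, starships
  summarise(across(everything(), ~ sum(is.na(.x))))  # one NA count per column

na_counts
```

The result is a one-row tibble with one count per remaining column, which makes it easy to see which variables need care before plotting.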
Frequency table of hair color in starwars
starwars %>%
  select(hair_color) %>%
  count(hair_color) %>%
  arrange(desc(n)) %>%
  head()
## # A tibble: 6 × 2
## hair_color n
## <chr> <int>
## 1 none 37
## 2 brown 18
## 3 black 13
## 4 <NA> 5
## 5 white 4
## 6 blond 3
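A frequency table is often easier to read with proportions next to the raw counts. The sketch below is one possible variant, not the code used in the original document: it relies on the `sort = TRUE` argument of `count()` in place of a separate `arrange()` step, and `hair_freq` and `prop` are illustrative names.

```r
library(dplyr)

# Count each hair colour, sort by frequency, and add the share of all rows.
hair_freq <- starwars %>%
  count(hair_color, sort = TRUE) %>%
  mutate(prop = n / sum(n))

head(hair_freq)
```

Because NA is kept as its own category, the proportions sum to 1 across all characters, including those with no recorded hair colour.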
Working with ggplot for the BOD dataset (built into base R)
ggplot(data = BOD,
       mapping = aes(x = Time,
                     y = demand)) +
  geom_point(size = 5) +
  geom_line(colour = "red")
Working with ggplot for the CO2 dataset (built into base R)
### Plot using a linear model
CO2 %>%
  ggplot(aes(conc, uptake,
             colour = Treatment)) +
  geom_point(size = 3, alpha = 0.5) +        # a single point layer is enough
  geom_smooth(method = "lm", se = FALSE) +   # straight-line fit, no confidence band
  facet_wrap(~Type) +
  labs(title = "Concentration of CO2") +
  theme_bw()
## `geom_smooth()` using formula 'y ~ x'
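The linear-model plot draws one straight line per Treatment within each Type panel. To put numbers on those lines, the slopes can be estimated directly with `lm()`. This is a rough sketch in base R rather than part of the original analysis, and `groups` and `slopes` are illustrative names.

```r
# CO2 ships with base R (datasets package), so no extra packages are needed.
# Split the data into the four Type/Treatment combinations.
groups <- split(CO2, list(CO2$Type, CO2$Treatment))

# Fit uptake ~ conc within each group and keep only the slope coefficient.
slopes <- sapply(groups, function(d) coef(lm(uptake ~ conc, data = d))[["conc"]])

round(slopes, 4)
```

Each value is the estimated change in uptake per unit of concentration for one group, so comparing them is a quick numeric check on what the fitted lines suggest visually.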
### Plot using a boxplot
CO2 %>%
  ggplot(aes(Treatment, uptake)) +
  geom_boxplot() +
  geom_point(alpha = 0.5,
             aes(size = conc,
                 color = Plant)) +
  facet_wrap(~Type) +
  coord_flip() +
  theme_bw() +
  labs(title = "Chilled vs Non-chilled")   # labs() needs a named argument to set the title
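The boxes in the plot above summarise each Type/Treatment group by its quartiles. A minimal sketch of computing those numbers directly, assuming dplyr is installed (`uptake_summary`, `q1`, and `q3` are illustrative names):

```r
library(dplyr)

# Quartiles of uptake for each Type/Treatment combination.
# as.data.frame() strips the extra grouped-data classes that CO2 carries.
uptake_summary <- as.data.frame(CO2) %>%
  group_by(Type, Treatment) %>%
  summarise(
    q1     = quantile(uptake, 0.25, names = FALSE),
    median = median(uptake),
    q3     = quantile(uptake, 0.75, names = FALSE),
    .groups = "drop"
  )

uptake_summary
```

Reading the table alongside the plot helps confirm which group has the highest typical uptake and how wide each box is.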
Working with ggplot for the mpg dataset in tidyverse
### Plot using the default smoother
head(mpg, 5)
## # A tibble: 5 × 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compa…
## 2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compa…
## 3 audi a4 2 2008 4 manual(m6) f 20 31 p compa…
## 4 audi a4 2 2008 4 auto(av) f 21 30 p compa…
## 5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compa…
mpg %>%
  ggplot(aes(displ, cty)) +
  geom_point(aes(colour = drv,
                 size = trans),
             alpha = 0.5) +
  geom_smooth() +
  facet_wrap(~year, nrow = 1)
## Warning: Using size for a discrete variable is not advised.
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
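The warning above appears because trans is a categorical variable and point size implies an ordering. One possible alternative, sketched here under the assumption that the automatic/manual distinction is what matters, is to collapse trans into two groups and map that to shape instead. The `trans_type` column and the `mpg_simple` and `p` names are hypothetical and are not part of the original analysis.

```r
library(dplyr)
library(ggplot2)

# Collapse the detailed transmission codes, e.g. "auto(l5)" or "manual(m6)",
# into two categories.
mpg_simple <- mpg %>%
  mutate(trans_type = if_else(startsWith(trans, "auto"), "auto", "manual"))

# Map the two-level variable to shape rather than size.
p <- ggplot(mpg_simple, aes(displ, cty)) +
  geom_point(aes(colour = drv, shape = trans_type), alpha = 0.5) +
  geom_smooth(method = "loess", formula = y ~ x) +
  facet_wrap(~year, nrow = 1)

p
```

Shape carries no implied ordering, so it suits a two-level category better than size does, and naming the method and formula explicitly also silences the informational message from `geom_smooth()`.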
### Plot using the linear model method
mpg %>%
  ggplot(aes(displ, cty)) +
  geom_point(aes(colour = drv,
                 size = trans),
             alpha = 0.5) +
  geom_smooth(method = "lm") +
  facet_wrap(~year, nrow = 1) +
  labs(x = "Engine Size", y = "MPG in The City",
       title = "Fuel Efficiency") +
  theme_bw()
## Warning: Using size for a discrete variable is not advised.
## `geom_smooth()` using formula 'y ~ x'
Lessons Learned
Working on this project improved my proficiency in exploratory data analysis. I learned to inspect my data to spot anomalies, and I learned how to make plots with good aesthetics. This interests me because of one of the Laws of User Experience, the Aesthetic-Usability Effect: users tend to perceive attractive products as more usable, and people tend to believe that things that look better will work better. With good aesthetics, anyone using the plots to gain insight into the data is more likely to find them usable.