EDA and Visualization in R

Exploratory Data Analysis refers to the critical process of performing initial investigations on data to discover patterns, spot anomalies, test hypotheses and check assumptions with the help of summary statistics and graphical representations.

I took on this project to learn R Markdown, the tidyverse library and the ggplot function for exploring data and building clear visualizations. I used datasets that are available once the tidyverse is loaded: starwars (from dplyr) and mpg (from ggplot2), plus BOD and CO2, which ship with base R. The output of the R Markdown file is reproduced below:

R Markdown

#Working with the starwars dataset in the tidyverse

library(tidyverse)
glimpse(starwars)
## Rows: 87
## Columns: 14
## $ name       <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Or…
## $ height     <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 2…
## $ mass       <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.…
## $ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brown", N…
## $ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "light", "…
## $ eye_color  <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "blue",…
## $ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0, …
## $ sex        <chr> "male", "none", "none", "male", "female", "male", "female",…
## $ gender     <chr> "masculine", "masculine", "masculine", "masculine", "femini…
## $ homeworld  <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", "T…
## $ species    <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Huma…
## $ films      <list> <"The Empire Strikes Back", "Revenge of the Sith", "Return…
## $ vehicles   <list> <"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>, "Imp…
## $ starships  <list> <"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced x1",…
head(starwars, 5)
## # A tibble: 5 × 14
##   name      height  mass hair_color skin_color eye_color birth_year sex   gender
##   <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
## 1 Luke Sky…    172    77 blond      fair       blue            19   male  mascu…
## 2 C-3PO        167    75 <NA>       gold       yellow         112   none  mascu…
## 3 R2-D2         96    32 <NA>       white, bl… red             33   none  mascu…
## 4 Darth Va…    202   136 none       white      yellow          41.9 male  mascu…
## 5 Leia Org…    150    49 brown      light      brown           19   fema… femin…
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>
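
Because part of EDA is spotting anomalies, and the glimpse() output above already shows several NA values, a per-column count of missing values is a natural next check. The following is a minimal sketch rather than part of the original R Markdown file. It assumes dplyr 1.0 or later (for across()), and na_counts is a name introduced here for illustration.

```r
library(dplyr)

# Count missing values in every non-list column of starwars.
# `na_counts` is a hypothetical name used only for this sketch.
na_counts <- starwars %>%
  summarise(across(!where(is.list), ~ sum(is.na(.x))))

na_counts
```

The three list columns (films, vehicles, starships) are skipped because is.na() on a list column does not describe missing entries in the same way.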

Frequency table of hair colour in starwars

# Count characters per hair colour and show the six most common values
starwars %>%
  count(hair_color, sort = TRUE) %>%
  head()
## # A tibble: 6 × 2
##   hair_color     n
##   <chr>      <int>
## 1 none          37
## 2 brown         18
## 3 black         13
## 4 <NA>           5
## 5 white          4
## 6 blond          3
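
A frequency table is often easier to read with proportions next to the raw counts. This is a small sketch of one way to add them, not part of the original file. The names hair_freq and share are assumptions made for this example.

```r
library(dplyr)

# Counts per hair colour, sorted, plus each value's share of all characters.
# `hair_freq` and `share` are hypothetical names used only for this sketch.
hair_freq <- starwars %>%
  count(hair_color, sort = TRUE) %>%
  mutate(share = n / sum(n))

head(hair_freq)
```

Keeping NA as its own row, as count() does by default, makes the amount of missing data visible in the same table.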

Working with ggplot for the BOD dataset (base R)

ggplot(data = BOD,
       mapping = aes(x = Time,
                     y = demand))+
  geom_point(size = 5)+
  geom_line(colour = "red")

Working with ggplot for the CO2 dataset (base R)

###Plot using a linear model

CO2 %>%
  ggplot(aes(conc, uptake,
             colour = Treatment))+
  # one semi-transparent point layer; a second, opaque layer would hide the alpha
  geom_point(size = 3, alpha = 0.5)+
  geom_smooth(method = "lm", se = FALSE)+
  facet_wrap(~Type)+
  labs(title = "Concentration of CO2")+
  theme_bw()
## `geom_smooth()` using formula 'y ~ x'
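
With colour mapped to Treatment and facets by Type, geom_smooth() fits a separate straight line for each Treatment within each panel. To see the numbers behind those lines, a sketch like the following fits the same kind of model with lm() for every Type and Treatment combination. It is an illustration, not part of the original file. The names co2_df, groups and slopes are assumptions made here.

```r
# Plain data frame copy of the built-in CO2 data
co2_df <- as.data.frame(CO2)

# One subset per Type x Treatment combination (four in total)
groups <- split(co2_df, list(co2_df$Type, co2_df$Treatment))

# Slope of uptake on conc for each subset, from a simple linear model
slopes <- sapply(groups, function(d) coef(lm(uptake ~ conc, data = d))[["conc"]])

slopes
```

Each slope is the change in uptake per unit of conc for that group, which is what the steepness of the corresponding line in the plot represents.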

###Plot using boxplot

CO2 %>% 
  ggplot(aes(Treatment, uptake))+
  geom_boxplot()+
  geom_point(alpha = 0.5,
             aes(size = conc,
                 color = Plant))+
  facet_wrap(~Type)+
  coord_flip()+
  theme_bw()+
  labs(title = "Chilled vs Non-chilled")

Working with ggplot for the mpg dataset in the tidyverse

###Plot using the default loess smoother

head(mpg, 5)
## # A tibble: 5 × 11
##   manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class 
##   <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr> 
## 1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa…
## 2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa…
## 3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa…
## 4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa…
## 5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa…
mpg %>%
  ggplot(aes(displ, cty))+
  geom_point(aes(colour = drv,
                 size= trans),
             alpha = 0.5)+
  geom_smooth()+
  facet_wrap(~year, nrow = 1)
## Warning: Using size for a discrete variable is not advised.
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

###Plot using the linear model method

mpg %>%
  ggplot(aes(displ, cty))+
  geom_point(aes(colour = drv,
                 size= trans),
             alpha = 0.5)+
  geom_smooth(method = lm)+
  facet_wrap(~year, nrow = 1)+
  labs(x = "Engine Size", y = "MPG in The City",
       title = "Fuel Efficiency")+
  theme_bw()
## Warning: Using size for a discrete variable is not advised.
## `geom_smooth()` using formula 'y ~ x'
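
The warning above appears because size is mapped to trans, which is a discrete variable. One possible way around it, sketched here as a suggestion rather than as the approach used in the original file, is to collapse trans into two groups and map that to shape instead. The names mpg_simple, trans_type and p are introduced for this example.

```r
library(tidyverse)

# Collapse the transmission codes (e.g. "auto(l5)", "manual(m5)") into two groups.
# `mpg_simple` and `trans_type` are hypothetical names used only for this sketch.
mpg_simple <- mpg %>%
  mutate(trans_type = if_else(str_detect(trans, "^auto"), "automatic", "manual"))

# Same plot as before, with shape (not size) carrying the transmission type
p <- mpg_simple %>%
  ggplot(aes(displ, cty))+
  geom_point(aes(colour = drv, shape = trans_type),
             size = 3, alpha = 0.5)+
  geom_smooth(method = "lm", formula = y ~ x)+
  facet_wrap(~year, nrow = 1)+
  labs(x = "Engine Size", y = "MPG in The City",
       title = "Fuel Efficiency")+
  theme_bw()

p
```

Shape suits a variable with only a few categories, which is why the transmission codes are grouped first instead of being mapped directly.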

Lessons Learned

Working on this project improved my proficiency in exploratory data analysis. I learned to inspect my data to spot anomalies, and I learned how to make plots with good aesthetics. That matters to me because of one of the Laws of UX, the Aesthetic-Usability Effect: users tend to perceive attractive products as more usable, so people often believe that things that look better will work better. Plots with good aesthetics are therefore more likely to feel usable to anyone relying on them for insight into the data.