Step 1: Load Required Libraries¶
In this step, we load all the necessary R libraries that we will use throughout the analysis.
These include:

- tidyverse for data manipulation and visualization
- ggplot2 and ggpubr for creating plots and showing correlations
- lubridate for handling date-time formats
- janitor for cleaning column names
- dplyr and tidyr for tidy data operations
Make sure to install any missing packages using install.packages()
before loading them.
# Install missing packages (run only if needed)
install.packages("tidyverse")
install.packages("ggpubr")
install.packages("lubridate")
install.packages("janitor")
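Re-running install.packages() for packages that are already installed is unnecessary. As a minimal sketch, installation can be limited to what is actually missing (the `interactive()` guard is my assumption, added only to avoid accidental downloads in scripted runs):

```r
# Install only the packages that are not already present
pkgs <- c("tidyverse", "ggpubr", "lubridate", "janitor")
missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))
if (interactive() && length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}
```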
# Load core packages
library(tidyverse) # For data manipulation and visualization
library(ggplot2) # For plotting
library(ggpubr) # For correlation and advanced plots
library(tidyr) # For tidy data formatting
library(dplyr) # For wrangling
library(lubridate) # For working with date-time
library(janitor) # To clean column names
(Loading messages omitted. The tidyverse 2.0.0 attach report notes that dplyr::filter() and dplyr::lag() mask stats::filter() and stats::lag(), and janitor masks stats::chisq.test and stats::fisher.test.)
Step 2: Import the Datasets¶
In this step, we load all the Fitbit data files collected between March 12, 2016 and April 11, 2016. These include daily, hourly, and minute-level datasets covering activities, calories, steps, sleep, heart rate, and weight logs.
⚠️ Make sure the file paths are correctly set based on your local file system.
# Daily data
daily_activity <- read_csv("C:/Users/Soma/Desktop/Coursera-Projects/Case Study 2/Fitabase Data 3.12.16-4.11.16/Fitbits/dailyActivity_merged.csv")
daily_calories <- read.csv("C:/Users/Soma/Desktop/Coursera-Projects/Case Study 2/Fitabase Data 3.12.16-4.11.16/Fitbits/dailyCalories_merged.csv")
daily_intensities <- read.csv("C:/Users/Soma/Desktop/Coursera-Projects/Case Study 2/Fitabase Data 3.12.16-4.11.16/Fitbits/dailyIntensities_merged.csv")
daily_steps <- read.csv("C:/Users/Soma/Desktop/Coursera-Projects/Case Study 2/Fitabase Data 3.12.16-4.11.16/Fitbits/dailySteps_merged.csv")
# Hourly data
hourly_calories <- read.csv("C:/Users/Soma/Desktop/Coursera-Projects/Case Study 2/Fitabase Data 3.12.16-4.11.16/Fitbits/hourlyCalories_merged.csv")
hourly_intensities <- read.csv("C:/Users/Soma/Desktop/Coursera-Projects/Case Study 2/Fitabase Data 3.12.16-4.11.16/Fitbits/hourlyIntensities_merged.csv")
hourly_steps <- read.csv("C:/Users/Soma/Desktop/Coursera-Projects/Case Study 2/Fitabase Data 3.12.16-4.11.16/Fitbits/hourlySteps_merged.csv")
# Minute-level data
minute_calories <- read.csv("C:/Users/Soma/Desktop/Coursera-Projects/Case Study 2/Fitabase Data 3.12.16-4.11.16/Fitbits/minuteCaloriesNarrow_merged.csv")
minute_intensities <- read.csv("C:/Users/Soma/Desktop/Coursera-Projects/Case Study 2/Fitabase Data 3.12.16-4.11.16/Fitbits/minuteIntensitiesNarrow_merged.csv")
minute_MET <- read.csv("C:/Users/Soma/Desktop/Coursera-Projects/Case Study 2/Fitabase Data 3.12.16-4.11.16/Fitbits/minuteMETsNarrow_merged.csv")
minute_sleep <- read.csv("C:/Users/Soma/Desktop/Coursera-Projects/Case Study 2/Fitabase Data 3.12.16-4.11.16/Fitbits/minuteSleep_merged.csv")
minute_steps <- read.csv("C:/Users/Soma/Desktop/Coursera-Projects/Case Study 2/Fitabase Data 3.12.16-4.11.16/Fitbits/minuteStepsNarrow_merged.csv")
# Other data
heartrate_seconds <- read.csv("C:/Users/Soma/Desktop/Coursera-Projects/Case Study 2/Fitabase Data 3.12.16-4.11.16/Fitbits/heartrate_seconds_merged.csv")
day_sleep <- read.csv("C:/Users/Soma/Desktop/Coursera-Projects/Case Study 2/Fitabase Data 3.12.16-4.11.16/Fitbits/sleepDay_merged.csv")
weight_log <- read.csv("C:/Users/Soma/Desktop/Coursera-Projects/Case Study 2/Fitabase Data 3.12.16-4.11.16/Fitbits/weightLogInfo_merged.csv")
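Since every call repeats the same long directory, the path can be built once. A sketch assuming the same local folder (`base_dir` and `fitbit_file` are hypothetical names, not part of the original code):

```r
# Build each file path from a single base directory
base_dir <- "C:/Users/Soma/Desktop/Coursera-Projects/Case Study 2/Fitabase Data 3.12.16-4.11.16/Fitbits"
fitbit_file <- function(name) file.path(base_dir, name)

# e.g. daily_activity <- read.csv(fitbit_file("dailyActivity_merged.csv"))
```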
(read_csv reports 940 rows and 15 columns for daily_activity; ActivityDate is imported as character, which we fix in Step 3.)
Step 3: Cleaning and Inspecting the Data¶
Now that we have imported the datasets, we will:
- Inspect the structure of the datasets
- Convert date columns to proper date formats
- Identify and remove duplicates
- Check for missing values (NAs)
### 🧹 Clean and inspect key datasets:
# Preview the daily activity data
head(daily_activity)
# Check structure to identify incorrect data types (e.g., ActivityDate as character)
str(daily_activity)
# Convert date columns to POSIXct or Date format
daily_activity$ActivityDate <- as.POSIXct(daily_activity$ActivityDate, format = "%m/%d/%Y")
daily_calories$ActivityDay <- as.POSIXct(daily_calories$ActivityDay, format = "%m/%d/%Y")
daily_intensities$ActivityDay <- as.POSIXct(daily_intensities$ActivityDay, format = "%m/%d/%Y")
daily_steps$ActivityDay <- as.POSIXct(daily_steps$ActivityDay, format = "%m/%d/%Y")
heartrate_seconds$Time <- as.POSIXct(heartrate_seconds$Time, format = "%m/%d/%Y %I:%M:%S %p", tz = Sys.timezone())
hourly_calories$ActivityHour <- as.POSIXct(hourly_calories$ActivityHour, format = "%m/%d/%Y %I:%M:%S %p", tz = Sys.timezone())
hourly_intensities$ActivityHour <- as.POSIXct(hourly_intensities$ActivityHour, format = "%m/%d/%Y %I:%M:%S %p", tz = Sys.timezone())
hourly_steps$ActivityHour <- as.POSIXct(hourly_steps$ActivityHour, format = "%m/%d/%Y %I:%M:%S %p", tz = Sys.timezone())
Id | ActivityDate | TotalSteps | TotalDistance | TrackerDistance | LoggedActivitiesDistance | VeryActiveDistance | ModeratelyActiveDistance | LightActiveDistance | SedentaryActiveDistance | VeryActiveMinutes | FairlyActiveMinutes | LightlyActiveMinutes | SedentaryMinutes | Calories |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1503960366 | 4/12/2016 | 13162 | 8.50 | 8.50 | 0 | 1.88 | 0.55 | 6.06 | 0 | 25 | 13 | 328 | 728 | 1985 |
1503960366 | 4/13/2016 | 10735 | 6.97 | 6.97 | 0 | 1.57 | 0.69 | 4.71 | 0 | 21 | 19 | 217 | 776 | 1797 |
1503960366 | 4/14/2016 | 10460 | 6.74 | 6.74 | 0 | 2.44 | 0.40 | 3.91 | 0 | 30 | 11 | 181 | 1218 | 1776 |
1503960366 | 4/15/2016 | 9762 | 6.28 | 6.28 | 0 | 2.14 | 1.26 | 2.83 | 0 | 29 | 34 | 209 | 726 | 1745 |
1503960366 | 4/16/2016 | 12669 | 8.16 | 8.16 | 0 | 2.71 | 0.41 | 5.04 | 0 | 36 | 10 | 221 | 773 | 1863 |
1503960366 | 4/17/2016 | 9705 | 6.48 | 6.48 | 0 | 3.19 | 0.78 | 2.51 | 0 | 38 | 20 | 164 | 539 | 1728 |
str() confirms 940 rows and 15 columns: ActivityDate is stored as character ("4/12/2016", ...), while all remaining columns (Id, TotalSteps, the distance measures, the active-minute measures, and Calories) are numeric.
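One caveat with the conversions above: as.POSIXct() silently returns NA for strings that do not match the format, so it is worth counting NAs after each conversion. A small self-contained check with illustrative values:

```r
# A mismatched string parses to NA rather than raising an error
parsed <- as.POSIXct(c("4/12/2016", "not a date"), format = "%m/%d/%Y")
sum(is.na(parsed))  # one failed parse
```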
# Check for duplicates in key datasets
sum(duplicated(daily_activity))
sum(duplicated(daily_calories))
sum(duplicated(daily_intensities))
sum(duplicated(daily_steps))
sum(duplicated(day_sleep))
sum(duplicated(heartrate_seconds))
sum(duplicated(hourly_calories))
sum(duplicated(hourly_intensities))
sum(duplicated(hourly_steps))
sum(duplicated(weight_log))
# Remove duplicates if found
day_sleep <- distinct(day_sleep)
# Check for missing values
sum(is.na(daily_activity))
sum(is.na(daily_calories))
sum(is.na(daily_intensities))
sum(is.na(daily_steps))
sum(is.na(day_sleep))
sum(is.na(heartrate_seconds))
sum(is.na(weight_log))
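To see exactly what these duplicate checks count, here is a toy data frame (base R's unique() stands in for dplyr::distinct() so the sketch has no package dependency):

```r
# Row 2 repeats row 1, so one duplicate is reported and removed
df <- data.frame(Id = c(1, 1, 2), Steps = c(100, 100, 250))
sum(duplicated(df))   # 1 duplicate row
nrow(unique(df))      # 2 distinct rows remain
```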
head(weight_log)
Id | Date | WeightKg | WeightPounds | BMI | IsManualReport | LogId
---|---|---|---|---|---|---
1503960366 | 5/2/2016 11:59:59 PM | 52.6 | 115.9631 | 22.65 | True | 1.462234e+12
1503960366 | 5/3/2016 11:59:59 PM | 52.6 | 115.9631 | 22.65 | True | 1.462320e+12
1927972279 | 4/13/2016 1:08:52 AM | 133.5 | 294.3171 | 47.54 | False | 1.460510e+12
2873212765 | 4/21/2016 11:59:59 PM | 56.7 | 125.0021 | 21.45 | True | 1.461283e+12
2873212765 | 5/12/2016 11:59:59 PM | 57.3 | 126.3249 | 21.69 | True | 1.463098e+12
4319703577 | 4/17/2016 11:59:59 PM | 72.4 | 159.6147 | 27.45 | True | 1.460938e+12
# Identify specific columns in weight_log with missing values
colSums(is.na(weight_log))
# Drop the 'Fat' column due to excessive missing values
weight_log <- select(weight_log, -Fat)
# Confirm the column is removed
head(weight_log)
Column | Missing values
---|---
Id | 0
Date | 0
WeightKg | 0
WeightPounds | 0
BMI | 0
IsManualReport | 0
LogId | 0
Id | Date | WeightKg | WeightPounds | BMI | IsManualReport | LogId
---|---|---|---|---|---|---
1503960366 | 5/2/2016 11:59:59 PM | 52.6 | 115.9631 | 22.65 | True | 1.462234e+12
1503960366 | 5/3/2016 11:59:59 PM | 52.6 | 115.9631 | 22.65 | True | 1.462320e+12
1927972279 | 4/13/2016 1:08:52 AM | 133.5 | 294.3171 | 47.54 | False | 1.460510e+12
2873212765 | 4/21/2016 11:59:59 PM | 56.7 | 125.0021 | 21.45 | True | 1.461283e+12
2873212765 | 5/12/2016 11:59:59 PM | 57.3 | 126.3249 | 21.69 | True | 1.463098e+12
4319703577 | 4/17/2016 11:59:59 PM | 72.4 | 159.6147 | 27.45 | True | 1.460938e+12
Step 4: Explore User Coverage and Most Tracked Metrics¶
Let’s now explore how many unique users are represented in each dataset. This helps us understand which activities or measurements are most commonly tracked by users.
We’ll:

- Count distinct user IDs (Id) in each dataset
- Visualize the number of users per dataset
- Add short labels for easier plotting
# Define metric names and count of unique user IDs
tracked_metrics <- c("daily_activity", "daily_calories", "daily_intensities",
"daily_steps", "sleep", "heartrate_seconds", "weight")
unique_ids <- c(n_distinct(daily_activity$Id),
n_distinct(daily_calories$Id),
n_distinct(daily_intensities$Id),
n_distinct(daily_steps$Id),
n_distinct(day_sleep$Id),
n_distinct(heartrate_seconds$Id),
n_distinct(weight_log$Id))
# Create summary data frame
most_tracked_items <- data.frame(tracked_metrics, unique_ids)
# Add short labels for plotting
most_tracked_items$short_labels <- abbreviate(most_tracked_items$tracked_metrics, minlength = 5)
# Bar chart to visualize most tracked activities
ggplot(data = most_tracked_items, aes(x = short_labels, y = unique_ids)) +
geom_bar(stat = "identity", fill = "saddlebrown") +
geom_text(aes(label = unique_ids), vjust = 1.6, color = "white", size = 4) +
labs(title = "The Most Tracked Activities Among Users",
x = "Tracked Metrics",
y = "Number of Users") +
theme_minimal()
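The parallel vectors above work, but the metric names and counts can drift apart if one of them is edited. As a sketch of a safer pattern, a named list keeps them in sync (toy one-column frames stand in for the real datasets):

```r
# Count distinct Ids per dataset from a single named list
sets <- list(
  daily_activity = data.frame(Id = c(1, 1, 2)),
  sleep          = data.frame(Id = c(1, 2, 3))
)
unique_ids <- sapply(sets, function(d) length(unique(d$Id)))
unique_ids  # daily_activity = 2, sleep = 3
```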
Step 5: Feature Engineering — Add Total Minutes Column and Activity Summary¶
In this step, we:

- Create a new column total_minutes in the daily_activity dataset
- Summarize active minutes and calories
- Prepare for analyzing user behavior based on how long they wear the device
# Create a new column for total minutes worn per day
daily_activity <- daily_activity %>%
mutate(total_minutes = VeryActiveMinutes + LightlyActiveMinutes + FairlyActiveMinutes + SedentaryMinutes)
# Preview the new column
head(daily_activity %>% select(Id, total_minutes, VeryActiveMinutes, LightlyActiveMinutes, FairlyActiveMinutes, SedentaryMinutes))
# Calculate average minutes worn per day by each user
avg_mins_worn <- daily_activity %>%
group_by(Id) %>%
summarise(average_mins = mean(total_minutes))
# Convert minutes to hours and calculate percentage of the day
avg_mins_worn <- avg_mins_worn %>%
mutate(avg_hr_worn = average_mins / 60,
percentage_of_day = (avg_hr_worn / 24) * 100)
# View summary
summary(avg_mins_worn)
Id | total_minutes | VeryActiveMinutes | LightlyActiveMinutes | FairlyActiveMinutes | SedentaryMinutes |
---|---|---|---|---|---|
1503960366 | 1094 | 25 | 328 | 13 | 728 |
1503960366 | 1033 | 21 | 217 | 19 | 776 |
1503960366 | 1440 | 30 | 181 | 11 | 1218 |
1503960366 | 998 | 29 | 209 | 34 | 726 |
1503960366 | 1040 | 36 | 221 | 10 | 773 |
1503960366 | 761 | 38 | 164 | 20 | 539 |
Statistic | Id | average_mins | avg_hr_worn | percentage_of_day
---|---|---|---|---
Min. | 1.504e+09 | 911 | 15.18 | 63.26
1st Qu. | 2.347e+09 | 1035 | 17.25 | 71.88
Median | 4.445e+09 | 1323 | 22.06 | 91.91
Mean | 4.857e+09 | 1224 | 20.40 | 85.02
3rd Qu. | 6.962e+09 | 1419 | 23.64 | 98.52
Max. | 8.878e+09 | 1439 | 23.99 | 99.94
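The group_by() + summarise() step above can be mirrored in base R with aggregate(); a toy example of the per-user average (made-up values, not the real data):

```r
# Average total_minutes per Id (toy values)
d <- data.frame(Id = c(1, 1, 2), total_minutes = c(1000, 1200, 1440))
avg <- aggregate(total_minutes ~ Id, data = d, FUN = mean)
avg  # Id 1 averages 1100 minutes, Id 2 averages 1440
```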
Extra Step: Quantile Analysis of Percentage of Time Device Worn¶
The following code calculates quantiles of the percentage_of_day
variable, helping us understand how consistently users wore their Fitbit devices.
# Basic quantiles (min, 25th, 50th, 75th, max)
quantile(avg_mins_worn$percentage_of_day, probs = c(0, 0.25, 0.5, 0.75, 1))
# Specific percentiles (65th and 70th)
quantile(avg_mins_worn$percentage_of_day, probs = c(0.65, 0.70))
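With R's default quantile algorithm, probabilities 0, 0.5 and 1 return the minimum, median and maximum; a quick self-contained check with made-up percentages:

```r
# For five sorted values the 50% quantile is exactly the middle one
x <- c(63.3, 71.9, 91.9, 98.5, 99.9)
quantile(x, probs = c(0, 0.5, 1))  # 63.3, 91.9, 99.9
```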
head(avg_mins_worn)
Id | average_mins | avg_hr_worn | percentage_of_day |
---|---|---|---|
<dbl> | <dbl> | <dbl> | <dbl> |
1503960366 | 1125.968 | 18.76613 | 78.19220 |
1624580081 | 1425.710 | 23.76183 | 99.00762 |
1644430081 | 1371.267 | 22.85444 | 95.22685 |
1844505072 | 1323.484 | 22.05806 | 91.90860 |
1927972279 | 1358.097 | 22.63495 | 94.31228 |
2022484408 | 1425.677 | 23.76129 | 99.00538 |
# Quantiles of percentage_of_day
quantile(avg_mins_worn$percentage_of_day, probs = c(0, 0.25, 0.5, 0.75, 1))
Percentile | Value (%)
---|---
0% | 63.26
25% | 71.88
50% | 91.91
75% | 98.52
100% | 99.94
Quantile Analysis: Percentage of Time Device Was Worn¶
To understand how consistently users wore their Fitbit devices, we calculated the quantiles of the percentage_of_day
variable — which represents how much of the day (in %) each user wore their device on average.
📊 Quantile Results:¶
Percentile | Value (%) | Interpretation |
---|---|---|
0% (Min) | 63.26 | The lowest-coverage user still wore the device ~63% of the day |
25% | 71.88 | 25% of users wore their device less than 71.9% of the day |
50% (Median) | 91.91 | Half the users wore the device more than 91.9% of the day |
75% | 98.52 | 75% of users wore it less than 98.5%, while 25% wore it nearly all day |
100% (Max) | 99.94 | The most consistent user wore the device almost 100% of the time |
These results suggest that the majority of users were highly engaged, with over 50% wearing their devices more than 90% of the day.
This insight supports Bellabeat’s opportunity to build long-term engagement strategies based on user consistency.
Step 6: Sleep Pattern Analysis and Merging Datasets¶
In this step, we:

- Merge daily_activity and day_sleep on Id and Date
- Ensure both date columns have consistent formats
- Create new variables: total sleep hours, time in bed, time to fall asleep
- Categorize users into sleep pattern groups
# Rename date columns for merging
daily_activity <- daily_activity %>% rename(Date = ActivityDate)
day_sleep <- day_sleep %>% rename(Date = SleepDay)
# Ensure consistent date formats
daily_activity$Date <- as.Date(daily_activity$Date, format = "%m/%d/%Y")
day_sleep$Date <- as.Date(day_sleep$Date, format = "%m/%d/%Y %I:%M:%S %p")
# Merge datasets by Id and Date
combined_data <- left_join(daily_activity, day_sleep, by = c("Id", "Date"))
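base::merge() performs the same left join, which makes the NA-filling behaviour easy to see in isolation (toy frames; all.x = TRUE keeps activity rows that have no matching sleep record):

```r
# Id 2 has activity but no sleep record, so its sleep column becomes NA
act   <- data.frame(Id = c(1, 2), Date = as.Date("2016-04-12"), Steps = c(100, 200))
sleep <- data.frame(Id = 1, Date = as.Date("2016-04-12"), MinutesAsleep = 400)
joined <- merge(act, sleep, by = c("Id", "Date"), all.x = TRUE)
```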
# Remove duplicates and unnecessary columns
combined_data <- distinct(combined_data)
combined_data <- select(combined_data, -LoggedActivitiesDistance)
# Filter out rows without sleep data
combined_data_filtered <- combined_data %>% filter(!is.na(TotalSleepRecords))
# Create new variables for sleep analysis
combined_data_filtered <- combined_data_filtered %>%
mutate(Hourssleep = TotalMinutesAsleep / 60,
Hoursinbed = TotalTimeInBed / 60,
TofallAsleep = Hoursinbed - Hourssleep)
# Categorize users based on sleep duration
combined_data_filtered <- combined_data_filtered %>%
  mutate(SleepPatterns = case_when(
    Hourssleep < 6 ~ "Less Sleep",
    Hourssleep < 10 ~ "Enough Sleep",
    TRUE ~ "More Sleep"
  ))
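The same binning can be sketched with base R's cut(), which makes the interval boundaries explicit and gap-free (right = FALSE closes each interval on the left, so exactly 6 hours counts as Enough Sleep):

```r
# [0, 6) -> Less, [6, 10) -> Enough, [10, Inf) -> More
hours <- c(4.5, 7.2, 10.5)
cut(hours, breaks = c(0, 6, 10, Inf),
    labels = c("Less Sleep", "Enough Sleep", "More Sleep"),
    right = FALSE)
```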
# Summarise counts and percentages of each sleep pattern for the chart
Sleep_Patterns <- combined_data_filtered %>%
  count(SleepPatterns, name = "Count") %>%
  mutate(Percentage = round(Count / sum(Count) * 100))
# Reorder SleepPatterns factor levels for logical display
Sleep_Patterns$SleepPatterns <- factor(Sleep_Patterns$SleepPatterns,
                                       levels = c("Less Sleep", "Enough Sleep", "More Sleep"))
# Improved color-mapped pie chart
ggplot(Sleep_Patterns, aes(x = "", y = Percentage, fill = SleepPatterns)) +
geom_bar(stat = "identity", linewidth = 1, color = "gold") +
coord_polar("y") +
geom_text(aes(label = paste0(Percentage, "%")), position = position_stack(vjust = 0.5)) +
labs(title = "Sleep Pattern Distribution Among Users") +
scale_fill_manual(values = c("Less Sleep" = "#08306B",
"Enough Sleep" = "#6BAED6",
"More Sleep" = "#C6DBEF"),
name = "Daily Sleep Hours") +
theme_classic() +
theme(axis.line = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank(),
plot.title = element_text(hjust = 0.5, size = 14))
Sleep Pattern Distribution¶
This pie chart presents the distribution of users by sleep duration, categorized into:
- Less Sleep (< 6 hours) – shown in darkest blue
- Enough Sleep (6 to < 10 hours) – shown in medium blue
- More Sleep (≥ 10 hours) – shown in lightest blue
By mapping color intensity to sleep duration, the visualization intuitively highlights users who may benefit from improved sleep habits, a valuable insight for Bellabeat’s wellness campaigns.
Why This Visualization Is a Strong Addition¶
✅ Clear story in one view¶
- The chart shows that 71% of users get "Enough Sleep",
- 24% get "Less Sleep", and
- only 5% get "More Sleep".
- This highlights a clear behavioral pattern and sleep consistency among users.
✅ Supports marketing decisions¶
- Bellabeat could focus on the 24% of users with less sleep, offering:
- Wellness coaching
- Sleep reminder notifications
- Educational content about healthy sleep routines
✅ Good user experience (UX)¶
- The pie chart is simple and clean
- Percentages are clearly labeled
- Color intensity aligns with sleep duration, making the chart intuitive and easy to read
Step 7: Correlation Analysis — Steps, Calories, and Intensity¶
In this step, we examine how physical activity metrics relate to calories burned. This includes:
- Correlation between daily steps and calories
- Correlation between total active minutes and calories
- Visualizing these relationships with scatter plots and trend lines
We use the stat_cor() function from ggpubr to display Pearson correlation coefficients.
# Daily Steps vs. Calories
daily_activity %>%
ggplot(aes(x = TotalSteps, y = Calories)) +
geom_jitter(alpha = 0.5, color = "darkgreen") +
geom_smooth(method = "lm", se = TRUE, color = "black") +
stat_cor(method = "pearson") +
labs(title = "Daily Steps vs. Calories Burned",
x = "Total Daily Steps",
y = "Calories Burned")
# Total Minutes Worn vs. Calories
daily_activity %>%
ggplot(aes(x = total_minutes, y = Calories)) +
geom_jitter(alpha = 0.5, color = "steelblue") +
geom_smooth(method = "lm", se = TRUE, color = "black") +
stat_cor(method = "pearson") +
labs(title = "Total Minutes Worn vs. Calories Burned",
x = "Total Minutes Worn per Day",
y = "Calories Burned")
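The coefficient that stat_cor() prints on the plot is plain Pearson correlation, which can be reproduced with base R's cor() (toy vectors for illustration):

```r
# Pearson correlation between step counts and calories (toy data)
steps    <- c(2000, 5000, 8000, 12000)
calories <- c(1500, 1800, 2100, 2600)
cor(steps, calories, method = "pearson")  # strongly positive, close to 1
```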
Insights: Steps vs. Calories Burned¶
The scatter plot above reveals a statistically significant positive correlation between the number of daily steps and calories burned (R = 0.59, p < 0.001).
🔍 Interpretation:¶
- As users increase their step count, their calories burned also increase.
- This trend suggests that walking regularly has a meaningful impact on daily energy expenditure.
- The relationship appears fairly linear, especially between 2,000 and 15,000 steps per day.
💡 Implications for Bellabeat:¶
- Reinforces the value of promoting daily step goals to users.
- Bellabeat can create motivational programs or alerts encouraging consistent walking behavior to improve calorie burn and overall fitness.
# Combine hourly intensities and calories for later hourly-level analysis
Com_hourlyCaloriesIntensities <- left_join(hourly_intensities, hourly_calories, by = c("Id", "ActivityHour"))
# Categorize users based on BMI (standard cut-offs, written gap-free so
# values such as 24.95 are not dropped)
weight_log <- weight_log %>%
  mutate(bmi_status = case_when(
    BMI < 18.5 ~ "Underweight",
    BMI < 25 ~ "Healthy weight",
    BMI < 30 ~ "Overweight",
    TRUE ~ "Obese"
  ))
# Group and count users by BMI category
user_type_bmi <- weight_log %>%
group_by(bmi_status) %>%
summarise(count = n())
# Add percentage column
user_type_bmi <- user_type_bmi %>%
mutate(percentage = round((count / sum(count)) * 100))
# Pie chart
ggplot(user_type_bmi, aes(x = "", y = percentage, fill = bmi_status)) +
geom_bar(stat = "identity", linewidth = 1, color = "white") +
coord_polar("y") +
geom_text(aes(label = paste0(percentage, "%")), position = position_stack(vjust = 0.5)) +
labs(title = "BMI Status Distribution Among Users") +
scale_fill_brewer(palette = "Set3", name = "BMI Category") +
theme_classic() +
theme(axis.line = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank(),
plot.title = element_text(hjust = 0.5, size = 14))