Step 1: Load Required Libraries

In this step, we load all the necessary R libraries that we will use throughout the analysis.

These include:

  • tidyverse for data manipulation and visualization
  • ggplot2 and ggpubr for creating plots and showing correlations
  • lubridate for handling date-time formats
  • janitor for cleaning column names
  • dplyr and tidyr for tidy data operations

Make sure to install any missing packages using install.packages() before loading them.
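Calling install.packages() on a package that is already loaded only produces an "in use and will not be installed" warning. A minimal sketch that lists just the packages still missing, assuming a hypothetical helper named `missing_packages()` (it is not part of the original notebook):

```r
# Hypothetical helper: return the packages from `pkgs` that are not installed.
# requireNamespace() returns FALSE instead of failing when a package is absent.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

required <- c("tidyverse", "ggpubr", "lubridate", "janitor")
to_install <- missing_packages(required)
to_install

# Uncomment to install whatever is missing:
# if (length(to_install) > 0) install.packages(to_install)
```

The install call is left commented out so that running the cell never triggers a download by accident.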

# Install missing packages (run only if needed)
In [2]:
install.packages("tidyverse")
install.packages("ggpubr")
install.packages("lubridate")
install.packages("janitor")
Warning message:
"package 'tidyverse' is in use and will not be installed"
Warning message:
"package 'ggpubr' is in use and will not be installed"
Warning message:
"package 'lubridate' is in use and will not be installed"
Warning message:
"package 'janitor' is in use and will not be installed"
In [1]:
# Load core packages
library(tidyverse)    # For data manipulation and visualization
library(ggplot2)      # For plotting
library(ggpubr)       # For correlation and advanced plots
library(tidyr)        # For tidy data formatting
library(dplyr)        # For wrangling
library(lubridate)    # For working with date-time
library(janitor)      # To clean column names
Warning message:
"package 'tidyverse' was built under R version 4.3.3"
Warning message:
"package 'ggplot2' was built under R version 4.3.3"
Warning message:
"package 'tidyr' was built under R version 4.3.3"
Warning message:
"package 'readr' was built under R version 4.3.3"
Warning message:
"package 'purrr' was built under R version 4.3.1"
Warning message:
"package 'dplyr' was built under R version 4.3.3"
Warning message:
"package 'forcats' was built under R version 4.3.1"
Warning message:
"package 'lubridate' was built under R version 4.3.3"
── Attaching core tidyverse packages ───────────────────────────────────────────────────────────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ─────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Warning message:
"package 'janitor' was built under R version 4.3.3"

Attaching package: 'janitor'


The following objects are masked from 'package:stats':

    chisq.test, fisher.test


Step 2: Import the Datasets

In this step, we load all the Fitbit data files collected between March 12, 2016 and April 11, 2016. These include daily, hourly, and minute-level datasets covering activities, calories, steps, sleep, heart rate, and weight logs.

⚠️ Make sure the file paths are correctly set based on your local file system.
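One way to keep the paths manageable is to store the folder once and build each file path from it. A minimal sketch, assuming a hypothetical `data_dir` variable and `fitbit_path()` helper (neither appears in the original notebook); the folder shown is a placeholder to adjust locally:

```r
# Hypothetical: folder that contains the *_merged.csv files (placeholder path)
data_dir <- "path/to/Fitabase Data 3.12.16-4.11.16/Fitbits"

# Hypothetical helper: build the full path for one or more dataset names
fitbit_path <- function(name, dir = data_dir) {
  file.path(dir, paste0(name, "_merged.csv"))
}

fitbit_path("dailyActivity")

# Check which files exist before trying to read them
files <- c("dailyActivity", "sleepDay", "weightLogInfo")
file.exists(fitbit_path(files))
```

With this in place, a read call could take the form `read.csv(fitbit_path("dailySteps"))`, and moving the data only requires changing `data_dir`.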

In [4]:
# Daily data
daily_activity <- read_csv("C:/Users/Soma/Desktop/Coursera-Projects/Case Study 2/Fitabase Data 3.12.16-4.11.16/Fitbits/dailyActivity_merged.csv")
daily_calories <- read.csv("C:/Users/Soma/Desktop/Coursera-Projects/Case Study 2/Fitabase Data 3.12.16-4.11.16/Fitbits/dailyCalories_merged.csv")
daily_intensities <- read.csv("C:/Users/Soma/Desktop/Coursera-Projects/Case Study 2/Fitabase Data 3.12.16-4.11.16/Fitbits/dailyIntensities_merged.csv")
daily_steps <- read.csv("C:/Users/Soma/Desktop/Coursera-Projects/Case Study 2/Fitabase Data 3.12.16-4.11.16/Fitbits/dailySteps_merged.csv")

# Hourly data
hourly_calories <- read.csv("C:/Users/Soma/Desktop/Coursera-Projects/Case Study 2/Fitabase Data 3.12.16-4.11.16/Fitbits/hourlyCalories_merged.csv")
hourly_intensities <- read.csv("C:/Users/Soma/Desktop/Coursera-Projects/Case Study 2/Fitabase Data 3.12.16-4.11.16/Fitbits/hourlyIntensities_merged.csv")
hourly_steps <- read.csv("C:/Users/Soma/Desktop/Coursera-Projects/Case Study 2/Fitabase Data 3.12.16-4.11.16/Fitbits/hourlySteps_merged.csv")

# Minute-level data
minute_calories <- read.csv("C:/Users/Soma/Desktop/Coursera-Projects/Case Study 2/Fitabase Data 3.12.16-4.11.16/Fitbits/minuteCaloriesNarrow_merged.csv")
minute_intensities <- read.csv("C:/Users/Soma/Desktop/Coursera-Projects/Case Study 2/Fitabase Data 3.12.16-4.11.16/Fitbits/minuteIntensitiesNarrow_merged.csv")
minute_MET <- read.csv("C:/Users/Soma/Desktop/Coursera-Projects/Case Study 2/Fitabase Data 3.12.16-4.11.16/Fitbits/minuteMETsNarrow_merged.csv")
minute_sleep <- read.csv("C:/Users/Soma/Desktop/Coursera-Projects/Case Study 2/Fitabase Data 3.12.16-4.11.16/Fitbits/minuteSleep_merged.csv")
minute_steps <- read.csv("C:/Users/Soma/Desktop/Coursera-Projects/Case Study 2/Fitabase Data 3.12.16-4.11.16/Fitbits/minuteStepsNarrow_merged.csv")

# Other data
heartrate_seconds <- read.csv("C:/Users/Soma/Desktop/Coursera-Projects/Case Study 2/Fitabase Data 3.12.16-4.11.16/Fitbits/heartrate_seconds_merged.csv")
day_sleep <- read.csv("C:/Users/Soma/Desktop/Coursera-Projects/Case Study 2/Fitabase Data 3.12.16-4.11.16/Fitbits/sleepDay_merged.csv")
weight_log <- read.csv("C:/Users/Soma/Desktop/Coursera-Projects/Case Study 2/Fitabase Data 3.12.16-4.11.16/Fitbits/weightLogInfo_merged.csv")
Rows: 940 Columns: 15
── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): ActivityDate
dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Step 3: Cleaning and Inspecting the Data

Now that we have imported the datasets, we will:

  • Inspect the structure of the datasets
  • Convert date columns to proper date formats
  • Identify and remove duplicates
  • Check for missing values (NAs)
In [6]:
### 🧹 Clean and inspect key datasets:

# Preview the daily activity data
head(daily_activity)

# Check structure to identify incorrect data types (e.g., ActivityDate as character)
str(daily_activity)

# Convert date columns: daily files to Date, sub-daily files to POSIXct
# (parsing daily values straight to Date can avoid a time-zone shift when as.Date() is applied later)
daily_activity$ActivityDate <- as.Date(daily_activity$ActivityDate, format = "%m/%d/%Y")
daily_calories$ActivityDay <- as.Date(daily_calories$ActivityDay, format = "%m/%d/%Y")
daily_intensities$ActivityDay <- as.Date(daily_intensities$ActivityDay, format = "%m/%d/%Y")
daily_steps$ActivityDay <- as.Date(daily_steps$ActivityDay, format = "%m/%d/%Y")
heartrate_seconds$Time <- as.POSIXct(heartrate_seconds$Time, format = "%m/%d/%Y %I:%M:%S %p", tz = Sys.timezone())
hourly_calories$ActivityHour <- as.POSIXct(hourly_calories$ActivityHour, format = "%m/%d/%Y %I:%M:%S %p", tz = Sys.timezone())
hourly_intensities$ActivityHour <- as.POSIXct(hourly_intensities$ActivityHour, format = "%m/%d/%Y %I:%M:%S %p", tz = Sys.timezone())
hourly_steps$ActivityHour <- as.POSIXct(hourly_steps$ActivityHour, format = "%m/%d/%Y %I:%M:%S %p", tz = Sys.timezone())
A tibble: 6 × 15
        Id  ActivityDate  TotalSteps  TotalDistance  TrackerDistance
     <dbl>  <chr>              <dbl>          <dbl>            <dbl>
1503960366  4/12/2016          13162           8.50             8.50
1503960366  4/13/2016          10735           6.97             6.97
1503960366  4/14/2016          10460           6.74             6.74
1503960366  4/15/2016           9762           6.28             6.28
1503960366  4/16/2016          12669           8.16             8.16
1503960366  4/17/2016           9705           6.48             6.48

LoggedActivitiesDistance  VeryActiveDistance  ModeratelyActiveDistance  LightActiveDistance  SedentaryActiveDistance
                   <dbl>               <dbl>                     <dbl>                <dbl>                    <dbl>
                       0                1.88                      0.55                 6.06                        0
                       0                1.57                      0.69                 4.71                        0
                       0                2.44                      0.40                 3.91                        0
                       0                2.14                      1.26                 2.83                        0
                       0                2.71                      0.41                 5.04                        0
                       0                3.19                      0.78                 2.51                        0

VeryActiveMinutes  FairlyActiveMinutes  LightlyActiveMinutes  SedentaryMinutes  Calories
            <dbl>                <dbl>                 <dbl>             <dbl>     <dbl>
               25                   13                   328               728      1985
               21                   19                   217               776      1797
               30                   11                   181              1218      1776
               29                   34                   209               726      1745
               36                   10                   221               773      1863
               38                   20                   164               539      1728
spc_tbl_ [940 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Id                      : num [1:940] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
 $ ActivityDate            : chr [1:940] "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
 $ TotalSteps              : num [1:940] 13162 10735 10460 9762 12669 ...
 $ TotalDistance           : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
 $ TrackerDistance         : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
 $ LoggedActivitiesDistance: num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
 $ VeryActiveDistance      : num [1:940] 1.88 1.57 2.44 2.14 2.71 ...
 $ ModeratelyActiveDistance: num [1:940] 0.55 0.69 0.4 1.26 0.41 ...
 $ LightActiveDistance     : num [1:940] 6.06 4.71 3.91 2.83 5.04 ...
 $ SedentaryActiveDistance : num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
 $ VeryActiveMinutes       : num [1:940] 25 21 30 29 36 38 42 50 28 19 ...
 $ FairlyActiveMinutes     : num [1:940] 13 19 11 34 10 20 16 31 12 8 ...
 $ LightlyActiveMinutes    : num [1:940] 328 217 181 209 221 164 233 264 205 211 ...
 $ SedentaryMinutes        : num [1:940] 728 776 1218 726 773 ...
 $ Calories                : num [1:940] 1985 1797 1776 1745 1863 ...
 - attr(*, "spec")=
  .. cols(
  ..   Id = col_double(),
  ..   ActivityDate = col_character(),
  ..   TotalSteps = col_double(),
  ..   TotalDistance = col_double(),
  ..   TrackerDistance = col_double(),
  ..   LoggedActivitiesDistance = col_double(),
  ..   VeryActiveDistance = col_double(),
  ..   ModeratelyActiveDistance = col_double(),
  ..   LightActiveDistance = col_double(),
  ..   SedentaryActiveDistance = col_double(),
  ..   VeryActiveMinutes = col_double(),
  ..   FairlyActiveMinutes = col_double(),
  ..   LightlyActiveMinutes = col_double(),
  ..   SedentaryMinutes = col_double(),
  ..   Calories = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 
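The format strings used in the conversions above can be checked on single values before applying them to whole columns. A small sketch with made-up timestamps in the same layout as the Fitbit files; `tz = "UTC"` is an assumption made here so the result does not depend on the local machine:

```r
# Daily files store dates as month/day/year
d <- as.Date("4/12/2016", format = "%m/%d/%Y")
d

# Hourly files add a 12-hour clock time with AM/PM
h <- as.POSIXct("4/12/2016 2:00:00 PM",
                format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")
format(h, "%Y-%m-%d %H:%M:%S")

# A format string that does not match yields NA rather than an error,
# so it is worth counting NAs again after converting
bad <- as.Date("2016-04-12", format = "%m/%d/%Y")
is.na(bad)
```

Because a mismatched format fails silently, re-running the missing-value check after conversion is a cheap safeguard.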
In [8]:
# Check for duplicates in key datasets
sum(duplicated(daily_activity))
sum(duplicated(daily_calories))
sum(duplicated(daily_intensities))
sum(duplicated(daily_steps))
sum(duplicated(day_sleep))
sum(duplicated(heartrate_seconds))
sum(duplicated(hourly_calories))
sum(duplicated(hourly_intensities))
sum(duplicated(hourly_steps))
sum(duplicated(weight_log))

# Remove duplicates if found
day_sleep <- distinct(day_sleep)
0
0
0
0
0
0
0
0
0
0
In [19]:
# Check for missing values
sum(is.na(daily_activity))
sum(is.na(daily_calories))
sum(is.na(daily_intensities))
sum(is.na(daily_steps))
sum(is.na(day_sleep))
sum(is.na(heartrate_seconds))
sum(is.na(weight_log))
0
0
0
0
0
0
0
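The repeated duplicate and missing-value checks above can also be collected into one labeled summary, which makes the rows of zeros easier to read. A sketch on made-up stand-in data; the `datasets` list and `quality` table are illustrative names, not objects from the notebook:

```r
# Toy stand-ins for the real datasets (made-up values)
datasets <- list(
  steps = data.frame(Id = c(1, 1, 2), Steps = c(100, 100, NA)),
  sleep = data.frame(Id = c(1, 2), Minutes = c(400, 380))
)

# One row per dataset: number of duplicated rows and number of NA cells
quality <- data.frame(
  dataset    = names(datasets),
  duplicates = vapply(datasets, function(d) sum(duplicated(d)), integer(1)),
  missing    = vapply(datasets, function(d) sum(is.na(d)), integer(1))
)
quality
```

With the real data, the list would hold the imported data frames, for example `list(daily_activity = daily_activity, day_sleep = day_sleep)`.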
In [15]:
head(weight_log)
A data.frame: 6 × 7
           Id  Date                   WeightKg  WeightPounds    BMI  IsManualReport         LogId
        <dbl>  <chr>                     <dbl>         <dbl>  <dbl>  <chr>                  <dbl>
1  1503960366  5/2/2016 11:59:59 PM       52.6      115.9631  22.65  True            1.462234e+12
2  1503960366  5/3/2016 11:59:59 PM       52.6      115.9631  22.65  True            1.462320e+12
3  1927972279  4/13/2016 1:08:52 AM      133.5      294.3171  47.54  False           1.460510e+12
4  2873212765  4/21/2016 11:59:59 PM      56.7      125.0021  21.45  True            1.461283e+12
5  2873212765  5/12/2016 11:59:59 PM      57.3      126.3249  21.69  True            1.463098e+12
6  4319703577  4/17/2016 11:59:59 PM      72.4      159.6147  27.45  True            1.460938e+12
In [21]:
# Identify specific columns in weight_log with missing values
colSums(is.na(weight_log))

# Drop the 'Fat' column due to excessive missing values
# (any_of() keeps this cell re-runnable after the column has already been removed)
weight_log <- select(weight_log, -any_of("Fat"))

# Confirm the column is removed
head(weight_log)
            Id           Date       WeightKg   WeightPounds            BMI IsManualReport          LogId
             0              0              0              0              0              0              0
A data.frame: 6 × 7
           Id  Date                   WeightKg  WeightPounds    BMI  IsManualReport         LogId
        <dbl>  <chr>                     <dbl>         <dbl>  <dbl>  <chr>                  <dbl>
1  1503960366  5/2/2016 11:59:59 PM       52.6      115.9631  22.65  True            1.462234e+12
2  1503960366  5/3/2016 11:59:59 PM       52.6      115.9631  22.65  True            1.462320e+12
3  1927972279  4/13/2016 1:08:52 AM      133.5      294.3171  47.54  False           1.460510e+12
4  2873212765  4/21/2016 11:59:59 PM      56.7      125.0021  21.45  True            1.461283e+12
5  2873212765  5/12/2016 11:59:59 PM      57.3      126.3249  21.69  True            1.463098e+12
6  4319703577  4/17/2016 11:59:59 PM      72.4      159.6147  27.45  True            1.460938e+12

Step 4: Explore User Coverage and Most Tracked Metrics

Let’s now explore how many unique users are represented in each dataset. This helps us understand which activities or measurements are most commonly tracked by users.

We’ll:

  • Count distinct user IDs (Id) in each dataset
  • Visualize the number of users per dataset
  • Add short labels for easier plotting
In [22]:
# Define metric names and count of unique user IDs
tracked_metrics <- c("daily_activity", "daily_calories", "daily_intensities", 
                     "daily_steps", "sleep", "heartrate_seconds", "weight")

unique_ids <- c(n_distinct(daily_activity$Id),
                n_distinct(daily_calories$Id),
                n_distinct(daily_intensities$Id),
                n_distinct(daily_steps$Id),
                n_distinct(day_sleep$Id),
                n_distinct(heartrate_seconds$Id),
                n_distinct(weight_log$Id))

# Create summary data frame
most_tracked_items <- data.frame(tracked_metrics, unique_ids)

# Add short labels for plotting
most_tracked_items$short_labels <- abbreviate(most_tracked_items$tracked_metrics, minlength = 5)

# Bar chart to visualize most tracked activities
ggplot(data = most_tracked_items, aes(x = short_labels, y = unique_ids)) +
  geom_bar(stat = "identity", fill = "saddlebrown") +
  geom_text(aes(label = unique_ids), vjust = 1.6, color = "white", size = 4) +
  labs(title = "The Most Tracked Activities Among Users",
       x = "Tracked Metrics",
       y = "Number of Users") +
  theme_minimal()
[Figure: bar chart "The Most Tracked Activities Among Users" (x: Tracked Metrics, y: Number of Users)]
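The same coverage table can be built from a named list, so that each label and its count stay paired automatically. A sketch on made-up IDs; `datasets` and `coverage` are illustrative names, not objects from the notebook:

```r
# Toy stand-ins: each data frame only needs an Id column (made-up values)
datasets <- list(
  daily_activity = data.frame(Id = c(1, 1, 2, 3)),
  sleep          = data.frame(Id = c(1, 2, 2)),
  weight         = data.frame(Id = c(3))
)

# Count distinct users per dataset
users <- vapply(datasets, function(d) length(unique(d$Id)), integer(1))

# Summary table, most-tracked metric first
coverage <- data.frame(tracked_metrics = names(users),
                       unique_ids = unname(users))
coverage <- coverage[order(-coverage$unique_ids), ]
coverage
```

Sorting before plotting is a design choice: it puts the most widely tracked metric first instead of leaving the order to the alphabet.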

Step 5: Feature Engineering - Add Total Minutes Column and Activity Summary

In this step, we:

  • Create a new column total_minutes in the daily_activity dataset
  • Summarize active minutes and calories
  • Prepare for analyzing user behavior based on how long they wear the device
In [23]:
# Create a new column for total minutes worn per day
daily_activity <- daily_activity %>%
  mutate(total_minutes = VeryActiveMinutes + LightlyActiveMinutes + FairlyActiveMinutes + SedentaryMinutes)

# Preview the new column
head(daily_activity %>% select(Id, total_minutes, VeryActiveMinutes, LightlyActiveMinutes, FairlyActiveMinutes, SedentaryMinutes))

# Calculate average minutes worn per day by each user
avg_mins_worn <- daily_activity %>%
  group_by(Id) %>%
  summarise(average_mins = mean(total_minutes))

# Convert minutes to hours and calculate percentage of the day
avg_mins_worn <- avg_mins_worn %>%
  mutate(avg_hr_worn = average_mins / 60,
         percentage_of_day = (avg_hr_worn / 24) * 100)

# View summary
summary(avg_mins_worn)
A tibble: 6 × 6
        Id  total_minutes  VeryActiveMinutes  LightlyActiveMinutes  FairlyActiveMinutes  SedentaryMinutes
     <dbl>          <dbl>              <dbl>                 <dbl>                <dbl>             <dbl>
1503960366           1094                 25                   328                   13               728
1503960366           1033                 21                   217                   19               776
1503960366           1440                 30                   181                   11              1218
1503960366            998                 29                   209                   34               726
1503960366           1040                 36                   221                   10               773
1503960366            761                 38                   164                   20               539
       Id             average_mins   avg_hr_worn    percentage_of_day
 Min.   :1.504e+09   Min.   : 911   Min.   :15.18   Min.   :63.26    
 1st Qu.:2.347e+09   1st Qu.:1035   1st Qu.:17.25   1st Qu.:71.88    
 Median :4.445e+09   Median :1323   Median :22.06   Median :91.91    
 Mean   :4.857e+09   Mean   :1224   Mean   :20.40   Mean   :85.02    
 3rd Qu.:6.962e+09   3rd Qu.:1419   3rd Qu.:23.64   3rd Qu.:98.52    
 Max.   :8.878e+09   Max.   :1439   Max.   :23.99   Max.   :99.94    
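As a worked check of the conversion, the first previewed row (25 very active, 328 lightly active, 13 fairly active, and 728 sedentary minutes) can be computed by hand. The variable names below are illustrative only:

```r
# Minute counts from the first previewed row of daily_activity
very_active    <- 25
lightly_active <- 328
fairly_active  <- 13
sedentary      <- 728

# Same arithmetic as the mutate() calls above, for a single day
total_minutes     <- very_active + lightly_active + fairly_active + sedentary
hours_worn        <- total_minutes / 60
percentage_of_day <- hours_worn / 24 * 100

round(c(total_minutes = total_minutes,
        hours_worn = hours_worn,
        percentage_of_day = percentage_of_day), 2)
```

A full day has 1,440 minutes, so a day with that total corresponds to 100%.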

Extra Step: Quantile Analysis of Percentage of Time Device Worn

The following code calculates quantiles of the percentage_of_day variable, helping us understand how consistently users wore their Fitbit devices.

# Basic quantiles (min, 25th, 50th, 75th, max)
quantile(avg_mins_worn$percentage_of_day, probs = c(0, 0.25, 0.5, 0.75, 1))

# Specific percentiles (65th and 70th)
quantile(avg_mins_worn$percentage_of_day, probs = c(0.65, 0.70))
In [30]:
head (avg_mins_worn)
A tibble: 6 × 4
        Id  average_mins  avg_hr_worn  percentage_of_day
     <dbl>         <dbl>        <dbl>              <dbl>
1503960366      1125.968     18.76613           78.19220
1624580081      1425.710     23.76183           99.00762
1644430081      1371.267     22.85444           95.22685
1844505072      1323.484     22.05806           91.90860
1927972279      1358.097     22.63495           94.31228
2022484408      1425.677     23.76129           99.00538
In [31]:
# Quantiles of percentage_of_day
quantile(avg_mins_worn$percentage_of_day, probs = c(0, 0.25, 0.5, 0.75, 1))
              0%              25%              50%              75%             100%
63.2616487455197 71.8794802867384 91.9086021505376 98.5208333333333  99.937865497076

Quantile Analysis: Percentage of Time Device Was Worn

To understand how consistently users wore their Fitbit devices, we calculated the quantiles of the percentage_of_day variable — which represents how much of the day (in %) each user wore their device on average.

📊 Quantile Results:

Percentile     Value (%)   Interpretation
0% (Min)       63.26       The least consistent user wore the device about 63% of the day
25%            71.88       25% of users wore their device less than 71.9% of the day
50% (Median)   91.91       Half the users wore the device more than 91.9% of the day
75%            98.52       75% of users wore it less than 98.5% of the day, while 25% wore it nearly all day
100% (Max)     99.94       The most consistent user wore the device almost 100% of the time

These results suggest that most users were highly engaged: the median user wore the device about 92% of the day, so at least half of the users exceeded 90%.

This insight supports Bellabeat’s opportunity to build long-term engagement strategies based on user consistency.
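The link between the quantiles and the "share of users above 90%" reading can be sketched on a small made-up vector. The values below are hypothetical and are not the study data:

```r
# Made-up wear percentages for ten hypothetical users
pct <- c(63, 70, 72, 85, 91, 93, 97, 98, 99, 100)

# Quartiles, as in the analysis above (R's default interpolation, type 7)
quantile(pct, probs = c(0, 0.25, 0.5, 0.75, 1))

# Share of users who wore the device more than 90% of the day
share_above_90 <- mean(pct > 90)
share_above_90
```

Taking the mean of a logical vector gives the proportion of TRUE values, which answers the "how many users exceed a threshold" question directly instead of inferring it from the median.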

Step 6: Sleep Pattern Analysis and Merging Datasets

In this step, we:

  • Merge daily_activity and day_sleep on Id and Date
  • Ensure both date columns have consistent formats
  • Create new variables: total sleep hours, time in bed, time to fall asleep
  • Categorize users into sleep pattern groups
In [32]:
# Rename date columns for merging
daily_activity <- daily_activity %>% rename(Date = ActivityDate)
day_sleep <- day_sleep %>% rename(Date = SleepDay)

# Ensure consistent date formats
daily_activity$Date <- as.Date(daily_activity$Date, format = "%m/%d/%Y")
day_sleep$Date <- as.Date(day_sleep$Date, format = "%m/%d/%Y %I:%M:%S %p")

# Merge datasets by Id and Date
combined_data <- left_join(daily_activity, day_sleep, by = c("Id", "Date"))

# Remove duplicates and unnecessary columns
combined_data <- distinct(combined_data)
combined_data <- select(combined_data, -LoggedActivitiesDistance)

# Filter out rows without sleep data
combined_data_filtered <- combined_data %>% filter(!is.na(TotalSleepRecords))

# Create new variables for sleep analysis
# (TofallAsleep is time in bed minus time asleep, so it covers all awake time in bed)
combined_data_filtered <- combined_data_filtered %>%
  mutate(Hourssleep = TotalMinutesAsleep / 60,
         Hoursinbed = TotalTimeInBed / 60,
         TofallAsleep = Hoursinbed - Hourssleep)

# Categorize each sleep record by hours asleep
# (half-open ranges so that no value can fall between two categories)
combined_data_filtered <- combined_data_filtered %>%
  mutate(SleepPatterns = case_when(
    Hourssleep < 6 ~ "Less Sleep",
    Hourssleep < 10 ~ "Enough Sleep",
    Hourssleep >= 10 ~ "More Sleep"
  ))
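The same three-way split can be sketched with base R's cut(), shown here as an alternative technique rather than the notebook's method. The hours below are made-up values chosen to sit on the category boundaries:

```r
# Made-up sleep durations in hours, including the boundary values 6 and 10
hours <- c(5.5, 6, 9.99, 10, 12.3)

# right = FALSE makes each interval [lower, upper), so exactly 6 hours
# counts as "Enough Sleep" and exactly 10 hours as "More Sleep"
sleep_pattern <- cut(hours,
                     breaks = c(-Inf, 6, 10, Inf),
                     labels = c("Less Sleep", "Enough Sleep", "More Sleep"),
                     right = FALSE)

# Counts and rounded percentages per category
table(sleep_pattern)
round(prop.table(table(sleep_pattern)) * 100)
```

cut() returns a factor whose levels already follow the order of `labels`, which is convenient when the categories are plotted later.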
In [34]:
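# Sleep_Patterns is not created in the cells shown above; it is assumed here
# to hold the percentage of sleep records in each category
Sleep_Patterns <- combined_data_filtered %>%
  count(SleepPatterns, name = "Count") %>%
  mutate(Percentage = round(Count / sum(Count) * 100))

# Reorder SleepPatterns factor levels for logical display
Sleep_Patterns$SleepPatterns <- factor(Sleep_Patterns$SleepPatterns,
                                       levels = c("Less Sleep", "Enough Sleep", "More Sleep"))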

# Improved color-mapped pie chart
ggplot(Sleep_Patterns, aes(x = "", y = Percentage, fill = SleepPatterns)) +
  geom_bar(stat = "identity", linewidth = 1, color = "gold") +
  coord_polar("y") +
  geom_text(aes(label = paste0(Percentage, "%")), position = position_stack(vjust = 0.5)) +
  labs(title = "Sleep Pattern Distribution Among Users") +
  scale_fill_manual(values = c("Less Sleep" = "#08306B", 
                               "Enough Sleep" = "#6BAED6", 
                               "More Sleep" = "#C6DBEF"),
                    name = "Daily Sleep Hours") +
  theme_classic() +
  theme(axis.line = element_blank(),
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        plot.title = element_text(hjust = 0.5, size = 14))
[Figure: pie chart "Sleep Pattern Distribution Among Users" (legend: Daily Sleep Hours)]

Sleep Pattern Distribution

This pie chart presents the distribution of users by sleep duration, categorized into:

  • Less Sleep (under 6 hours) – shown in darkest blue
  • Enough Sleep (6 to under 10 hours) – shown in medium blue
  • More Sleep (≥ 10 hours) – shown in lightest blue

By mapping color intensity to sleep quality, the visualization more intuitively highlights users who may benefit from improved sleep habits — a valuable insight for Bellabeat’s wellness campaigns.

Why This Visualization Is a Strong Addition

✅ Clear story in one view

  • The chart shows that 71% of users get "Enough Sleep",
  • 24% get "Less Sleep", and
  • only 5% get "More Sleep".
  • This highlights a clear behavioral pattern and sleep consistency among users.

✅ Supports marketing decisions

  • Bellabeat could focus on the 24% of users with less sleep, offering:
    • Wellness coaching
    • Sleep reminder notifications
    • Educational content about healthy sleep routines

✅ Good user experience (UX)

  • The pie chart is simple and clean
  • Percentages are clearly labeled
  • Color intensity aligns with sleep quality, making the chart intuitive and easy to read

Step 7: Correlation Analysis - Steps, Calories, and Intensity

In this step, we examine how physical activity metrics relate to calories burned. This includes:

  • Correlation between daily steps and calories
  • Correlation between total minutes worn (active plus sedentary) and calories
  • Visualizing these relationships with scatter plots and trend lines

We use the stat_cor() function from ggpubr to display Pearson correlation coefficients.
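The coefficient and p-value reported on the plots can also be computed directly with base R's cor.test(). A sketch on made-up numbers (not the Fitbit data):

```r
# Made-up daily values for five hypothetical days
steps    <- c(2000, 4000, 6000, 8000, 10000)
calories <- c(1800, 2000, 2100, 2000, 2100)

# Pearson correlation test: estimate is r, p.value tests r = 0
result <- cor.test(steps, calories, method = "pearson")

r <- unname(result$estimate)
p <- result$p.value
round(c(r = r, p = p), 3)
```

On the real data, the equivalent call would be `cor.test(daily_activity$TotalSteps, daily_activity$Calories)`, which gives the numbers without having to read them off a figure.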

In [35]:
# Daily Steps vs. Calories
daily_activity %>%
  ggplot(aes(x = TotalSteps, y = Calories)) +
  geom_jitter(alpha = 0.5, color = "darkgreen") +
  geom_smooth(method = "lm", se = TRUE, color = "black") +
  stat_cor(method = "pearson") +
  labs(title = "Daily Steps vs. Calories Burned",
       x = "Total Daily Steps",
       y = "Calories Burned")

# Total Minutes Worn vs. Calories
daily_activity %>%
  ggplot(aes(x = total_minutes, y = Calories)) +
  geom_jitter(alpha = 0.5, color = "steelblue") +
  geom_smooth(method = "lm", se = TRUE, color = "black") +
  stat_cor(method = "pearson") +
  labs(title = "Total Minutes Worn vs. Calories Burned",
       x = "Total Minutes Worn per Day",
       y = "Calories Burned")
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'
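[Figure: scatter plot "Daily Steps vs. Calories Burned" (x: Total Daily Steps, y: Calories Burned) with linear trend line]
[Figure: scatter plot "Total Minutes Worn vs. Calories Burned" (x: Total Minutes Worn per Day, y: Calories Burned) with linear trend line]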

Insights: Steps vs. Calories Burned

The scatter plot above reveals a statistically significant positive correlation between the number of daily steps and calories burned (R = 0.59, p < 0.001).

🔍 Interpretation:

  • Days with higher step counts tend to show higher calories burned.
  • The trend is consistent with regular walking contributing to daily energy expenditure, although a correlation alone does not establish cause.
  • The relationship appears fairly linear, especially between 2,000 and 15,000 steps per day.

💡 Implications for Bellabeat:

  • Reinforces the value of promoting daily step goals to users.
  • Bellabeat can create motivational programs or alerts encouraging consistent walking behavior to improve calorie burn and overall fitness.
In [38]:
Com_hourlyCaloriesIntensities <- left_join(hourly_intensities, hourly_calories, by = c("Id", "ActivityHour"))
In [39]:
# Categorize each weight log entry by BMI
# (half-open ranges so that values such as 24.95 are not left uncategorized)
weight_log <- weight_log %>%
  mutate(bmi_status = case_when(
    BMI < 18.5 ~ "Underweight",
    BMI < 25 ~ "Healthy weight",
    BMI < 30 ~ "Overweight",
    BMI >= 30 ~ "Obese"
  ))

# Count weight log entries per BMI category
# (a user with several entries is counted once per entry)
user_type_bmi <- weight_log %>%
  group_by(bmi_status) %>%
  summarise(count = n())

# Add percentage column
user_type_bmi <- user_type_bmi %>%
  mutate(percentage = round((count / sum(count)) * 100))

# Pie chart
ggplot(user_type_bmi, aes(x = "", y = percentage, fill = bmi_status)) +
  geom_bar(stat = "identity", linewidth = 1, color = "white") +
  coord_polar("y") +
  geom_text(aes(label = paste0(percentage, "%")), position = position_stack(vjust = 0.5)) +
  labs(title = "BMI Status Distribution Among Users") +
  scale_fill_brewer(palette = "Set3", name = "BMI Category") +
  theme_classic() +
  theme(axis.line = element_blank(),
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        plot.title = element_text(hjust = 0.5, size = 14))
[Figure: pie chart "BMI Status Distribution Among Users" (legend: BMI Category)]
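Because one user can log weight several times, counting rows is not the same as counting users. A sketch on made-up data showing the difference; `weight_toy` is a hypothetical stand-in for weight_log:

```r
# Made-up log: user 1 has two entries, user 3 has three
weight_toy <- data.frame(
  Id = c(1, 1, 2, 3, 3, 3),
  bmi_status = c("Healthy weight", "Healthy weight", "Overweight",
                 "Obese", "Obese", "Obese")
)

# Entries per category (what counting rows gives)
entries <- table(weight_toy$bmi_status)
entries

# Distinct users per category
users <- tapply(weight_toy$Id, weight_toy$bmi_status,
                function(id) length(unique(id)))
users
```

In this toy case each category contains exactly one user, even though the entry counts differ, so a chart based on entries would overstate the categories of frequent loggers.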