This R notebook contains my end-to-end work on the Cyclistic bike-share case study, performed as a “capstone” project for Google’s excellent Data Analytics certificate program on Coursera.
It is hosted in the following repository, where you can also access READMEs and a data changelog: https://github.com/dmallia17/CyclisticBike-ShareAnalysis
The Google program recommends a data analysis process of 6 steps: ask, prepare, process, analyze, share and act. Accordingly, this notebook is structured around those steps.
Please note that some inspiration for the below work comes from an R script, which was provided in the course and is based on this blog.
# Note this attaches ggplot2, lubridate, readr, stringr, dplyr and tidyr
library(tidyverse)
library(skimr)
library(naniar)
library(geosphere)
# Including this line to disable a "friendly" but unhelpful warning we get
# when grouping by multiple columns
# https://stackoverflow.com/questions/62140483/how-to-interpret-dplyr-message-summarise-regrouping-output-by-x-override
options(dplyr.summarise.inform=F)
# Set this to false if you want to download everything for the first time
DOWNLOADED <- TRUE
# Set this to true if you need to rewrite files (e.g. false is for when
# you just need to rerun chunks for output)
REWRITE <- FALSE
Let’s start by briefly spelling out the (fictional) scenario for this case study for some context. I am a junior data analyst (woohoo!) on the marketing analyst team at Cyclistic, a bike-sharing company in Chicago. Following the conclusion of Cyclistic’s finance analysts that annual members are more profitable than casual riders (who use single-ride or day passes), the director of marketing wants to work on converting casual riders into annual members. This is motivated by the idea that casual riders are already aware of Cyclistic, and may be drawn by the company’s offerings of alternative bike types suitable for those with disabilities. To support this initiative, the key question for my work to answer with data is: how do casual riders differ from annual members?
Make use of the company’s historical bike trip data to understand differences between casual riders and those with annual memberships, and thereby unlock insights and recommendations for a marketing program designed to convert casual riders into Cyclistic members.
Because Cyclistic is a fictional company, we are making use of Divvy trip datasets made publicly available here under this license. A quick glance at this data repository reveals that the data format has changed over time, with ride data offered first (in 2013) with a file for a full year, then in per quarter segments, and then currently by month. Moreover, the naming scheme has changed in accordance with some of these changes, and consequently the index is not in chronological order.
For purposes of this analysis, we will work with the most recent full year’s worth of data, the 2022 data. This requires downloading the 12 files which constitute the 2022 collection. We will store the original csv files in a subdirectory called “data”.
if(!DOWNLOADED) {
  src_dir <- getwd()
  dir.create("data")
  setwd(paste0(src_dir, "/data"))
  for(i in 1:12){
    padded_i <- str_pad(i, 2, pad="0")
    temp_url <- sprintf(
      "https://divvy-tripdata.s3.amazonaws.com/2022%s-divvy-tripdata.zip",
      padded_i)
    destfile <- sprintf("2022%s-divvy-tripdata.zip", padded_i)
    download.file(temp_url, destfile=destfile)
    unzip(destfile)
    file.remove(destfile)
    # __MACOSX is a directory, so file.remove() won't delete it;
    # unlink() with recursive=TRUE will
    unlink("__MACOSX", recursive=TRUE)
  }
  # Restore the original working directory
  setwd(src_dir)
}
Even before opening a single file, we can make a couple of observations about the nature of the data we have to work with. This would constitute first-party data, as (in our fictional scenario) this is our company’s data. In a real-world scenario, we would ensure that we are retrieving the data directly from a company source and verify with data engineers or other analysts that this is the correct and up-to-date data. Nonetheless, we should treat this data like any other and perform checks, both statistical and visual, to ensure data integrity and catch errors. Simple storage in a subdirectory is fine for these purposes; as we combine data and produce a larger file for the full year’s analysis, we will want to follow good naming conventions and maintain a changelog.
We will be reviewing a full year’s worth of data, which should negate the bias that might arise from seasonal behavior (though we will want to consider seasonal trends in our analysis). Moreover, the data is from 2022, which is both recent (always important, but particularly the case for bike-sharing which is a more recent and growing trend) and after the major COVID shutdowns, so we should see behavior that is representative of the present behavior of Cyclistic customers. It is worth acknowledging that this data is, of course, limited to Cyclistic customers - and thereby, bike-sharing in a purely Chicago context - so we would need to be cautious in generalizing outside of the company or to other localities, but it should be appropriate given our task of considering differences between Cyclistic casual riders and members.
To dig further, it is time to actually open the files and begin to look for issues, and compile a file with the full year’s worth of data.
One of the files is named “publictripdata” instead of “tripdata”, but we can quickly make that consistent:
if(!DOWNLOADED){
  file.rename("data/202209-divvy-publictripdata.csv",
              "data/202209-divvy-tripdata.csv")
}
tripdata_by_month <- list()
for(i in 1:12){
  padded_i <- str_pad(i, 2, pad="0")
  csv_name <- sprintf("data/2022%s-divvy-tripdata.csv", padded_i)
  tripdata_by_month[[i]] <- read_csv(csv_name)
}
## Rows: 103770 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
(readr prints an identical column specification for each of the remaining eleven files; only the row counts differ: 115609, 284042, 371249, 634858, 769204, 823488, 785932, 701339, 558685, 337735 and 181806.)
The nice thing about reading in csv files with readr is that we get a quick summary of each: the number of rows and columns, the delimiter detected, and the column names and their detected types. Nonetheless, before we try to combine the dataframes into a single one, let’s check that all column names are identical.
# Check that all column names are identical
trip_colnames <- colnames(tripdata_by_month[[1]])
print(paste("All column names identical:",
            all(sapply(tripdata_by_month,
                       function(x) identical(colnames(x), trip_colnames)))))
## [1] "All column names identical: TRUE"
Let’s also take a look at the details of at least one month’s data. It would be best to do this for every month, but given time and space considerations, plus the fact that some checks are better performed visually, we will just run it for January:
skim_without_charts(tripdata_by_month[[1]])
Name | tripdata_by_month[[1]] |
Number of rows | 103770 |
Number of columns | 13 |
_______________________ | |
Column type frequency: | |
character | 7 |
numeric | 4 |
POSIXct | 2 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
ride_id | 0 | 1.00 | 16 | 16 | 0 | 103770 | 0 |
rideable_type | 0 | 1.00 | 11 | 13 | 0 | 3 | 0 |
start_station_name | 16260 | 0.84 | 10 | 53 | 0 | 758 | 0 |
start_station_id | 16260 | 0.84 | 3 | 44 | 0 | 758 | 0 |
end_station_name | 17927 | 0.83 | 10 | 51 | 0 | 724 | 0 |
end_station_id | 17927 | 0.83 | 3 | 44 | 0 | 724 | 0 |
member_casual | 0 | 1.00 | 6 | 6 | 0 | 2 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
---|---|---|---|---|---|---|---|---|---|
start_lat | 0 | 1 | 41.90 | 0.05 | 41.65 | 41.88 | 41.89 | 41.93 | 45.64 |
start_lng | 0 | 1 | -87.65 | 0.05 | -87.83 | -87.66 | -87.64 | -87.63 | -73.80 |
end_lat | 86 | 1 | 41.90 | 0.05 | 41.65 | 41.88 | 41.90 | 41.93 | 42.07 |
end_lng | 86 | 1 | -87.65 | 0.03 | -87.83 | -87.66 | -87.64 | -87.63 | -87.52 |
Variable type: POSIXct
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
started_at | 0 | 1 | 2022-01-01 00:00:05 | 2022-01-31 23:58:37 | 2022-01-16 11:50:04 | 100315 |
ended_at | 0 | 1 | 2022-01-01 00:01:48 | 2022-02-01 01:46:16 | 2022-01-16 12:04:35 | 100047 |
This output gives us an overall summary of the dataframe, and then statistics for the character type columns, numeric type columns and finally date type columns.
To inspect some values, let’s use glimpse:
glimpse(tripdata_by_month[[1]])
## Rows: 103,770
## Columns: 13
## $ ride_id <chr> "C2F7DD78E82EC875", "A6CF8980A652D272", "BD0F91DFF7…
## $ rideable_type <chr> "electric_bike", "electric_bike", "classic_bike", "…
## $ started_at <dttm> 2022-01-13 11:59:47, 2022-01-10 08:41:56, 2022-01-…
## $ ended_at <dttm> 2022-01-13 12:02:44, 2022-01-10 08:46:17, 2022-01-…
## $ start_station_name <chr> "Glenwood Ave & Touhy Ave", "Glenwood Ave & Touhy A…
## $ start_station_id <chr> "525", "525", "TA1306000016", "KA1504000151", "TA13…
## $ end_station_name <chr> "Clark St & Touhy Ave", "Clark St & Touhy Ave", "Gr…
## $ end_station_id <chr> "RP-007", "RP-007", "TA1307000001", "TA1309000021",…
## $ start_lat <dbl> 42.01280, 42.01276, 41.92560, 41.98359, 41.87785, 4…
## $ start_lng <dbl> -87.66591, -87.66597, -87.65371, -87.66915, -87.624…
## $ end_lat <dbl> 42.01256, 42.01256, 41.92533, 41.96151, 41.88462, 4…
## $ end_lng <dbl> -87.67437, -87.67437, -87.66580, -87.67139, -87.627…
## $ member_casual <chr> "casual", "casual", "member", "casual", "member", "…
We can see we have a ride id, a rideable type (the type of vehicle, such as “electric_bike” or “classic_bike”), the start and end times of rides (note that we actually have an end time in February, the “max” time for the January data), start and end station names and ids, start and end latitudes and longitudes, and whether the rider was a member or a “casual” rider.
We can quickly check the unique values for rideable_type and member_casual:
print(unique(tripdata_by_month[[1]]$rideable_type))
## [1] "electric_bike" "classic_bike" "docked_bike"
print(unique(tripdata_by_month[[1]]$member_casual))
## [1] "casual" "member"
NOTE: Two other items jump out here:
We have missing values for station names and ids, as well as trip end latitude and longitude.
We have 103770 rows and 103770 unique ride ids, so all seems well for January, but we should check that this holds for all other months.
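A minimal sketch of that per-month check (assuming the list of dataframes is still available as tripdata_by_month):
# TRUE only if every month has as many unique ride ids as it has rows
all(sapply(tripdata_by_month,
           function(x) n_distinct(x$ride_id) == nrow(x)))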
For our final step, let’s combine the dataframes into one:
all_tripdata <- bind_rows(tripdata_by_month)
We can save this result as a file, so we don’t need to repeat these steps in the future.
if(REWRITE) {
write_csv(all_tripdata, "2022-divvy-tripdata-V01.csv")
}
Now we can begin work on all of our 2022 data, getting some sense of the nature and quality of the data, extracting additional features (columns) and cleaning data as need be.
First, we can reload the data, if needed.
defined <- ls()
if(!("all_tripdata" %in% defined)) {
all_tripdata <- read_csv("2022-divvy-tripdata-V01.csv")
}
Next, let’s take a quick look at the top or head of the dataset for a refresh on content.
head(all_tripdata)
From this we can see we have 13 columns; how about the number of rows for 2022 combined? Again, we can use skim_without_charts to check this and retrieve a lot more information, all in one call.
skim_without_charts(all_tripdata)
Name | all_tripdata |
Number of rows | 5667717 |
Number of columns | 13 |
_______________________ | |
Column type frequency: | |
character | 7 |
numeric | 4 |
POSIXct | 2 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
ride_id | 0 | 1.00 | 16 | 16 | 0 | 5667717 | 0 |
rideable_type | 0 | 1.00 | 11 | 13 | 0 | 3 | 0 |
start_station_name | 833064 | 0.85 | 7 | 64 | 0 | 1674 | 0 |
start_station_id | 833064 | 0.85 | 3 | 44 | 0 | 1313 | 0 |
end_station_name | 892742 | 0.84 | 9 | 64 | 0 | 1692 | 0 |
end_station_id | 892742 | 0.84 | 3 | 44 | 0 | 1317 | 0 |
member_casual | 0 | 1.00 | 6 | 6 | 0 | 2 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
---|---|---|---|---|---|---|---|---|---|
start_lat | 0 | 1 | 41.90 | 0.05 | 41.64 | 41.88 | 41.90 | 41.93 | 45.64 |
start_lng | 0 | 1 | -87.65 | 0.03 | -87.84 | -87.66 | -87.64 | -87.63 | -73.80 |
end_lat | 5858 | 1 | 41.90 | 0.07 | 0.00 | 41.88 | 41.90 | 41.93 | 42.37 |
end_lng | 5858 | 1 | -87.65 | 0.11 | -88.14 | -87.66 | -87.64 | -87.63 | 0.00 |
Variable type: POSIXct
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
started_at | 0 | 1 | 2022-01-01 00:00:05 | 2022-12-31 23:59:26 | 2022-07-22 15:03:59 | 4745862 |
ended_at | 0 | 1 | 2022-01-01 00:01:48 | 2023-01-02 04:56:45 | 2022-07-22 15:24:44 | 4758633 |
It looks like, all told, we have over 5 million trips to analyze! That’s a nice sample size, one which would likely overwhelm spreadsheet software (Excel / Google Sheets), so it’s a good thing we are using R (though SQL would also be a good option here).
Let’s look over the results column by column to get to know the data better.
ride_id: these are 16-character ids, and there are (thankfully) as many unique ids as there are rows in the data.
rideable_type: this gives us the actual type of vehicle used; let’s see a quick breakdown by type.
table(all_tripdata$rideable_type)
##
## classic_bike docked_bike electric_bike
## 2601214 177474 2889029
Classic and electric bikes clearly make up the vast majority of the data. The Divvy website doesn’t explain what a docked bike is, but it sounds like this may not be a normal ride type. We will revisit this below.
start_station_name, start_station_id, end_station_name, end_station_id: per the names, these are the names (strings) of stations and the ids assigned to them. Around 15% of the data is missing here; if we peek ahead, we can see that we have the majority of latitude and longitude data, so it is certainly not a good idea to simply discard rows missing these values, particularly when, if desired, we may be able to recover the correct values using latitude and longitude. The frequency of missingness seems consistent across name and id, suggesting that if we’re missing one, we are missing the other, but we can check that to be 100% sure.
print(paste("Number missing both start name and id:",
nrow(all_tripdata %>%
filter(is.na(start_station_name) & is.na(start_station_id)))))
## [1] "Number missing both start name and id: 833064"
print(paste("Number missing both end name and id:",
nrow(all_tripdata %>%
filter(is.na(end_station_name) & is.na(end_station_id)))))
## [1] "Number missing both end name and id: 892742"
Interestingly, there are considerably more names than ids, and a few more end ids than start ids. If we have time, we can revisit this to see if we have misspelled (and therefore duplicated) station names, but this assumes we want to make use of the names.
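If we do pursue that, a natural starting point would be to look for station ids associated with more than one name (a rough sketch, not run as part of the pipeline here):
# Start station ids mapped to multiple names - candidates for misspelled or
# inconsistently recorded station names
all_tripdata %>%
  filter(!is.na(start_station_id)) %>%
  distinct(start_station_id, start_station_name) %>%
  count(start_station_id) %>%
  filter(n > 1)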
member_casual: this identifies whether the ride was taken by a member or a casual rider (hence only 2 unique values); let’s check the breakdown.
table(all_tripdata$member_casual)
##
## casual member
## 2322032 3345685
prop.table(table(all_tripdata$member_casual))
##
## casual member
## 0.4096944 0.5903056
Excellent, while the data does skew in favor of members (~60/40), we also have quite a sample of casual riders.
start_lat, start_lng, end_lat, end_lng: what these offer is no surprise. As mentioned above, there is a small amount of missingness for end locations; once we have considered other information we can extract, we’ll likely want to come back and remove any remaining such cases, given that we want full rides and that there are so few cases.
started_at, ended_at: again, per the names, these give us the full datetime (year, month, day, hour, minute, second) for a ride’s start and stop. As we saw with the January data, the max value actually exceeds 2022; at least one person rode into the new year! Thankfully, there is no missingness here, which is great news given that we can use these columns to calculate ride durations (not present in these files). We can also unpack them further to ask questions about daily or monthly ride patterns.
Before we calculate additional columns, we can get a quick snapshot of what the missingness looks like in the data, and then revisit it once we’ve executed any deletion based on missingness or other problematic values. To do this, we can use the wonderful naniar package. Note that there is a known issue - also see here - with naniar producing an extra blank plot in the previewed html output, so we should only pay attention to the second plot.
gg_miss_upset(all_tripdata, nsets=6)
This captures in a nice visual fashion what we saw above and more: station names and ids are always missing together (in the sense that both start station name and id, or both end station name and id, will be missing, never just the name or just the id). Some cases have just start information missing, others just end information, some both, and there is a set of almost six thousand cases where the end station information will be unrecoverable, given that end latitude and longitude are also missing.
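As a direct sanity check on that last number (a one-off tabulation, not part of the pipeline):
# Rides where end station name, id, latitude and longitude are all missing,
# i.e. those whose end location is unrecoverable
nrow(all_tripdata %>%
       filter(is.na(end_station_name), is.na(end_station_id),
              is.na(end_lat), is.na(end_lng)))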
Now we can add in additional columns based on the datetime columns. We don’t need to account for year, as this is all 2022 data. In addition to drawing on inspiration from the blog mentioned at the top of this notebook for generating new columns, this blog post inspired me to also include hours, so we can consider riding habits during different parts of the day.
all_tripdata <- all_tripdata %>%
  mutate(
    ride_datetime = as_datetime(started_at),
    month = month(ride_datetime, label=T),
    day = day(ride_datetime),
    day_name = wday(ride_datetime, label=T),
    hour = hour(ride_datetime)
  )
head(all_tripdata)
And let’s add ride durations:
# Note for subsequent work, we should convert the results of difftime, instances
# of class "difftime" to numeric; specifically we'll use integers as we don't
# need fractions of seconds, and can then easily check the number of zero
# length trips
all_tripdata <- all_tripdata %>%
  mutate(ride_length = as.integer(difftime(ended_at, started_at, units="secs")))
head(all_tripdata)
Now let’s perform a couple of checks and start doing some cleaning. Calling summary on ride_length gives us a place to start.
summary(all_tripdata %>% select(ride_length))
## ride_length
## Min. :-621201
## 1st Qu.: 349
## Median : 617
## Mean : 1167
## 3rd Qu.: 1108
## Max. :2483235
Clearly there are negative durations present, which should not be preserved. Communication with data engineers or other analysts may resolve where these came from and prevent them in the future, but for now we will want to see how many such cases are present. Moreover, it stands to reason that there are likely also zero-length cases, and, given that this data contains these errors (which Divvy’s website describes as removed), we should also check for another issue it addresses: trips less than 60 seconds long.
print(paste("Number of rows with negative ride length:",
nrow(all_tripdata %>% filter(ride_length < 0))))
## [1] "Number of rows with negative ride length: 100"
print(paste("Number of rows with empty ride length:",
nrow(all_tripdata %>% filter(ride_length == 0))))
## [1] "Number of rows with empty ride length: 431"
print(paste("Number of rows with positive ride length less than 60 seconds:",
nrow(all_tripdata %>% filter(ride_length > 0 & ride_length < 60))))
## [1] "Number of rows with positive ride length less than 60 seconds: 120558"
The blog which provided some inspiration for this analysis mentions a separate but related issue, where some rides start at station “HQ QR” when the ride was in fact an inspection. However, the below shows that there are no such cases in the 2022 data:
nrow(all_tripdata %>% filter(start_station_name == "HQ QR"))
## [1] 0
Let’s delete rows with bad trip durations, creating a new dataframe with the result:
start_count <- nrow(all_tripdata)
all_tripdata_v2 <- all_tripdata %>%
  filter(ride_length >= 60)
print(paste("Removed", start_count - nrow(all_tripdata_v2), "rows"))
## [1] "Removed 121089 rows"
print(paste("Number of rows now missing end lat/lon:",
            nrow(all_tripdata_v2 %>%
                   filter(is.na(end_lat) & is.na(end_lng)))))
## [1] "Number of rows now missing end lat/lon: 5856"
Interestingly, this shows us that focusing on duration alone did not address the cases with missing end latitude and longitude. Let’s remove those now:
start_count <- nrow(all_tripdata_v2)
all_tripdata_v2 <- all_tripdata_v2 %>%
  filter(!(is.na(end_lat) & is.na(end_lng)))
print(paste("Removed", start_count - nrow(all_tripdata_v2), "rows"))
## [1] "Removed 5856 rows"
Checking the distribution of rideable_type after these removals shows that “docked_bike” rides were not disproportionately removed, so the type does not appear to simply mark erroneous rides. In fact, this blog suggests “docked_bike” might be an older name for “classic_bike”. Moreover, the number of these instances is comparatively small, so we will leave them as is.
table(all_tripdata_v2$rideable_type)
##
## classic_bike docked_bike electric_bike
## 2559448 173344 2807980
Finally, let’s revisit the missingness overview.
gg_miss_upset(all_tripdata_v2, nsets=6)
Here we write out the intermediate result after our processing and cleaning.
if(REWRITE) {
write_csv(all_tripdata_v2, "2022-divvy-tripdata-V02.csv")
}
In addition, for resuming our work in this notebook it is best to save out the data as an RDS object. Otherwise, if we try to reload the data only from csv, we will need to do extra work to convert day_name and month back into ordered form, as they will be interpreted as just characters.
if(REWRITE) {
saveRDS(all_tripdata_v2, "2022-divvy-tripdata-V02.RDS")
}
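For reference, if we ever do need to resume from the csv instead, rebuilding the ordered columns would look roughly like the following sketch (assuming an English locale, where the labels produced by lubridate match month.abb; this is not needed when using the RDS file):
# Sketch: restore ordered factors after a csv reload
rebuild_ordered_cols <- function(df) {
  df %>%
    mutate(month = factor(month, levels = month.abb, ordered = TRUE),
           day_name = wday(started_at, label = TRUE))
}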
Work on the analysis phase naturally reveals additional problems missed during initial processing, as well as additional needs. To improve on the initial processing, let’s do the following, beginning with reading in the data.
all_tripdata_v3 <- readRDS("2022-divvy-tripdata-V02.RDS")
First, the number of the day in the month was initially included for considering trends over the course of a month, but this is unlikely to yield any significant patterns, so we can drop day for now. Second, analyzing trends by hour requires both start and end hours, so let’s rename hour to the more accurate start_hour and also include an end hour. In a similar vein, ride_datetime was an extraneous column, as started_at and ended_at were already interpreted as datetimes, and we should also rename our day_name and month columns to indicate that these refer to the start of the ride. Third, we initially left ride duration in seconds, but consider the following.
summary(all_tripdata_v3$ride_length)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 60 364 630 1001 1122 2061244
Examining these numbers requires a lot of mental arithmetic. Now, in minutes, broken down by rider type:
all_tripdata_v3 %>%
  mutate(ride_length = ride_length/60) %>%
  group_by(member_casual) %>%
  summarize(min=min(ride_length), max=max(ride_length), mean=mean(ride_length),
            median=median(ride_length))
This is a lot easier to read, but the max for casual rides clearly evidences a problem of outliers. There are 1440 minutes in a day, so the above shows us that the longest any member “rode” was a little over a day; for casual riders the max is around 24 days… Let’s check how many rides overall were casual, and then how many of those exceeded the maximum member ride duration (i.e. cases exceeding roughly a day).
total_casual <- nrow(all_tripdata_v3 %>% filter(member_casual == "casual"))
total_casual_exceeding_member_max <- nrow(
  all_tripdata_v3 %>%
    filter(member_casual == "casual" & (ride_length / 60.0) > 1500))
total_casual
## [1] 2268975
total_casual_exceeding_member_max
## [1] 66
Out of over 2 million casual rides, only 66 exceeded the member max (a little over a day). Thus, for our fourth change, we drop these comparatively few cases, which are unlikely to tell us much about casual riders, capping duration at 1500 minutes (a little over the observed maximum member ride duration).
The below accomplishes these 4 changes in what is basically a one-liner. dplyr is pretty neat!
start_count <- nrow(all_tripdata_v3)
all_tripdata_v3 <- all_tripdata_v3 %>%
  select(-c(day, ride_datetime)) %>%
  rename(start_hour=hour, start_day_name=day_name, start_month=month) %>%
  mutate(end_hour = hour(ended_at),
         ride_len_mins = ride_length / 60.0) %>%
  select(-ride_length) %>%
  filter(ride_len_mins <= 1500)
print(paste("Removed", start_count - nrow(all_tripdata_v3), "rows"))
## [1] "Removed 66 rows"
head(all_tripdata_v3)
An additional change needed for considering longitude and latitude data is removing erroneous values in these columns. Previously we discarded cases with missing values, but a closer look back at the ranges of values these take on reveals problems even where we “have” values:
all_tripdata_v3 %>%
select(start_lat,end_lat,start_lng, end_lng) %>%
skim_without_charts()
Name | Piped data |
Number of rows | 5540706 |
Number of columns | 4 |
_______________________ | |
Column type frequency: | |
numeric | 4 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
---|---|---|---|---|---|---|---|---|---|
start_lat | 0 | 1 | 41.90 | 0.05 | 41.64 | 41.88 | 41.90 | 41.93 | 45.64 |
end_lat | 0 | 1 | 41.90 | 0.07 | 0.00 | 41.88 | 41.90 | 41.93 | 42.37 |
start_lng | 0 | 1 | -87.65 | 0.03 | -87.84 | -87.66 | -87.64 | -87.63 | -73.80 |
end_lng | 0 | 1 | -87.65 | 0.10 | -88.14 | -87.66 | -87.64 | -87.63 | 0.00 |
We have a zero as the minimum end latitude and as the maximum end longitude (per the maps::map() documentation, longitudes to the west of Greenwich are negative); the maximum start latitude lies north of Illinois and the maximum start longitude is east of Illinois. Without more precise knowledge of the extent of Divvy’s network beyond the map provided here, let’s check how many cases violate the boundaries of Illinois, referring to the bounding box information here:
y_lat_min <- 36.970298
y_lat_max <- 42.508481
x_lng_min <- -91.513079
x_lng_max <- -87.494756
violating_case_num <- nrow(all_tripdata_v3 %>%
  filter(start_lng < x_lng_min | end_lng < x_lng_min |
         start_lng > x_lng_max | end_lng > x_lng_max |
         start_lat < y_lat_min | end_lat < y_lat_min |
         start_lat > y_lat_max | end_lat > y_lat_max))
violating_case_num
## [1] 12
Thankfully we only have 12 cases, and can discard these without much concern.
start_count <- nrow(all_tripdata_v3)
all_tripdata_v3 <- all_tripdata_v3 %>%
  filter(!(start_lng < x_lng_min | end_lng < x_lng_min |
           start_lng > x_lng_max | end_lng > x_lng_max |
           start_lat < y_lat_min | end_lat < y_lat_min |
           start_lat > y_lat_max | end_lat > y_lat_max))
print(paste("Removed", start_count - nrow(all_tripdata_v3), "rows"))
## [1] "Removed 12 rows"
With these columns in better shape, we can now create a column for distance based on the start and end coordinates. To do this we can use the geosphere package’s distGeo for an accurate distance calculation between starting and ending (longitude, latitude) pairs; results are given in meters. These distances will be somewhat unreliable for a couple of reasons: distGeo measures the straight-line (geodesic) distance between the endpoints, not the route actually ridden, and a round trip that starts and ends at the same station will register as zero distance no matter how far the rider actually went.
Getting these calculations to work was very tricky, as this is a row-wise calculation where two pairs of columns need to be combined. I am leaving in some prior attempts, which either exhausted memory or took too long, from before I found this very helpful Stack Overflow thread; its author seems to have started in a similar place and found a solution which had not occurred to me, reducing processing time from endless minutes to only a few seconds.
# EXHAUSTS MEMORY - distm computes the full pairwise distance matrix
# all_tripdata_v3$ride_dist_meters <- distm(
#   cbind(all_tripdata_v3$start_lng, all_tripdata_v3$start_lat),
#   cbind(all_tripdata_v3$end_lng, all_tripdata_v3$end_lat)
# )[1,]
# Runs "indefinitely" - rowwise() forces a slow per-row computation
# all_tripdata_v3 <- all_tripdata_v3 %>%
#   rowwise() %>%
#   mutate(ride_dist_meters = distGeo(
#     c_across(c(start_lng, start_lat)), c_across(c(end_lng, end_lat))))
# Encounters a vector length error; postscript - this is because this would
# quite literally concatenate the 2 columns end after end for 1 very long
# vector!
# all_tripdata_v3 <- all_tripdata_v3 %>%
#   mutate(ride_dist_meters=distGeo(
#     c(start_lng, start_lat), c(end_lng, end_lat)))
# The working, vectorized approach: distGeo accepts two-column matrices of
# (lng, lat) pairs and returns a vector of distances in meters
all_tripdata_v3 <- all_tripdata_v3 %>%
  mutate(ride_dist_meters = distGeo(
    cbind(start_lng, start_lat), cbind(end_lng, end_lat)
  ))
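Before trusting the new column wholesale, a quick spot check of distGeo on a single made-up pair of points (hypothetical coordinates near downtown Chicago) confirms the (lng, lat) input order and the output in meters:
# Two points roughly 1.4 km apart; distGeo should return about 1400
distGeo(c(-87.63, 41.88), c(-87.64, 41.89))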
Now we can write out this refined dataset.
if(REWRITE) {
write_csv(all_tripdata_v3, "2022-divvy-tripdata-V03.csv")
saveRDS(all_tripdata_v3, "2022-divvy-tripdata-V03.RDS")
}
Again, we can reload the data, if needed.
defined <- ls()
if(!("final_data" %in% defined)) {
final_data <- readRDS("2022-divvy-tripdata-V03.RDS")
}
Now that we have our cleaned data, let’s first look at the numbers and proportions of rides by rider types:
table(final_data$member_casual)
##
## casual member
## 2268904 3271790
prop.table(table(final_data$member_casual))
##
## casual member
## 0.4094982 0.5905018
Thankfully the deletions executed in the cleaning phase have not fundamentally altered the proportions of rides.
As a first step, we can consider the minimum, maximum, mean and median ride lengths. Let’s look at this overall…
summary(final_data$ride_len_mins)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 6.067 10.500 16.605 18.700 1499.950
…and now broken down by casual riders vs. members.
final_data %>%
  group_by(member_casual) %>%
  summarize(min=min(ride_len_mins), max=max(ride_len_mins),
            mean=mean(ride_len_mins), median=median(ride_len_mins))
The fact that the mean is significantly greater than the median, both overall and for each group, suggests there is considerable rightward skew, with the mean being inflated by outlier long rides.
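To make that tail concrete, we can glance at a few upper quantiles (a quick one-off check):
# The upper tail pulls the mean well above the median
quantile(final_data$ride_len_mins, probs = c(0.5, 0.9, 0.99, 0.999))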
Inspecting this visually is tricky even accounting for major outliers in cleaning, given the very broad range of values. One option is to constrain the range further to get a better sense of distributions.
final_data %>%
filter(ride_len_mins <= 120) %>%
ggplot(aes(x=ride_len_mins)) + geom_density(aes(fill=member_casual), alpha=0.5)
This plot does confirm that the distribution for members has a sharper peak covering the range of several-minute rides, whereas casual riders clearly skew towards longer rides. We can hypothesize that this may reflect ride motivations: members may primarily be using rides for short trips to and from work or social gatherings, whereas casual riders may be using bikes for longer outings (an end in themselves).
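One way to quantify the tail difference the densities suggest (a sketch; the 30-minute cutoff is an arbitrary choice):
# Share of rides longer than 30 minutes, by rider type
final_data %>%
  group_by(member_casual) %>%
  summarize(prop_over_30min = mean(ride_len_mins > 30))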
A quick starting place for understanding behavior by day is to examine the frequency of rides by day, broken into casual and member rides. NOTE: this plot and others like it below require careful interpretation, given that the distribution of member rides to casual rides is not 50/50, but in fact closer to 60/40. That said, we can still get a feeling for the different distributions side by side.
ggplot(final_data, aes(x=start_day_name)) +
geom_bar(aes(fill=member_casual), position="dodge")
The above gives us some indication that more casual rides occur on weekends than weekdays; member rides are somewhat more evenly distributed but do emphasize weekday use. Again, this suggests differing reasons for usage.
Let’s look at a fine-toothed breakdown of statistics by day. It turns out that the minimum allowed duration (1 minute) occurs every day of the week, as do values close to the max (recall we capped the data at roughly the maximum member ride length), so let’s just retrieve the mean and median. We group by start_day_name first so as to compare values day by day.
final_data %>%
group_by(start_day_name, member_casual) %>%
summarize(mean(ride_len_mins), median(ride_len_mins))
We can also examine this visually (let’s just examine median):
final_data %>%
  group_by(start_day_name, member_casual) %>%
  summarize(median=median(ride_len_mins)) %>%
  ggplot(aes(x=start_day_name, y=median)) +
  geom_col(aes(fill=member_casual), position="dodge")
Overall, the actual per-ride statistics conform to the picture given by ride counts per day (as well as the raw ride statistics we examined before): overall, casual users go for longer rides, and members and casual riders differ in how their behavior varies by day. Member ride statistics don’t vary much day by day (though durations do increase a little on weekends, perhaps when rides can be more leisurely), whereas casual rides shorten by multiple minutes on weekdays (e.g. compare the Sunday casual median of ~15 minutes with the Tuesday casual median of ~12 minutes). Geographic and distance-traveled information may shed more light on what is happening in the casual group, but this does suggest that there may be multiple groups within casual riders:
a (perhaps small) group that are more likely to ride on any day of the week and who are responsible for the shorter, consistent behaviors on weekdays;
and a likely larger group who primarily ride for leisure on weekends. The first group would be the ideal targets of a campaign for converting casual riders to members!
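A rough first cut at this two-group hypothesis (a sketch, and certainly not conclusive on its own):
# Share of casual rides occurring on weekends vs. weekdays
final_data %>%
  filter(member_casual == "casual") %>%
  mutate(is_weekend = start_day_name %in% c("Sat", "Sun")) %>%
  count(is_weekend) %>%
  mutate(prop = n / sum(n))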
Again, for monthly behavior as for daily behavior, we can begin by visualizing the counts of rides by month (heeding the aforementioned warning about the proportions of casual rides to member rides).
ggplot(final_data, aes(x=start_month)) +
geom_bar(aes(fill=member_casual), position="dodge")
Unsurprisingly, the months of December, January and February are noticeably less popular for members and casual riders alike; these are winter months in North America, and many will seek other transportation options in the face of low temperatures, snow and ice. The warmer, safer months certainly see a lot more bike usage, with increases by a factor of 4+. Yet members again seem to evidence more consistency (as with their ride durations and use across the week), having a less peaked distribution; in fact, the months of maximum use for members come later in the year than for casual riders.
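We can confirm that last observation directly by pulling the peak month for each group (a quick sketch):
# Month with the most rides, per rider group
final_data %>%
  count(member_casual, start_month) %>%
  group_by(member_casual) %>%
  slice_max(n, n = 1)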
Let’s consider statistics by month:
final_data %>%
group_by(start_month, member_casual) %>%
summarize(mean=mean(ride_len_mins), median=median(ride_len_mins))
Again, we can also visualize this (for the median):
final_data %>%
  group_by(start_month, member_casual) %>%
  summarize(median=median(ride_len_mins)) %>%
  ggplot(aes(x=start_month, y=median)) +
  geom_col(aes(fill=member_casual), position="dodge")
As with ridership by day, it seems there are two groups among casual riders, and this is worth mining further.
Days were an intuitive starting place, and months give us a higher level overview, but we can also zoom in further to consider hours of usage. It is an interesting avenue to mine, but is somewhat difficult to review and plot, given that we want to consider counts by start and end hour pairs, and our analysis should be flexible enough to consider where rides may exceed the boundary of an hour.
One way to examine hours of usage is to make a plot counting the frequency of each combination of ride start hour and end hour. We can easily create this 24x24 grid using ggplot’s geom_count, which has an advantage over a heatmap approach in that we can make use of both size and color: in the below, size indicates the raw count and color captures the proportion. We make separate plots for casual rides and member rides, as otherwise the plots would become too small to interpret. This Stack Overflow thread was helpful in identifying how to set up our axes.
# Some setup for nice axis labels
breaks <- seq(0, 23, 3) # Breaks to display on the axis
hrs <- c(12, 3, 6, 9)
# Create 12hr clock labels
labels <- c(paste0(hrs, "AM"), paste0(hrs, "PM"))
final_data %>%
  filter(member_casual == "casual") %>%
  mutate(end_hour = hour(ended_at)) %>%
  ggplot(aes(x=start_hour, y=end_hour)) +
  geom_count(aes(color=after_stat(prop))) +
  scale_x_continuous(breaks=breaks, labels=labels) +
  scale_y_continuous(breaks=breaks, labels=labels) +
  labs(title="Counts for Casual Rides")
The above does require some careful reading and interpretation. Moreover, it is complicated by the fact that there are rides lasting multiple hours that end before midnight (values above the y=x line) and even rides which carry over into the subsequent day (values below the y=x line). Nevertheless, the plot gives a fascinating window into hourly trends. For comparison, here is the same view for member rides:
final_data %>%
  filter(member_casual == "member") %>%
  mutate(end_hour = hour(ended_at)) %>%
  ggplot(aes(x=start_hour, y=end_hour)) +
  geom_count(aes(color=after_stat(prop))) +
  scale_x_continuous(breaks=breaks, labels=labels) +
  scale_y_continuous(breaks=breaks, labels=labels) +
  labs(title="Counts for Member Rides")
The story for members has some clear differences (though as with many such plots, we need to heed the difference in ride counts between members and casual riders).
We should also examine which types of bikes casual riders are using compared to members.
ggplot(final_data, aes(x=rideable_type)) +
geom_bar(aes(fill=member_casual), position="dodge")
This quick glance (a) again requires care because of the different numbers of casual vs. member rides and (b) is muddled by the presence of the docked bike type; let’s try this again, counting docked_bike as a classic_bike (per the blog mentioned in processing):
final_data %>%
  mutate(rideable_type = replace(
    rideable_type, rideable_type == "docked_bike", "classic_bike")) %>%
  ggplot(aes(x=rideable_type)) +
  geom_bar(aes(fill=member_casual), position="dodge")
Assuming this combining is correct, we can observe what seems to be a preference among casual riders for electric bikes. Members, on the other hand, actually lean towards classic bikes. That said, the lack of an extreme imbalance suggests no obvious obstacle, at least as far as these preferences go, to some casual riders becoming members.
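Since the raw counts are skewed by the differing group sizes, within-group proportions make the comparison cleaner (a sketch, again folding docked_bike into classic_bike):
# Proportion of each bike type within each rider group
final_data %>%
  mutate(rideable_type = replace(
    rideable_type, rideable_type == "docked_bike", "classic_bike")) %>%
  count(member_casual, rideable_type) %>%
  group_by(member_casual) %>%
  mutate(prop = n / sum(n))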
Similar to our analysis of ride durations, let’s try examining distances. Here we will need to limit our analysis to non-zero distances, for, as mentioned in the processing stage, we could have a ride that starts and stops at the same station, which would look as if the ride had gone nowhere. But only considering distances greater than zero will not be enough to exclude cases spanning only a few inches or feet, so let’s limit the results to at least an American football field length: 100 yards (between end zones), or about 91 meters.
football_length_meters <- 100 * 0.9144
Let’s check how many rows this will exclude and how it will affect each rider group:
final_data %>%
  group_by(member_casual) %>%
  filter(ride_dist_meters < football_length_meters) %>%
  summarize(count=n())
These counts aren’t great, but they’re not the end of the world given our sample size. We must be cautious, though, because this exclusion affects casual rides more than member rides, the opposite of what we would expect based only on the overall rider type proportions (60/40 members to casual riders). With that caveat noted, let’s look at the distance statistics, first overall…
final_data %>%
  filter(ride_dist_meters >= football_length_meters) %>%
  summarize(filtered_count=n(), min=min(ride_dist_meters),
            max=max(ride_dist_meters), mean=mean(ride_dist_meters),
            median=median(ride_dist_meters))
…and now broken down by casual riders vs. members.
final_data %>%
  filter(ride_dist_meters >= football_length_meters) %>%
  group_by(member_casual) %>%
  summarize(min=min(ride_dist_meters), max=max(ride_dist_meters),
            mean=mean(ride_dist_meters), median=median(ride_dist_meters))
As with ride durations, we again have noticeable rightward skew evidenced by the means being considerably larger than the medians.
Visualizing the distributions here will also be tricky, particularly given the outliers; let’s try looking at kilometers:
final_data %>%
filter(ride_dist_meters >= football_length_meters) %>%
ggplot(aes(x=ride_dist_meters / 1000)) + geom_density(aes(fill=member_casual), alpha=0.5)
Given more time, it would likely also be informative to consider distance in the context of days of the week or monthly/seasonal behavior, as we did for ride durations, but the above appears to conform to what we gathered from the ride duration data. Members tend to be more consistent in their ride patterns, with a much stronger peak around shorter trips, possibly commuting to or from work or local social events, whereas casual riders are more diverse and include longer rides, possibly an event in themselves (e.g. a long weekend bike ride).
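As a starting point for that extension, the day-of-week breakdown would look roughly like this (a sketch):
# Median ride distance (km) by day of week and rider type
final_data %>%
  filter(ride_dist_meters >= football_length_meters) %>%
  group_by(start_day_name, member_casual) %>%
  summarize(median_km = median(ride_dist_meters) / 1000)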
Initially I intended to make use of the longitude and latitude data to consider differences between users by geography, mapping out proportions of casual rides to member rides at different stations. However, I set this aside due to a lack of time: even if creating the initial graph would not be very complex, refining it for interpretation or sharing would likely be very difficult, given the great density when zoomed out and the limited view when zoomed in. For context, Divvy’s map gives an indication of why this might be challenging to visualize.
The act phase in this scenario consists of two items: (1) my top three recommendations and (2) sharing this analysis with the wider world!
My recommendations are the following:
For sharing, this work will be hosted on my website (there’s a good chance that’s how you found it!), as a projects item linking to the repository and as a standalone page showing this notebook in all its beauty.
Few projects are ever really complete, and this one is no exception. Limited in time by my work schedule, I tried to do a thorough analysis in R, but would like to continue this work at some point in the following ways: