Purpose

This R notebook contains my end-to-end work on the Cyclistic bike-share case study, performed as a “capstone” project for Google’s excellent Data Analytics certificate program on Coursera.

It is hosted in the following repository, where you can also access READMEs and a data changelog: https://github.com/dmallia17/CyclisticBike-ShareAnalysis

The Google program recommends a data analysis process of 6 steps: ask, prepare, process, analyze, share and act. Accordingly, this notebook is structured around those steps.

Please note that some inspiration for the below work comes from an R script, which was provided in the course and is based on this blog.

# Note this attaches ggplot2, lubridate, readr, stringr, dplyr and tidyr
library(tidyverse)
library(skimr)
library(naniar)
library(geosphere)

# Including this line to disable a "friendly" but unhelpful warning we get
# when grouping by multiple columns
# https://stackoverflow.com/questions/62140483/how-to-interpret-dplyr-message-summarise-regrouping-output-by-x-override
options(dplyr.summarise.inform=F)

# Set this to false if you want to download everything for the first time
DOWNLOADED <- TRUE
# Set this to true if you need to rewrite files (e.g. false is for when
# you just need to rerun chunks for output)
REWRITE <- FALSE

1. Ask

Let’s start by briefly spelling out the (fictional) scenario for this case study for some context. I am a junior data analyst (woohoo!) on the marketing analyst team at Cyclistic, a bike-sharing company in Chicago. Following the conclusion from the finance analysts at Cyclistic that annual memberships are more profitable than casual riders (who use single ride or day passes), the director of marketing wants to work on converting casual riders into Cyclistic members who have an annual membership. This is motivated by the idea that these casual consumers are already aware of Cyclistic, and may be drawn by the company’s offerings of alternative bike types suitable for those for disabilities. To support this initiative, the key question for my work to answer with data, is how do casual riders differ from those with annual memberships?

Business Task

Make use of the company’s historical bike trip data to understand differences between casual riders and those with annual memberships, and thereby unlock insights and recommendations for a marketing program designed to convert casual riders into Cyclistic members.

Key Stakeholders

  • Lily Moreno (Director of Marketing, and my manager)
  • Marketing Analytics Team (my colleagues)
  • Executive Team (company executives, who have final say on the proposed marketing program)

2. Prepare

Original Data Organization and Data Retrieval

Because Cyclistic is a fictional company, we are making use of Divvy trip datasets made publicly available here under this license. A quick glance at this data repository reveals that the data format has changed over time, with ride data offered first (in 2013) with a file for a full year, then in per quarter segments, and then currently by month. Moreover, the naming scheme has changed in accordance with some of these changes, and consequently the index is not in chronological order.

For purposes of this analysis, we will work with the most recent full year’s worth of data, the 2022 data. This requires downloading the 12 files which constitute the 2022 collection. We will store the original csv files in a subdirectory called “data”.

if(!DOWNLOADED) {
  src_dir <- getwd()
  dir.create("data")
  setwd(paste0(src_dir,"/data"))

  for(i in 1:12){
    padded_i <- str_pad(i, 2, pad="0")
    temp_url <- sprintf(
      "https://divvy-tripdata.s3.amazonaws.com/2022%s-divvy-tripdata.zip",
      padded_i)
    destfile <- sprintf("2022%s-divvy-tripdata.zip", padded_i)
    download.file(temp_url, destfile=destfile)
    unzip(destfile)
    file.remove(destfile)
    file.remove("__MACOSX")
  }
}

General Characteristics of the Data

Even before opening a single file, we can make a couple of observations about the nature of the data we have to work with. This would constitute first party data as (in our fictional scenario) this is our company’s data. In a real-world scenario, we would ensure that we are retrieving the data directly from a company source and verifying with data engineers or other analysts, that this is the correct and up-to-date data. Nonetheless, we should treat this data as any other, and perform checks both statistical and visual to ensure data integrity and catch errors. Simple storage in a sub-directory for our analysis is fine for these purposes; as we combine data and produce a larger file for the full year’s analysis, we will want to follow good naming conventions and maintain a changelog.

We will be reviewing a full year’s worth of data, which should negate the bias that might arise from seasonal behavior (though we will want to consider seasonal trends in our analysis). Moreover, the data is from 2022, which is both recent (always important, but particularly the case for bike-sharing which is a more recent and growing trend) and after the major COVID shutdowns, so we should see behavior that is representative of the present behavior of Cyclistic customers. It is worth acknowledging that this data is, of course, limited to Cyclistic customers - and thereby, bike-sharing in a purely Chicago context - so we would need to be cautious in generalizing outside of the company or to other localities, but it should be appropriate given our task of considering differences between Cyclistic casual riders and members.

Inspection of the Data

To dig further, it is time to actually open the files and begin to look for issues, and compile a file with the full year’s worth of data.

One of the files is named “publictripdata” instead of “tripdata”, but we can quickly make that consistent:

if(!DOWNLOADED){
  file.rename("data/202209-divvy-publictripdata.csv",
              "data/202209-divvy-tripdata.csv")
}
tripdata_by_month <- list()
for(i in 1:12){
  padded_i <- str_pad(i, 2, pad="0")
  csv_name <- sprintf("data/2022%s-divvy-tripdata.csv", padded_i)
  tripdata_by_month[[i]] <- read_csv(csv_name)
}
## Rows: 103770 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 115609 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 284042 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 371249 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 634858 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 769204 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 823488 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 785932 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 701339 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 558685 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 337735 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 181806 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

The nice thing about reading in csv files with readr is we get a quick summary, giving rows, columns, the delimiter detected, the titles of columns and their detected types. Nonetheless, before we try to combine them into a single dataframe, let’s check all column names are identical.

# Check that all column names are identical
trip_colnames <- colnames(tripdata_by_month[[1]])
print(paste("All column names identical:",
  all(sapply(tripdata_by_month, function(x) {identical(colnames(x), trip_colnames)}))))
## [1] "All column names identical: TRUE"

Let’s also take a look at the details of at least one month’s data. It would be best to do this for every month but for time and space considerations, plus the fact that we can perform some checks better in a visual fashion, we can just run it for January:

skim_without_charts(tripdata_by_month[[1]])
Data summary
Name tripdata_by_month[[1]]
Number of rows 103770
Number of columns 13
_______________________
Column type frequency:
character 7
numeric 4
POSIXct 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
ride_id 0 1.00 16 16 0 103770 0
rideable_type 0 1.00 11 13 0 3 0
start_station_name 16260 0.84 10 53 0 758 0
start_station_id 16260 0.84 3 44 0 758 0
end_station_name 17927 0.83 10 51 0 724 0
end_station_id 17927 0.83 3 44 0 724 0
member_casual 0 1.00 6 6 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
start_lat 0 1 41.90 0.05 41.65 41.88 41.89 41.93 45.64
start_lng 0 1 -87.65 0.05 -87.83 -87.66 -87.64 -87.63 -73.80
end_lat 86 1 41.90 0.05 41.65 41.88 41.90 41.93 42.07
end_lng 86 1 -87.65 0.03 -87.83 -87.66 -87.64 -87.63 -87.52

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
started_at 0 1 2022-01-01 00:00:05 2022-01-31 23:58:37 2022-01-16 11:50:04 100315
ended_at 0 1 2022-01-01 00:01:48 2022-02-01 01:46:16 2022-01-16 12:04:35 100047

This output gives us an overall summary of the dataframe, and then statistics for the character type columns, numeric type columns and finally date type columns.

To inspect some values, let’s use glimpse:

glimpse(tripdata_by_month[[1]])
## Rows: 103,770
## Columns: 13
## $ ride_id            <chr> "C2F7DD78E82EC875", "A6CF8980A652D272", "BD0F91DFF7…
## $ rideable_type      <chr> "electric_bike", "electric_bike", "classic_bike", "…
## $ started_at         <dttm> 2022-01-13 11:59:47, 2022-01-10 08:41:56, 2022-01-…
## $ ended_at           <dttm> 2022-01-13 12:02:44, 2022-01-10 08:46:17, 2022-01-…
## $ start_station_name <chr> "Glenwood Ave & Touhy Ave", "Glenwood Ave & Touhy A…
## $ start_station_id   <chr> "525", "525", "TA1306000016", "KA1504000151", "TA13…
## $ end_station_name   <chr> "Clark St & Touhy Ave", "Clark St & Touhy Ave", "Gr…
## $ end_station_id     <chr> "RP-007", "RP-007", "TA1307000001", "TA1309000021",…
## $ start_lat          <dbl> 42.01280, 42.01276, 41.92560, 41.98359, 41.87785, 4…
## $ start_lng          <dbl> -87.66591, -87.66597, -87.65371, -87.66915, -87.624…
## $ end_lat            <dbl> 42.01256, 42.01256, 41.92533, 41.96151, 41.88462, 4…
## $ end_lng            <dbl> -87.67437, -87.67437, -87.66580, -87.67139, -87.627…
## $ member_casual      <chr> "casual", "casual", "member", "casual", "member", "…

We can see we have a ride id, a rideable type (the type of vehicle, such as “electric_bike” or “classic_bike”), the start and end times of rides (note that we actually have an end time in February, the “max” time for the January data), start and end station names and ids, start and end latitudes and longitudes, and whether or not the rider was a member or a “casual” rider. Of course, we also have the month column we added in earlier.

We can quickly check the unique values for rideable_type and member_casual:

print(unique(tripdata_by_month[[1]]$rideable_type))
## [1] "electric_bike" "classic_bike"  "docked_bike"
print(unique(tripdata_by_month[[1]]$member_casual))
## [1] "casual" "member"

NOTE Two other items jump out here:

  • We have missing values for station names and ids, as well as trip end latitude and longtidue…

  • We have 103770 rows and 103770 unique ride ids, so all seems well for January, but we should check this holds for all other months.

For our final step, let’s combine the dataframes into one:

all_tripdata <- bind_rows(tripdata_by_month)

We can save this result as a file, so we don’t need to repeat these steps in the future.

if(REWRITE) {
  write_csv(all_tripdata, "2022-divvy-tripdata-V01.csv")
}

3. Process

Now we can begin work on all of our 2022 data, getting some sense of the nature and quality of the data, extracting additional features (columns) and cleaning data as need be.

First, we can reload the data, if needed.

defined <- ls()
if(!("all_tripdata" %in% defined)) {
  all_tripdata <- read_csv("2022-divvy-tripdata-V01.csv")
}

Next, let’s take a quick look at the top or head of the dataset for a refresh on content.

head(all_tripdata)

From this we can see we have 13 columns, how about the number of rows for 2022 combined? Again, we can use skim_without_charts to check this and retrieve a lot more information all in one call.

skim_without_charts(all_tripdata)
Data summary
Name all_tripdata
Number of rows 5667717
Number of columns 13
_______________________
Column type frequency:
character 7
numeric 4
POSIXct 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
ride_id 0 1.00 16 16 0 5667717 0
rideable_type 0 1.00 11 13 0 3 0
start_station_name 833064 0.85 7 64 0 1674 0
start_station_id 833064 0.85 3 44 0 1313 0
end_station_name 892742 0.84 9 64 0 1692 0
end_station_id 892742 0.84 3 44 0 1317 0
member_casual 0 1.00 6 6 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
start_lat 0 1 41.90 0.05 41.64 41.88 41.90 41.93 45.64
start_lng 0 1 -87.65 0.03 -87.84 -87.66 -87.64 -87.63 -73.80
end_lat 5858 1 41.90 0.07 0.00 41.88 41.90 41.93 42.37
end_lng 5858 1 -87.65 0.11 -88.14 -87.66 -87.64 -87.63 0.00

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
started_at 0 1 2022-01-01 00:00:05 2022-12-31 23:59:26 2022-07-22 15:03:59 4745862
ended_at 0 1 2022-01-01 00:01:48 2023-01-02 04:56:45 2022-07-22 15:24:44 4758633

It looks like all told, we have over 5 million trips to analyze! That’s a nice sample size, one which would likely be overwhelming for spreadsheet software (Excel / Google Sheets) so it’s a good thing we are using R (though SQL would also be a good option here).

Let’s look over the results column by column to get to know the data better.

Categorical Data

ride_id

These are 16 character long ids, and it there are (thankfully) as many unique ids as there are rows in the data.

rideable_type

These give us the actual type of vehicles used; let’s see a quick breakdown by type.

table(all_tripdata$rideable_type)
## 
##  classic_bike   docked_bike electric_bike 
##       2601214        177474       2889029

Classic and electric bikes clearly make up the vast majority of the data. The Divvy website doesn’t explain what a docked bike is, but it sounds like this may not be a normal ride? We will revisit this below.

Station names and ids

Per the names, these are the names (strings) of stations and the ids assigned to them. Around 15% of the data is missing here; if we peek ahead, we can see that we have the majority of latitude and longitude data, so it is certainly not a good idea to simply discard rows missing these values, particularly when, if desired, we may be able to recover the correct values using latitude and longitude. The frequency of missingness seems consistent across name and id, so if we’re missing one, we are missing the other, but we can also check that to be 100% sure.

print(paste("Number missing both start name and id:",
  nrow(all_tripdata %>%
    filter(is.na(start_station_name) & is.na(start_station_id)))))
## [1] "Number missing both start name and id: 833064"
print(paste("Number missing both end name and id:",
  nrow(all_tripdata %>%
    filter(is.na(end_station_name) & is.na(end_station_id)))))
## [1] "Number missing both end name and id: 892742"

Interestingly, there are considerably more names than ids, and a few more end ids than start ids. If we have time, we can revisit this to see if we have misspelled (and therefore duplicated) station names, but this assumes we want to make use of the names.

member_casual

This identifies whether or not the ride was by a member or a casual rider (hence only 2 unique values); let’s check the breakdown.

table(all_tripdata$member_casual)
## 
##  casual  member 
## 2322032 3345685
prop.table(table(all_tripdata$member_casual))
## 
##    casual    member 
## 0.4096944 0.5903056

Excellent, while the data does skew in favor of members (~60/40), we also have quite a sample of casual riders.

Numerical and Date data

Start and end latitude and longitudes

What these offer is no surprise. As mentioned above, there is a small amount of missingness for end locations; once we have considered other information we can extract, we’ll likely want to come back and remove any remaining such cases given that we want full rides and that there are so few cases.

Start and end datetimes

Again, per the name, these give us full datetime (year, month, day, hour, minute, second) for a ride’s start and stop. As we saw with the January data, the max value actually exceeds 2022, as least one person rode into the new year! Thankfully, there is no missingness here; this is great news given that we can make use of these columns to calculate ride durations (not present in these files). We can also unpack these further to ask questions about daily or monthly ride patterns.

Missingness overview

Before we calculate additional columns, we can get a quick snapshot of what the missingness looks like in the data, and then revisit it once we’ve executed any deletion based on missingness or other problematic values. To do this, we can use the wonderful naniar package. Note that there is a known issue - also see here - with naniar producing an extra blank plot in the previewed html output, so we should only pay attention to the second plot.

gg_miss_upset(all_tripdata, nsets=6)

This captures in a nice visual fashion what we saw above and more: station names and ids are always missing together (in the sense of both start or end station name and id will be missing, not just name or id). Some cases have just starts missing, others ends, some both, and there is a set of almost six thousand cases where the end station information will be unrecoverable, given that latitude and longitude information is also missing.

Calculated Data And Cleaning

Now we can add in additional columns based on the datetime columns. We don’t need to account for year, as this is all 2022 data. In addition to drawing on inspiration from the blog mentioned at the top of this notebook for generating new columns, this blog post inspired me to also include hours, so we can consider riding habits during different parts of the day.

all_tripdata <- all_tripdata %>%
  mutate(
    ride_datetime = as_datetime(started_at),
    month = month(ride_datetime, label=T),
    day = day(ride_datetime),
    day_name = wday(ride_datetime, label=T),
    hour = hour(ride_datetime)
  )
head(all_tripdata)

And let’s add ride durations:

# Note for subsequent work, we should convert the results of difftime, instances
# of class "difftime" to numeric; specifically we'll use integers as we don't
# need fractions of seconds, and can then easily check the number of zero
# length trips
all_tripdata <- all_tripdata %>%
  mutate(ride_length = as.integer(difftime(ended_at,started_at), units="secs"))
head(all_tripdata)

Now let’s perform a couple of checks and start doing some cleaning. Calling summary on ride_length gives us a place to start.

summary(all_tripdata %>% select(ride_length))
##   ride_length     
##  Min.   :-621201  
##  1st Qu.:    349  
##  Median :    617  
##  Mean   :   1167  
##  3rd Qu.:   1108  
##  Max.   :2483235

Clearly there are negative durations present, which should not be preserved. Communication with other data engineers or analysts may be able to resolve where these came from and prevent them in the future, but for now we will want to see how many such cases are present. Moreover, it stands to reason that there are likely also zero length cases, and, given that this data has these errors (which are described as having been removed on Divvy’s website), then we should also check for another issue they address: trips less than 60 seconds long.

print(paste("Number of rows with negative ride length:",
            nrow(all_tripdata %>% filter(ride_length < 0))))
## [1] "Number of rows with negative ride length: 100"
print(paste("Number of rows with empty ride length:",
            nrow(all_tripdata %>% filter(ride_length == 0))))
## [1] "Number of rows with empty ride length: 431"
print(paste("Number of rows with positive ride length less than 60 seconds:",
            nrow(all_tripdata %>% filter(ride_length > 0 & ride_length < 60))))
## [1] "Number of rows with positive ride length less than 60 seconds: 120558"

The blog which is of some inspiration to this analysis mentions a separate but related issue, where some rides start at station “HQ QR”, when the ride was in fact an inspection. However, the below shows that there are no such cases in the 2022 data

nrow(all_tripdata %>% filter(start_station_name == "HQ QR"))
## [1] 0

Let’s delete rows with bad trip durations, creating a new dataframe with the result:

start_count <- nrow(all_tripdata)
all_tripdata_v2 <- all_tripdata %>%
  filter(ride_length >= 60)
print(paste("Removed", start_count - nrow(all_tripdata_v2), "rows"))
## [1] "Removed 121089 rows"
print(paste("Number of rows now missing end lat/lon:",
  nrow(all_tripdata_v2 %>%
    filter(is.na(end_lat) & is.na(end_lng)))))
## [1] "Number of rows now missing end lat/lon: 5856"

Interestingly, this shows us that focusing on duration alone did not address the cases with missing end latitude and longitude. Let’s remove those now:

start_count <- nrow(all_tripdata_v2)
all_tripdata_v2 <- all_tripdata_v2 %>%
  filter(!(is.na(end_lat) & is.na(end_lng)))
print(paste("Removed", start_count - nrow(all_tripdata_v2), "rows"))
## [1] "Removed 5856 rows"

Checking the distribution of rideable_type after these removals shows that “docked_bike” does not in fact correspond to an erroneous ride type. In fact, this blog suggests “docked_bike” might be an older name for “classic_bike”. Moreover, the number of these instances is comparatively small, so we will leave them as is.

table(all_tripdata_v2$rideable_type)
## 
##  classic_bike   docked_bike electric_bike 
##       2559448        173344       2807980

Finally, let’s revisit the missingness overview.

gg_miss_upset(all_tripdata_v2, nsets=6)

Here we write out the intermediate result after our processing and cleaning.

if(REWRITE) {
  write_csv(all_tripdata_v2, "2022-divvy-tripdata-V02.csv")
}

In addition, for resuming our work in this notebook it is best to save out the data as an RDS object. Otherwise, if we try to reload the data only from csv, we will need to do extra work to convert day_name and month back into ordered form, as they will be interpreted as just characters.

if(REWRITE) {
  saveRDS(all_tripdata_v2, "2022-divvy-tripdata-V02.RDS")
}

Additional Cleaning and Calculations - 9/29/2023

Work on the analysis phase naturally reveals additional problems missed during initial processing, as well as additional needs. To improve on the initial processing, let’s do the following, beginning with reading in the data.

all_tripdata_v3 <- readRDS("2022-divvy-tripdata-V02.RDS")

First, the number of the day in the month was initially included for considering trends evolving over the course of a month, but this is unlikely to yield any significant patterns, so we can drop day for now. Second, analyzing trends by hour requires using both start and end hours, so let’s rename hour to the more accurate indicator of start_hour and also include an end hour. In a similar vein, ride_datetime was an extraneous column as started_at and ended_at were already interpreted as datetimes, and we should also rename our day_name and month columns to indicate that these are for the start of the ride. Third, initially we left ride duration in seconds, but consider the following.

summary(all_tripdata_v3$ride_length)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      60     364     630    1001    1122 2061244

Examining these numbers requires a lot of mental arithmetic. Now, in minutes, broken down by rider type:

all_tripdata_v3 %>%
  mutate(ride_length=ride_length/60) %>%
  group_by(member_casual) %>%
  summarize(min=min(ride_length), max=max(ride_length), mean=mean(ride_length),
            median=median(ride_length))

This is a lot easier to read but the max for casual rides clearly evidences a problem of outliers. There are 1440 minutes in a day, so the above shows us that the max any member “rode” was for a little over a day; for casual riders the max is around 24 days… Let’s check how many rides overall were for casual members, and then how many of those exceeded the maximum member ride duration. We can check the number of cases exceeding a day.

total_casual <- nrow(all_tripdata_v3 %>% filter(member_casual == "casual"))
total_casual_exceeding_member_max <- nrow(
  all_tripdata_v3 %>%
    filter(member_casual == "casual" & (ride_length / 60.0) > 1500))
total_casual
## [1] 2268975
total_casual_exceeding_member_max
## [1] 66

Out of over 2 million casual rides, only 66 exceeded the member max (a little over a day). Thus, for our fourth change, we drop these comparatively few cases that are unlikely to tell us much about casual riders, capping duration at 1500 minutes (a little over the observed maximum member ride duration).

The below accomplishes these 4 changes in what is basically a one-liner. dyplr is pretty neat!

start_count <- nrow(all_tripdata_v3)
all_tripdata_v3 <- all_tripdata_v3 %>%
  select(-c(day, ride_datetime)) %>%
  rename(start_hour=hour, start_day_name=day_name, start_month=month) %>%
  mutate(end_hour = hour(ended_at),
         ride_len_mins = ride_length / 60.0) %>%
  select(-ride_length) %>%
  filter(ride_len_mins <= 1500)
print(paste("Removed", start_count - nrow(all_tripdata_v3), "rows"))
## [1] "Removed 66 rows"
head(all_tripdata_v3)

An additional change needed for considering longitude and latitude data is removing erroneous values in these columns. Previously we discarded cases with missing values, but a closer look back at the ranges of values these take on reveals problems even where we “have” values:

all_tripdata_v3 %>%
  select(start_lat,end_lat,start_lng, end_lng) %>%
  skim_without_charts()
Data summary
Name Piped data
Number of rows 5540706
Number of columns 4
_______________________
Column type frequency:
numeric 4
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
start_lat 0 1 41.90 0.05 41.64 41.88 41.90 41.93 45.64
end_lat 0 1 41.90 0.07 0.00 41.88 41.90 41.93 42.37
start_lng 0 1 -87.65 0.03 -87.84 -87.66 -87.64 -87.63 -73.80
end_lng 0 1 -87.65 0.10 -88.14 -87.66 -87.64 -87.63 0.00

We have a zero as the the minimum end latitude and as the maximum end longitude (per the maps::map() documentation, longitudes to the west of Greenwich are negative); the maximum start latitude is a line north of Illinois and the maximum start longitude is east of Illinois. Without more precise knowledge of the specific extent of Divvy’s network other than the map provided here, let’s check how many cases violate the boundaries of Illinois, referring to the bounding box information here:

y_lat_min <- 36.970298
y_lat_max <- 42.508481
x_lng_min <- -91.513079
x_lng_max <- -87.494756
violating_case_num <- nrow(all_tripdata_v3 %>%
  filter(start_lng < x_lng_min | end_lng < x_lng_min |
    start_lng > x_lng_max | end_lng > x_lng_max |
    start_lat < y_lat_min | end_lat < y_lat_min |
    start_lat > y_lat_max | end_lat > y_lat_max))
violating_case_num
## [1] 12

Thankfully we only have 12 cases, and can discard these without much concern.

start_count <- nrow(all_tripdata_v3)
all_tripdata_v3 <- all_tripdata_v3 %>%
    filter(!(start_lng < x_lng_min | end_lng < x_lng_min |
    start_lng > x_lng_max | end_lng > x_lng_max |
    start_lat < y_lat_min | end_lat < y_lat_min |
    start_lat > y_lat_max | end_lat > y_lat_max))
print(paste("Removed", start_count - nrow(all_tripdata_v3), "rows"))
## [1] "Removed 12 rows"

With these columns in better shape, we can now create a column for distance based on calculations with the start and end coordinates. To do this we can use the geosphere package’s distGeo for an accurate distance calculation between starting and ending longitude, latitude pairs; results are given in meters. These distances will be somewhat unreliable for a couple of reasons:

  • Users may not take the most optimal route
  • “Straight line” distance will be inappropriate given city blocks and obstacles like buildings
  • A ride may start and end at the same station, so the distance will appear to be 0

Getting these calculations to work were very tricky, as this a row-wise calculation, where 2 groups of 2 columns each need to be formed, so accomplishing it proved very difficult. I am leaving some prior attempts which either exhausted memory or took too long until I found this very helpful stack overflow thread which seems to have started in a similar place and the author found a solution which had not occurred to me but reduced processing time from endless minutes to only a few seconds.

# EXHAUSTS MEMORY
# all_tripdata_v3$ride_dist_meters <- distm(
#   cbind(all_tripdata_v3$start_lng, all_tripdata_v3$start_lat),
#   cbind(all_tripdata_v3$end_lng, all_tripdata_v3$end_lat)
# )[1,]

# Runs "indefinitely"
# all_tripdata_v3 <- all_tripdata_v3 %>%
#   rowwise() %>%
#   mutate(ride_dist_meters = distGeo(
#     c_across(c(start_lng, start_lat)), c_across(c(end_lng, end_lat))))

# Encounters a vector length error; postscript - this is because this would
# quite literally concatenate the 2 columns end after end for 1 very long
# vector!
# all_tripdata_v3 <- all_tripdata_v3 %>%
#   mutate(ride_dist_meters=distGeo(
#     c(start_lng, start_lat), c(end_lng, end_lat)))

all_tripdata_v3 <- all_tripdata_v3 %>%
  mutate(ride_dist_meters = distGeo(
    cbind(start_lng, start_lat), cbind(end_lng, end_lat)
  ))

Now we can write out this refined dataset.

if(REWRITE) {
  write_csv(all_tripdata_v3, "2022-divvy-tripdata-V03.csv")
  saveRDS(all_tripdata_v3, "2022-divvy-tripdata-V03.RDS")
}

4. Analyze

Again, we can reload the data, if needed.

defined <- ls()
if(!("final_data" %in% defined)) {
  final_data <- readRDS("2022-divvy-tripdata-V03.RDS")
}

Member types

Now that we have our cleaned data, let’s first look at the numbers and proportions of rides by rider types:

table(final_data$member_casual)
## 
##  casual  member 
## 2268904 3271790
prop.table(table(final_data$member_casual))
## 
##    casual    member 
## 0.4094982 0.5905018

Thankfully the deletions executed in the cleaning phase have not fundamentally altered the proportions of rides.

Ride Durations

As a first step, we can consider the minimum, maximum, mean and median ride lengths. Let’s look at this overall…

summary(final_data$ride_len_mins)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    1.000    6.067   10.500   16.605   18.700 1499.950

…and now broken down by casual riders vs. members.

final_data %>%
  group_by(member_casual) %>%
  summarize(min=min(ride_len_mins), max=max(ride_len_mins), mean=mean(ride_len_mins),
            median=median(ride_len_mins))

The fact that the mean is significantly greater than the median overall and for both groups suggests there is considerable rightward skew, with the mean being inflated by outlier long rides.

Inspecting this visually is tricky even accounting for major outliers in cleaning, given the very broad range of values. One option is to constrain the range further to get a better sense of distributions.

final_data %>%
  filter(ride_len_mins <= 120) %>%
  ggplot(aes(x=ride_len_mins)) + geom_density(aes(fill=member_casual), alpha=0.5)

This plot does confirm that the distribution for members has a sharper peak covering the range of several minute rides, whereas casual riders clearly skew towards longer rides. We can hypothesize that this may reflect ride motivations: members may be primarily using rides for short trips to and from work, or social gatherings, whereas more casual riders may be using bikes for longer outings (an end in themselves).

Day of the Week Behavior

A quick starting place for understanding behavior by day is to examine the frequency of rides by day, broken into casual and member rides. NOTE: this plot and others like it below require careful interpretation, given that the distribution of member rides to casual rides is not 50/50, but in fact closer to 60/40. That said, we can still get a feeling for the different distributions side by side.

ggplot(final_data, aes(x=start_day_name)) +
  geom_bar(aes(fill=member_casual), position="dodge")

The above gives us some indication that more casual rides occur on weekends than weekdays; members rides are somewhat more evenly distributed but do emphasize weekday use. Again, this does suggest difference of reasons for usage.

Let’s look at the fine-toothed breakdown of statistics by day. It turns out that the minimum allowed (1 minute) occurs every day of the week, and values close to the max (again we limit our analysis to the max member ride length), so let’s just retrieve mean and median. We group by start_day_name first so as to compare values day by day.

final_data %>%
  group_by(start_day_name, member_casual) %>%
  summarize(mean(ride_len_mins), median(ride_len_mins))

We can also examine this visually (let’s just examine median):

final_data %>%
  group_by(start_day_name, member_casual) %>%
  summarize(median=median(ride_len_mins)) %>%
  ggplot(aes(x=start_day_name, y=median)) +
  geom_col(aes(fill=member_casual), position="dodge")

Overall, the actual per-ride statistics conform to the picture given by ride counts per day (as well as the raw ride statistics we examined before), where overall casual users go for longer rides and members and casual riders differ on how they behave by day. Member ride statistics don’t vary as much day by day (though durations do increase a little on weekends, perhaps when rides can be more leisurely), whereas casual rides do shorten by multiple minutes (e.g. consider Sunday casual median of ~15 minutes vs. Tuesday casual median of ~12 minutes). Geographic and distance traveled information may help provide more information on what is happening in the casual group, but this does suggest that there may be multiple groups within casual riders:

  • a (perhaps small) group that are more likely to ride on any day of the week and who are responsible for the shorter, consistent behaviors on weekdays;

  • and a likely larger group who primarily ride for leisure on weekends. The first group would be the ideal targets of a campaign for converting casual riders to members!

Monthly/Seasonal Behavior

Again, for monthly behavior as for daily behavior, we can by visualizing the counts of rides by month (heeding the aforementioned warning about the proportions of casual rides to member rides).

ggplot(final_data, aes(x=start_month)) +
  geom_bar(aes(fill=member_casual), position="dodge")

Unsurprisingly, the months of December, January and February are noticeably less popular for members and casual riders, as these are winter months in North America, and many will seek other options for transportation in the face of low temperatures, snow and ice. Warmer and safer weather months certainly see a lot more bike usage, seeing increases by a factor of 4+. Yet, members again seem to evidence more consistency (as with their ride durations and use across the week), having a less peaked distribution and in fact the months with maximum use for are later than for casual riders.

Let’s consider statistics by month:

final_data %>%
  group_by(start_month, member_casual) %>%
  summarize(mean=mean(ride_len_mins), median=median(ride_len_mins))

Again, we can also visualize this (for the median):

final_data %>%
  group_by(start_month, member_casual) %>%
  summarize(median=median(ride_len_mins)) %>%
  ggplot(aes(x=start_month, y=median)) +
  geom_col(aes(fill=member_casual), position="dodge")

As with ridership by day, it seems there are two groups of riders among casual rides, and this is worth mining further.

Hourly Behavior

Days were an intuitive starting place, and months give us a higher level overview, but we can also zoom in further to consider hours of usage. It is an interesting avenue to mine, but is somewhat difficult to review and plot, given that we want to consider counts by start and end hour pairs, and our analysis should be flexible enough to consider where rides may exceed the boundary of an hour.

One way to examine hours of usage is to make a plot counting the frequency for each combination of a ride start hour and end hour. We can easily create this 24x24 grid using ggplot’s geom_count and this has the advantage over a heatmap approach in that we can make use of both size and color here: in the below we will use size to indicates raw count, and color to capture proportion. We make separate plots for casual rides and member rides, as otherwise the plots will become too small to interpret. This stackoverflow thread was helpful in identifying how to set up our axes.

# Some setup for nice axis labels
breaks <- seq(0,23,3) # Breaks to display on the axis
hrs <- c(12,3,6,9)
# Create 12hr clock labels
labels <- c(paste0(hrs, "AM"), paste0(hrs, "PM"))
final_data %>%
  filter(member_casual == "casual") %>%
  mutate(end_hour = hour(ended_at)) %>%
  ggplot(aes(x=start_hour, y=end_hour)) +
  geom_count(aes(color=after_stat(prop))) +
  scale_x_continuous(breaks = breaks, labels=labels) +
  scale_y_continuous(breaks = breaks, labels=labels) +
  labs(title="Counts for Casual Rides")

The above does require some careful reading and interpretation. Moreover, it is complicated by the fact that you can have rides that last for multiple hours before midnight (all values above the y=x line) or even rides which carry over into the subsequent day (all values below the y=x line). Nevertheless, the plot gives a fascinating window into hourly trends:

  1. First, the fact that the majority of rides are concentrated around the y=x line, or a little above the line, confirms that the majority of rides are short. The ride begins in the hour and ends in the same hour or perhaps the next.
  2. Unsurprisingly, many rides occur in the daylight hours, but there are also clearly plenty of night owls among casual riders, with bikes being taken out and returned at all times of the day.
  3. There is something of a peak in usage around 5PM, which is somewhat unexpected given that these are casual riders. Again, this suggests that there is likely a group among them that do use these bikes somewhat regularly, perhaps for commuting home, and that would constitute possible converts to Cyclistic members.
final_data %>%
  filter(member_casual == "member") %>%
  mutate(end_hour = hour(ended_at)) %>%
  ggplot(aes(x=start_hour, y=end_hour)) +
  geom_count(aes(color=after_stat(prop))) +
  scale_x_continuous(breaks = breaks, labels=labels) +
  scale_y_continuous(breaks = breaks, labels=labels) +
  labs(title="Counts for Member Rides")

The story for members has some clear differences (though as with many such plots, we need to heed the difference in ride counts between members and casual riders).

  1. The areas far above and below the y=x line are much more sparse, while the y=x line (and a little above) itself remains a strong concentration. This suggests more consistent, regular usage with less instances of bikes being out for many hours or even into the next day.
  2. We see an even stronger concentration around 5PM for members than was present for casual rides, as well as a considerable number in the morning, around 8AM. This would be consistent with use for rides to or from work.

Bike Type Usage

We should also examine which types of bikes casual riders are using compared to members.

ggplot(final_data, aes(x=rideable_type)) +
  geom_bar(aes(fill=member_casual), position="dodge")

This quick glance (a) requires care again because of the different numbers of casual rides vs member rides and (b) is muddled by the presence of the docked bike type; let’s try this again counting docked_bike as a classic_bike (per the blog mentioned in processing.

final_data %>%
  mutate(rideable_type = replace(
    rideable_type, rideable_type == "docked_bike", "classic_bike")) %>%
  ggplot(aes(x=rideable_type)) +
    geom_bar(aes(fill=member_casual), position="dodge")

Assuming this combining is correct, we can observe what seems to be a preference among casual riders for electric bikes. Members, on the other hand, actually lean towards the classic bikes. That said, the lack of an extreme imbalance suggests no obvious obstacle, at least as far as these preferences go, to some casual riders becoming members.

Distance Behavior

Similar to our analysis of ride durations, let’s try examining distances. Here we will need to limit our analysis to non-zero distances for, as mentioned in the processing stage, we could have a ride that starts and stops at the same station, and that would look as if the ride had gone nowhere. But only considering greater than zero will not be enough to capture cases which yield only a few inches or feet, so let’s limit the results to American football field length: 100 yards (between endzones) or

football_length_meters <- 100 * 0.9144

Let’s check how many rows this will exclude and how it will affect each rider group:

final_data %>%
  group_by(member_casual) %>%
  filter(ride_dist_meters < football_length_meters) %>%
  summarize(count=n())

These counts aren’t great but not the end of the world given our sample size, though we must be cautious because this exclusion does seem to affect casual rides more than member rides and this is the opposite of what we would expect based only on the overall rider type proportions (60/40 members to casual riders).

final_data %>%
  filter(ride_dist_meters >= football_length_meters) %>%
  summarize(filtered_count=n(), min=min(ride_dist_meters),
            max=max(ride_dist_meters), mean=mean(ride_dist_meters),
            median=median(ride_dist_meters))

…and now broken down by casual riders vs. members.

final_data %>%
  filter(ride_dist_meters >= football_length_meters) %>%
  group_by(member_casual) %>%
  summarize(min=min(ride_dist_meters), max=max(ride_dist_meters),
            mean=mean(ride_dist_meters), median=median(ride_dist_meters))

As with ride durations, we again have noticeable rightward skew evidenced by the means being considerably larger than the medians.

Visualizing the distributions here will also be tricky, particularly given the outliers; let’s try looking at kilometers

final_data %>%
  filter(ride_dist_meters >= football_length_meters) %>%
  ggplot(aes(x=ride_dist_meters / 1000)) + geom_density(aes(fill=member_casual), alpha=0.5)

Given more time it would likely also be informative to consider distance in the context of days of the week or monthly/seasonal behavior,as we did for ride durations, but the above appears to conform to what we gathered from ride duration data. Members tend to be more consistent in ride patterns, with a much stronger peak around shorter trips, possibly commuting to or from work or local social events, whereas casual riders are more diverse and include longer rides, possibly an event in themselves (e.g. a long weekend bike ride).

Geographic Behavior

Initially I intended to make use of the longitude and latitude data to consider differences between users by geography, mapping out proportions of casual rides to member rides at different stations. However, due to a lack of time and knowing that even if creating the initial graph might not be very complex, refining it for interpretation or any sharing would likely be very difficult given great density as you zoom out and only a limited review as you zoom in. For context, Divvy’s map gives an indication of how this might be challenging to visualize.

5. Share

Rather than be repetitive in this notebook, and have summaries in more than one location, this section will contain (a) visualizations that would be shared with stakeholders and (b) a summary of the key findings from the analysis. The next section, Act, will contain my top three recommendations based on these findings.

Visualizations

In the interest of keeping this work self-contained, we will use a small subset of the visualizations we created above, but refined to be clearer for sharing with others. Given more time, it would be worth exporting this data to Tableau and creating more visuals there, particularly for the geographic visualizations which would likely be far easier than working in R.

To facilitate just re-running this section, we can load in the data if needed.

defined <- ls()
if(!("final_data" %in% defined)) {
  final_data <- readRDS("2022-divvy-tripdata-V03.RDS")
}
final_data %>%
  mutate(rideable_type = replace(
    rideable_type, rideable_type == "docked_bike", "classic_bike")) %>%
  ggplot(aes(x=rideable_type)) +
    geom_bar(aes(fill=member_casual), position="dodge") +
    scale_x_discrete("Bike Type",
      labels=c("classic_bike"="Classic", "electric_bike"="Electric")) +
    scale_fill_discrete("Member Type",
      labels=c("casual"="Casual", "member"="Member")) +
    labs(title="Count of Rides by Bike Type",
         caption="A 'Docked Bike' is counted as a 'Classic Bike'.",
         x="Bike Type", y="Count", fill="Member Type")

ggplot(final_data, aes(x=start_day_name)) +
  geom_bar(aes(fill=member_casual), position="dodge") +
  scale_fill_discrete("Member Type",
                      labels=c("casual"="Casual", "member"="Member")) +
  labs(title="Count of Rides by Day",
       x="Day", y="Count", fill="Member Type")

ggplot(final_data, aes(x=start_month)) +
  geom_bar(aes(fill=member_casual), position="dodge") +
  scale_fill_discrete("Member Type",
    labels=c("casual"="Casual", "member"="Member")) +
  labs(title="Count of Rides by Month",
      x="Month", y="Count", fill="Member Type")

# Some setup for nice axis labels
breaks <- seq(0,23,3) # Breaks to display on the axis
hrs <- c(12,3,6,9)
# Create 12hr clock labels
labels <- c(paste0(hrs, "AM"), paste0(hrs, "PM"))
final_data %>%
  filter(member_casual == "casual") %>%
  mutate(end_hour = hour(ended_at)) %>%
  ggplot(aes(x=start_hour, y=end_hour)) +
  geom_count(aes(color=after_stat(prop))) +
  scale_x_continuous(breaks = breaks, labels=labels) +
  scale_y_continuous(breaks = breaks, labels=labels) +
  scale_color_gradient(low="red", high="green") +
  labs(title="Start and End Times for Casual Rides",
       subtitle="Dots give pair occurrences, with frequency and proportion shown by size and color",
       caption="The y=x concentration shows short ride duration, while points
       above y=x give multi-hour rides and points below, show trips beyond midnight.",
       x="Start hour", y="End Hour", size="Count", color="Proportion")

final_data %>%
  filter(member_casual == "member") %>%
  mutate(end_hour = hour(ended_at)) %>%
  ggplot(aes(x=start_hour, y=end_hour)) +
  geom_count(aes(color=after_stat(prop))) +
  scale_x_continuous(breaks = breaks, labels=labels) +
  scale_y_continuous(breaks = breaks, labels=labels) +
  scale_color_gradient(low="red", high="green") +
  labs(title="Start and End Times for Member Rides",
       subtitle="Dots give pair occurrences, with frequency and proportion shown by size and color",
       caption="The y=x concentration shows short ride duration, while points
       above y=x give multi-hour rides and points below, show trips beyond midnight.",
       x="Start hour", y="End Hour", size="Count", color="Proportion")

Summary

Without being reductive, it does not seem to be unfair to say that the two groups under consideration, Cyclistic members and “casual” riders (who have a single ride or day pass), live up to their designations.

  • Expected seasonal and hourly patterns exist across both groups, as less people will be interested in riding a bike in the cold, wet weather months, or during nighttime hours.
  • Nevertheless, members reflect as more consistent users both in terms of usage by day in the week and by month. They appear more likely to use bikes across the year or across the week, whereas casual rides display much more peaked distributions. For usage by day, casual rides have something of an inverted distribution to that of members, using bikes primarily on weekends. For months, their usage strongly peaks in summer.
  • Considering usage by hours, casual riders are much more likely to use bikes at any time of the day, but there is a commonality across distributions for both groups in that the majority of rides are of short duration, and both actually have considerable concentration during times many people commute home from work.
  • In addition, there seems to be a preference among casual riders for electric bikes over classic bikes, whereas for members this is the reverse.

The above said, consideration of the overlap between distributions - e.g. observing the common usage around times typically used for commuting from work as well as the fact that casual use does not drop to zero on weekdays or during the coldest months - suggests that there may indeed be a group of riders among casual riders, who might consider becoming members.

6. Act

The act phase in this scenario consists of two items: (1) my top three recommendations and (2) sharing this analysis with the wider world!

My recommendations are the following:

  1. Collect survey data: a key question that needs to be answered here is if casual riders have ever considered becoming a member and if so, why they have not done so already? The above findings suggest there may already be many casual riders who are a good fit, and this will give some key insight as to whether it is a matter of awareness, pricing or perhaps another issue such as unavailability of their favorite bike type. It would also be great to know exactly why they are using a bike, such as for commuting to work.
  2. Gather bike path information and conduct geographical analysis: by examining ride patterns by location, insight may be gained into how much suitability for bikes is influencing usage patterns as well as how awareness has perhaps not grown in areas where stations are new.
  3. Create a cost-benefit breakdown for users: The above two steps will help identify the exact types of casual riders that may be influenced, and the areas to focus efforts, but moving forward will require cost analysis that could produce concrete recommendations in the form of “If you take X rides in a week, you would save money by becoming a member!”

For sharing, this work will be hosted on my website (there’s a good chance that’s how you found it!), as a projects item linking to the repository and as a standalone page showing this notebook in all its beauty.

7. Future Work

Few projects are ever really complete, and this one is no exception. Limited in time by my work schedule, I tried to do a thorough analysis in R, but would like to continue this work at some point in the following ways: