#NOTES median income in chicago 2019 is $61,811 Source: City Data retrieval date: 08/JAN/2022
Formula used to convert distance travelled from LatLong: formula retrieved from: stackoverflow answered by: JMX on: Aug 13 ’15 at 0:20
which has been modified to the following for the first row:
In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime. Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members. Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders. Although the pricing flexibility helps Cyclistic attract more customers, Moreno believes that maximizing the number of annual members will be key to future growth. Rather than creating a marketing campaign that targets all-new customers, Moreno believes there is a very good chance to convert casual riders into members. She notes that casual riders are already aware of the Cyclistic program and have chosen Cyclistic for their mobility needs. Moreno has set a clear goal: Design marketing strategies aimed at converting casual riders into annual members. In order to do that, however, the marketing analyst team needs to better understand how annual members and casual riders differ, why casual riders would buy a membership, and how digital media could affect their marketing tactics. Moreno and her team are interested in analyzing the Cyclistic historical bike trip data to identify trends.
The different types of parties involved with the decision making:
Cyclistic is interested in identifying the number of riders there are with each customer type.
The client Cyclistic is interested in understanding the difference in use by the different types of members. What are the benefits for the non-annual members to become annual members. Lily Moreno already has an opinion:
“Moreno believes that maximizing the number of annual members will be key to future growth. Rather than creating a marketing campaign that targets all-new customers”
It is true that these members already have a knowledge about the product, but there may be different factors to consider, the cost of an annual membership, the distances traveled and the ease of use (pickup location and drop-off).
Three main questions that are intended to be answered are as follows:
How do annual members and casual riders use Cyclistic bikes differently?
Why would casual riders buy Cyclistic annual memberships?
How can Cyclistic use digital media to influence casual riders to become members?
Why would (or would not) someone buy the annual membership. this can be compared to the average income of a Chicago citizen. the annual membership is also billed upfront annually and may cause hesitance to purchase. A comparison of the following: (by cost, distance traveled, number of commuters per transportation type)
Single-ride passes, (aka. casual riders)
Full-day passes, (aka. casual riders)
Annual memberships, (aka. Cyclistic members)
The client has provided the sources for reliable information. (see Heading: Information Sources). The information of the year 2021 has been downloaded and put in a folder. The csv files have been extracted from the zip files and opened.
For ease of calculations and use, the “Duration” between the start time and end time has been calculated as follows:
= ended_at - started_at
Calculating the “Distance_traveled” by each trip has been done with the following formula and the result is in Km: (the formula has been verified by: latlongdata.com)
= ACOS(COS(RADIANS(90- [start_lat] )) * COS(RADIANS(90 - [end_lat] )) +SIN(RADIANS(90 - [start_lat] )) * SIN(RADIANS(90 - [end_lat] )) * COS(RADIANS( [start_lng] - [end_lng] ))) * 6371
In addition, the day of the week has been added with the following formula: (this makes Sunday = 1, Monday = 2,…)
= MATCH(TEXT( [started_at] ,“ddd”),{“Sun”,“Mon”,“Tue”,“Wed”,“Thu”,“Fri”,“Sat”},0)
The ride_id are unique variables and do not reflect the user, therefore it is not possible to identify the individual rider’s behavior and to clarify the user’s use of their membership.
The information has been directly retrieved from their servers, therefore as it is first hand data, it is credible.
Both Excel and R will be used for performing the analysis.
Below, the different codes are displayed with their corresponding explanations for reason of use, process and the results are shown below.
Below are the packages used for the analysis
# install.packages("tidyverse")
# install.packages("rmarkdown")
# install.packages("bookdown")
# install.packages("plyr") #Used for merging multiple CSV's into one dataframe
# install.packages("modeest")
# install.packages("calendR")
# install.packages("readxl")
Activating the different libraries
Importing the all “*-divvy-tripdata.csv”, the information is to be stored in “tripdata”, and the column names are illustrated
ctripdata <- list.files(path = "D:/Desktop/CPortfolio/", #location of the csv's on my device
pattern = "*.csv", full.names = TRUE) %>%
lapply(read_csv) %>%
bind_rows
Rows that have missing information to be removed
ctripdata <- drop_na(ctripdata)
Volume removed: tripdata_original = 5,643,268 rows ctripdata (after dropping NA) = 4,588,186 rows which represents a reduction of 1,055,082 rows (18.7%)
—Converting the date and time column from character data type to dateTime, and adding a separate column for just the date This will help in the filtering and sorting
ctripdata$started_on <- as.Date(ctripdata$started_at, "%d/%m/%Y %H:%M")
ctripdata$ended_on <- as.Date(ctripdata$ended_at, "%d/%m/%Y %H:%M")
ctripdata$started_at <- parse_date_time(ctripdata$started_at, "%d/%m/%Y %H:%M")
ctripdata$ended_at <- parse_date_time(ctripdata$ended_at, "%d/%m/%Y %H:%M")
#adding extra columns for day of the week, day, month, year
ctripdata <- ctripdata %>%
mutate(d_hour=hour(ymd_hms(ctripdata$started_at))) #adding a column for the hour of the day
ctripdata$d_day_of_week <- format(ctripdata$started_at, "%a") #three letter weekday is used for better presentation in charts
ctripdata$d_day <- as.numeric(format(ctripdata$started_at, "%d")) #adding a column for the day of the month
ctripdata$d_month <- as.numeric(format(ctripdata$started_at, "%m")) #adding a column for the month of the year
ctripdata$d_year <- as.numeric(format(ctripdata$started_at, "%Y")) #adding a column for the year of the trip
#adding another column for Ride_Duration to seconds
ctripdata$ride_duration_sec <- as.numeric(ctripdata$ride_duration)
#sorting the data by the Starter At in Ascending order
ctripdata <- ctripdata %>%
arrange(started_at)
#making sure that the day of the week will be displayed correctly in the plots
ctripdata$d_day_of_week <- ordered(ctripdata$d_day_of_week, levels=c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"))
for backup purposed and not needing to redo the above constantly, the file is saved: (the file is also used in MS Power BI)
Demonstrating the Min, Q1, Median, Mean, Q3 and Max in minutes & fraction.
ctripdata %>%
group_by(d_month) %>%
summarise(min_mins = as.numeric(min(ride_duration))/60,
q1_mins = as.numeric(quantile(ride_duration, 0.25))/60,
median_mins = as.numeric(median(ride_duration))/60,
mean_mins = as.numeric(mean(ride_duration))/60,
q3_mins = as.numeric(quantile(ride_duration, 0.75))/60,
max_mins = as.numeric(max(ride_duration))/60)
## # A tibble: 12 x 7
## d_month min_mins q1_mins median_mins mean_mins q3_mins max_mins
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 0 5.65 9.33 13.9 16.2 1403.
## 2 2 0 6.62 11.1 18.3 19.6 1428.
## 3 3 0 6.95 12.5 20.8 23.3 1435.
## 4 4 0 7.23 12.9 21.7 24.0 1440.
## 5 5 0 7.77 14 23.6 26.0 1438.
## 6 6 0 7.8 13.7 22.7 24.6 1438.
## 7 7 0 7.72 13.4 21.8 24.0 1426.
## 8 8 0 7.37 12.8 20.5 22.8 1434.
## 9 9 0 6.98 12.1 19.4 21.6 1439.
## 10 10 0 6.18 10.6 17.0 18.8 1437.
## 11 11 0.0167 5.28 8.87 13.5 15.2 1434.
## 12 12 0 5.15 8.6 13.1 14.7 1432.
Demonstrating the Min, Q1, Median, Mean, Q3 and Max in time format (only issue is with Mean & Max not showing a time format but a seconds format).
ctripdata %>%
group_by(d_month) %>%
summarise(min_mins = min(ride_duration),
q1_mins = quantile(ride_duration,0.25),
median_mins = median(ride_duration),
mean_mins = mean(ride_duration),
q3_mins = quantile(ride_duration,0.75),
max_mins = max(ride_duration))
## # A tibble: 12 x 7
## d_month min_mins q1_mins median_mins mean_mins q3_mins max_mins
## <dbl> <drtn> <time> <time> <drtn> <time> <drtn>
## 1 1 0 secs 05'39" 09'20" 835.9534 secs 16'13.00" 84194 secs
## 2 2 0 secs 06'37" 11'04" 1098.8898 secs 19'35.25" 85658 secs
## 3 3 0 secs 06'57" 12'28" 1245.2138 secs 23'19.00" 86086 secs
## 4 4 0 secs 07'14" 12'54" 1302.2292 secs 24'01.00" 86397 secs
## 5 5 0 secs 07'46" 14'00" 1418.9013 secs 25'57.00" 86274 secs
## 6 6 0 secs 07'48" 13'42" 1361.7368 secs 24'34.00" 86279 secs
## 7 7 0 secs 07'43" 13'27" 1307.1804 secs 23'59.00" 85581 secs
## 8 8 0 secs 07'22" 12'48" 1230.1429 secs 22'49.00" 86043 secs
## 9 9 0 secs 06'59" 12'08" 1163.2396 secs 21'35.00" 86362 secs
## 10 10 0 secs 06'11" 10'37" 1019.3046 secs 18'50.00" 86244 secs
## 11 11 1 secs 05'17" 08'52" 812.1569 secs 15'12.00" 86048 secs
## 12 12 0 secs 05'09" 08'36" 788.9096 secs 14'40.00" 85913 secs
Demonstrating the Min, Q1, Median, Mean, Q3 and Max in time format for members
ctripdata %>%
group_by(d_month) %>%
filter(member_casual == "member") %>%
summarise(min_mins = as.numeric(min(ride_duration))/60,
q1_mins = as.numeric(quantile(ride_duration, 0.25))/60,
median_mins = as.numeric(median(ride_duration))/60,
mean_mins = as.numeric(mean(ride_duration))/60,
q3_mins = as.numeric(quantile(ride_duration, 0.75))/60,
max_mins = as.numeric(max(ride_duration))/60)
## # A tibble: 12 x 7
## d_month min_mins q1_mins median_mins mean_mins q3_mins max_mins
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 0 5.35 8.75 12.0 15.0 1227.
## 2 2 0 6.2 10.2 14.7 17.2 1338.
## 3 3 0 5.87 10.0 13.7 17.3 1354.
## 4 4 0 6.08 10.4 14.2 17.9 1406.
## 5 5 0 6.17 10.6 14.3 18.1 1365.
## 6 6 0 6.27 10.6 14.1 17.9 1322.
## 7 7 0 6.12 10.4 13.8 17.4 1263.
## 8 8 0 5.97 10.1 13.5 17.1 1348.
## 9 9 0 5.73 9.73 13.1 16.5 1318.
## 10 10 0 5.27 8.83 12.0 14.9 1415.
## 11 11 0.0167 4.82 7.97 10.9 13.3 1427.
## 12 12 0 4.72 7.82 10.6 13 1231.
Demonstrating the Min, Q1, Median, Mean, Q3 and Max in time format for casual
ctripdata %>%
group_by(d_month) %>%
filter(member_casual == "casual") %>%
summarise(min_mins = as.numeric(min(ride_duration))/60,
q1_mins = as.numeric(quantile(ride_duration, 0.25))/60,
median_mins = as.numeric(median(ride_duration))/60,
mean_mins = as.numeric(mean(ride_duration))/60,
q3_mins = as.numeric(quantile(ride_duration, 0.75))/60,
max_mins = as.numeric(max(ride_duration))/60)
## # A tibble: 12 x 7
## d_month min_mins q1_mins median_mins mean_mins q3_mins max_mins
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 0.0167 7.62 12.6 22.8 23.2 1403.
## 2 2 0.0333 9.42 16.7 32.8 30.9 1428.
## 3 3 0.0167 10.4 19.4 33.0 36.7 1435.
## 4 4 0 10.2 18.8 32.7 36.2 1440.
## 5 5 0 10.8 19.7 33.7 36.9 1438.
## 6 6 0 10.4 18.1 31.3 33.0 1438.
## 7 7 0 9.92 17.2 28.8 31.0 1426.
## 8 8 0 9.52 16.4 27.3 29.5 1434.
## 9 9 0 9.17 15.7 26.4 28.6 1439.
## 10 10 0 8.37 14.4 24.6 26.4 1437.
## 11 11 0.0167 7.12 11.9 20.5 21.7 1434.
## 12 12 0.0167 6.95 11.6 20.6 21.1 1432.
Weekday mostly used by type of user.
aggregate(ctripdata$d_day_of_week ~ ctripdata$member_casual, FUN = mfv)
## ctripdata$member_casual ctripdata$d_day_of_week
## 1 casual Sat
## 2 member Wed
Average ride duration overall
Average ride duration by member/casual
agg_ctripdata <- ctripdata %>%
select(c(member_casual,ride_duration,ride_duration_sec)) #selecting only the relevant columns
aggregate(agg_ctripdata, by = list(agg_ctripdata$member_casual),FUN = mean)
## Group.1 member_casual ride_duration ride_duration_sec
## 1 casual NA 1715.505 secs 1715.505
## 2 member NA 790.481 secs 790.481
remove(agg_ctripdata)
Average ride duration per day for all users
agg_ctripdata <- ctripdata %>%
select(c(ride_duration,ride_duration_sec,d_day_of_week))
aggregate(agg_ctripdata, by = list(agg_ctripdata$d_day_of_week),FUN = mean)
## Group.1 ride_duration ride_duration_sec d_day_of_week
## 1 Sun 1509.118 secs 1509.118 NA
## 2 Mon 1155.191 secs 1155.191 NA
## 3 Tue 1037.411 secs 1037.411 NA
## 4 Wed 1007.889 secs 1007.889 NA
## 5 Thu 1011.946 secs 1011.946 NA
## 6 Fri 1132.455 secs 1132.455 NA
## 7 Sat 1434.589 secs 1434.589 NA
remove(agg_ctripdata)
Average ride duration per day for member users
agg_ctripdata <- ctripdata %>%
filter(member_casual == "member") %>%
select(c(ride_duration,ride_duration_sec,d_day_of_week)) #selecting only the relevant columns
aggregate(agg_ctripdata, by = list(agg_ctripdata$d_day_of_week),FUN = mean)
## Group.1 ride_duration ride_duration_sec d_day_of_week
## 1 Sun 911.1003 secs 911.1003 NA
## 2 Mon 763.0025 secs 763.0025 NA
## 3 Tue 742.8148 secs 742.8148 NA
## 4 Wed 747.0413 secs 747.0413 NA
## 5 Thu 740.5237 secs 740.5237 NA
## 6 Fri 767.3038 secs 767.3038 NA
## 7 Sat 888.1996 secs 888.1996 NA
remove(agg_ctripdata)
Average ride duration per day for casual users
agg_ctripdata <- ctripdata %>%
filter(member_casual == "casual") %>%
select(c(ride_duration,ride_duration_sec,d_day_of_week)) #selecting only the relevant columns
aggregate(agg_ctripdata, by = list(agg_ctripdata$d_day_of_week),FUN = mean)
## Group.1 ride_duration ride_duration_sec d_day_of_week
## 1 Sun 1970.070 secs 1970.070 NA
## 2 Mon 1748.751 secs 1748.751 NA
## 3 Tue 1569.394 secs 1569.394 NA
## 4 Wed 1483.454 secs 1483.454 NA
## 5 Thu 1464.066 secs 1464.066 NA
## 6 Fri 1592.963 secs 1592.963 NA
## 7 Sat 1851.188 secs 1851.188 NA
remove(agg_ctripdata)
For an easy comparison of the ride duration per user type on each of the different weekday.
aggregate(ctripdata$ride_duration ~ ctripdata$member_casual + ctripdata$d_day_of_week, FUN = mean)
## ctripdata$member_casual ctripdata$d_day_of_week ctripdata$ride_duration
## 1 casual Sun 1970.0704 secs
## 2 member Sun 911.1003 secs
## 3 casual Mon 1748.7511 secs
## 4 member Mon 763.0025 secs
## 5 casual Tue 1569.3943 secs
## 6 member Tue 742.8148 secs
## 7 casual Wed 1483.4542 secs
## 8 member Wed 747.0413 secs
## 9 casual Thu 1464.0664 secs
## 10 member Thu 740.5237 secs
## 11 casual Fri 1592.9626 secs
## 12 member Fri 767.3038 secs
## 13 casual Sat 1851.1882 secs
## 14 member Sat 888.1996 secs
Finding the correlation between temperatures and use of the Divvy Bikes
ctemps <- read_xlsx("D:/Desktop/CPortfolio/Chicago Temperatures.xlsx")
Correlation between temperatures and the number of trips:
trips <- ctripdata %>%
count(started_on) %>%
select(-c(started_on))
temps <- ctemps %>%
select(c(temp_max_c))
cor(trips,temps)
## temp_max_c
## n 0.8822117
remove(trips,temps)
aggregate(ctripdata$ride_duration ~ ctripdata$member_casual + ctripdata$d_day_of_week, FUN = mean)
## ctripdata$member_casual ctripdata$d_day_of_week ctripdata$ride_duration
## 1 casual Sun 1970.0704 secs
## 2 member Sun 911.1003 secs
## 3 casual Mon 1748.7511 secs
## 4 member Mon 763.0025 secs
## 5 casual Tue 1569.3943 secs
## 6 member Tue 742.8148 secs
## 7 casual Wed 1483.4542 secs
## 8 member Wed 747.0413 secs
## 9 casual Thu 1464.0664 secs
## 10 member Thu 740.5237 secs
## 11 casual Fri 1592.9626 secs
## 12 member Fri 767.3038 secs
## 13 casual Sat 1851.1882 secs
## 14 member Sat 888.1996 secs
Answering the 4 main questions provided in the Ask phase: