#NOTES median income in chicago 2019 is $61,811 Source: City Data retrieval date: 08/JAN/2022
Formula used to convert distance travelled from LatLong: formula retrieved from: stackoverflow answered by: JMX on: Aug 13 ’15 at 0:20
which has been modified to the following for the first row:
In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime. Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members. Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders. Although the pricing flexibility helps Cyclistic attract more customers, Moreno believes that maximizing the number of annual members will be key to future growth. Rather than creating a marketing campaign that targets all-new customers, Moreno believes there is a very good chance to convert casual riders into members. She notes that casual riders are already aware of the Cyclistic program and have chosen Cyclistic for their mobility needs. Moreno has set a clear goal: Design marketing strategies aimed at converting casual riders into annual members. In order to do that, however, the marketing analyst team needs to better understand how annual members and casual riders differ, why casual riders would buy a membership, and how digital media could affect their marketing tactics. Moreno and her team are interested in analyzing the Cyclistic historical bike trip data to identify trends.
The different types of parties involved with the decision making:
Cyclistic is interested in identifying the number of riders there are with each customer type.
The client Cyclistic is interested in understanding the difference in use by the different types of members. What are the benefits for the non-annual members to become annual members. Lily Moreno already has an opinion:
“Moreno believes that maximizing the number of annual members will be key to future growth. Rather than creating a marketing campaign that targets all-new customers”
It is true that these members already have a knowledge about the product, but there may be different factors to consider, the cost of an annual membership, the distances traveled and the ease of use (pickup location and drop-off).
Three main questions that are intended to be answered are as follows:
How do annual members and casual riders use Cyclistic bikes differently?
Why would casual riders buy Cyclistic annual memberships?
How can Cyclistic use digital media to influence casual riders to become members?
Why would (or would not) someone buy the annual membership. this can be compared to the average income of a Chicago citizen. the annual membership is also billed upfront annually and may cause hesitance to purchase. A comparison of the following: (by cost, distance traveled, number of commuters per transportation type)
Single-ride passes, (aka. casual riders)
Full-day passes, (aka. casual riders)
Annual memberships, (aka. Cyclistic members)
The client has provided the sources for reliable information. (see Heading: Information Sources). The information of the year 2021 has been downloaded and put in a folder. The csv files have been extracted from the zip files and opened.
For ease of calculations and use, the “Duration” between the start time and end time has been calculated as follows:
= ended_at - started_at
Calculating the “Distance_traveled” by each trip has been done with the following formula and the result is in Km: (the formula has been verified by: latlongdata.com)
= ACOS(COS(RADIANS(90- [start_lat] )) * COS(RADIANS(90 - [end_lat] )) +SIN(RADIANS(90 - [start_lat] )) * SIN(RADIANS(90 - [end_lat] )) * COS(RADIANS( [start_lng] - [end_lng] ))) * 6371
In addition, the day of the week has been added with the following formula: (this makes Sunday = 1, Monday = 2,…)
= MATCH(TEXT( [started_at] ,“ddd”),{“Sun”,“Mon”,“Tue”,“Wed”,“Thu”,“Fri”,“Sat”},0)
The ride_id are unique variables and do not reflect the user, therefore it is not possible to identify the individual rider’s behavior and to clarify the user’s use of their membership.
The information has been directly retrieved from their servers, therefore as it is first hand data, it is credible.
Both Excel and R will be used for performing the analysis.
Below, the different codes are displayed with their corresponding explanations for reason of use, process and the results are shown below.
Below are the packages used for the analysis
install.packages("tidyverse")
install.packages("rmarkdown")
install.packages("bookdown")
install.packages("plyr") # Used for merging multiple CSV's into one dataframe
install.packages("modeest")
install.packages("calendR")
install.packages('readxl')
Activating the different libraries
Importing the all “*-divvy-tripdata.csv”, the information is to be stored in “tripdata”, and the column names are illustrated
Rows that have missing information to be removed
ctripdata <- drop_na(tripdata_original)
Volume removed: tripdata_original = 5,643,268 rows ctripdata (after dropping NA) = 4,588,186 rows which represents a reduction of 1,055,082 rows (18.7%)
ctripdata <- ctripdata %>%
select(-c(start_station_id,end_station_id))
—Converting the date and time column from character data type to dateTime, and adding a separate column for just the date This will help in the filtering and sorting
ctripdata$started_on <- as.Date(ctripdata$started_at, "%d/%m/%Y %H:%M")
ctripdata$ended_on <- as.Date(ctripdata$ended_at, "%d/%m/%Y %H:%M")
ctripdata$started_at <- parse_date_time(ctripdata$started_at, "%d/%m/%Y %H:%M")
ctripdata$ended_at <- parse_date_time(ctripdata$ended_at, "%d/%m/%Y %H:%M")
#adding extra columns for day of the week, day, month, year
ctripdata$d_day_of_week <- format(ctripdata$started_at, "%a") #three letter weekday is used for better presentation in charts
ctripdata$d_day <- as.numeric(format(ctripdata$started_at, "%d"))
ctripdata$d_month <- as.numeric(format(ctripdata$started_at, "%m"))
ctripdata$d_year <- as.numeric(format(ctripdata$started_at, "%Y"))
#adding another column for Ride_Duration to seconds
ctripdata$ride_duration_sec <- as.numeric(ctripdata$ride_duration)
Verifying that all the data compiled have been added to the same column names and verifying that the duration is still a time format.
ctripdata <- ctripdata %>%
arrange(started_at) #arranging table in ascending order by "started_at"
# print(ctripdata)
# summary(ctripdata)
# str(ctripdata)
Demonstrating the Min, Q1, Median, Mean, Q3 and Max in minutes & fraction.
ctripdata %>%
group_by(d_month) %>%
summarise(min_mins = as.numeric(min(ride_duration))/60,
q1_mins = as.numeric(quantile(ride_duration, 0.25))/60,
median_mins = as.numeric(median(ride_duration))/60,
mean_mins = as.numeric(mean(ride_duration))/60,
q3_mins = as.numeric(quantile(ride_duration, 0.75))/60,
max_mins = as.numeric(max(ride_duration))/60)
## # A tibble: 12 x 7
## d_month min_mins q1_mins median_mins mean_mins q3_mins max_mins
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 0 5.65 9.33 13.9 16.2 1403.
## 2 2 0 6.62 11.1 18.3 19.6 1428.
## 3 3 0 6.95 12.5 20.8 23.3 1435.
## 4 4 0 7.23 12.9 21.7 24.0 1440.
## 5 5 0 7.77 14 23.6 26.0 1438.
## 6 6 0 7.8 13.7 22.7 24.6 1438.
## 7 7 0 7.72 13.4 21.8 24.0 1426.
## 8 8 0 7.37 12.8 20.5 22.8 1434.
## 9 9 0 6.98 12.1 19.4 21.6 1439.
## 10 10 0 6.18 10.6 17.0 18.8 1437.
## 11 11 0.0167 5.28 8.87 13.5 15.2 1434.
## 12 12 0 5.15 8.6 13.1 14.7 1432.
Demonstrating the Min, Q1, Median, Mean, Q3 and Max in time format (only issue is with Mean & Max not showing a time format but a seconds format).
ctripdata %>%
group_by(d_month) %>%
summarise(min_mins = min(ride_duration),
q1_mins = quantile(ride_duration,0.25),
median_mins = median(ride_duration),
mean_mins = mean(ride_duration),
q3_mins = quantile(ride_duration,0.75),
max_mins = max(ride_duration))
## # A tibble: 12 x 7
## d_month min_mins q1_mins median_mins mean_mins q3_mins max_mins
## <dbl> <drtn> <time> <time> <drtn> <time> <drtn>
## 1 1 0 secs 05'39" 09'20" 835.9534 secs 16'13.00" 84194 secs
## 2 2 0 secs 06'37" 11'04" 1098.8898 secs 19'35.25" 85658 secs
## 3 3 0 secs 06'57" 12'28" 1245.2138 secs 23'19.00" 86086 secs
## 4 4 0 secs 07'14" 12'54" 1302.2292 secs 24'01.00" 86397 secs
## 5 5 0 secs 07'46" 14'00" 1418.9013 secs 25'57.00" 86274 secs
## 6 6 0 secs 07'48" 13'42" 1361.7368 secs 24'34.00" 86279 secs
## 7 7 0 secs 07'43" 13'27" 1307.1804 secs 23'59.00" 85581 secs
## 8 8 0 secs 07'22" 12'48" 1230.1429 secs 22'49.00" 86043 secs
## 9 9 0 secs 06'59" 12'08" 1163.2396 secs 21'35.00" 86362 secs
## 10 10 0 secs 06'11" 10'37" 1019.3046 secs 18'50.00" 86244 secs
## 11 11 1 secs 05'17" 08'52" 812.1569 secs 15'12.00" 86048 secs
## 12 12 0 secs 05'09" 08'36" 788.9096 secs 14'40.00" 85913 secs
Demonstrating the Min, Q1, Median, Mean, Q3 and Max in time format for members
ctripdata %>%
group_by(d_month) %>%
filter(member_casual == "member") %>%
summarise(min_mins = as.numeric(min(ride_duration))/60,
q1_mins = as.numeric(quantile(ride_duration, 0.25))/60,
median_mins = as.numeric(median(ride_duration))/60,
mean_mins = as.numeric(mean(ride_duration))/60,
q3_mins = as.numeric(quantile(ride_duration, 0.75))/60,
max_mins = as.numeric(max(ride_duration))/60)
## # A tibble: 12 x 7
## d_month min_mins q1_mins median_mins mean_mins q3_mins max_mins
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 0 5.35 8.75 12.0 15.0 1227.
## 2 2 0 6.2 10.2 14.7 17.2 1338.
## 3 3 0 5.87 10.0 13.7 17.3 1354.
## 4 4 0 6.08 10.4 14.2 17.9 1406.
## 5 5 0 6.17 10.6 14.3 18.1 1365.
## 6 6 0 6.27 10.6 14.1 17.9 1322.
## 7 7 0 6.12 10.4 13.8 17.4 1263.
## 8 8 0 5.97 10.1 13.5 17.1 1348.
## 9 9 0 5.73 9.73 13.1 16.5 1318.
## 10 10 0 5.27 8.83 12.0 14.9 1415.
## 11 11 0.0167 4.82 7.97 10.9 13.3 1427.
## 12 12 0 4.72 7.82 10.6 13 1231.
Demonstrating the Min, Q1, Median, Mean, Q3 and Max in time format for casual
ctripdata %>%
group_by(d_month) %>%
filter(member_casual == "casual") %>%
summarise(min_mins = as.numeric(min(ride_duration))/60,
q1_mins = as.numeric(quantile(ride_duration, 0.25))/60,
median_mins = as.numeric(median(ride_duration))/60,
mean_mins = as.numeric(mean(ride_duration))/60,
q3_mins = as.numeric(quantile(ride_duration, 0.75))/60,
max_mins = as.numeric(max(ride_duration))/60)
## # A tibble: 12 x 7
## d_month min_mins q1_mins median_mins mean_mins q3_mins max_mins
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 0.0167 7.62 12.6 22.8 23.2 1403.
## 2 2 0.0333 9.42 16.7 32.8 30.9 1428.
## 3 3 0.0167 10.4 19.4 33.0 36.7 1435.
## 4 4 0 10.2 18.8 32.7 36.2 1440.
## 5 5 0 10.8 19.7 33.7 36.9 1438.
## 6 6 0 10.4 18.1 31.3 33.0 1438.
## 7 7 0 9.92 17.2 28.8 31.0 1426.
## 8 8 0 9.52 16.4 27.3 29.5 1434.
## 9 9 0 9.17 15.7 26.4 28.6 1439.
## 10 10 0 8.37 14.4 24.6 26.4 1437.
## 11 11 0.0167 7.12 11.9 20.5 21.7 1434.
## 12 12 0.0167 6.95 11.6 20.6 21.1 1432.
Day mostly used by type of user.
aggregate(ctripdata$d_day_of_week ~ ctripdata$member_casual, FUN = mfv)
## ctripdata$member_casual ctripdata$d_day_of_week
## 1 casual Sat
## 2 member Wed
Average ride duration by member/casual
agg_ctripdata <- ctripdata %>%
select(c(member_casual,ride_duration,ride_duration_sec)) #selecting only the relevant columns
aggregate(agg_ctripdata, by = list(agg_ctripdata$member_casual),FUN = mean)
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Group.1 member_casual ride_duration ride_duration_sec
## 1 casual NA 1715.505 secs 1715.505
## 2 member NA 790.481 secs 790.481
remove(agg_ctripdata)
for the sorting of the day of the week in the correct order and making Sunday the first day of the week
ctripdata$d_day_of_week <- ordered(ctripdata$d_day_of_week, levels=c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"))
Average ride duration per day for all users
agg_ctripdata <- ctripdata %>%
select(c(ride_duration,ride_duration_sec,d_day_of_week))
aggregate(agg_ctripdata, by = list(agg_ctripdata$d_day_of_week),FUN = mean)
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Group.1 ride_duration ride_duration_sec d_day_of_week
## 1 Sun 1509.118 secs 1509.118 NA
## 2 Mon 1155.191 secs 1155.191 NA
## 3 Tue 1037.411 secs 1037.411 NA
## 4 Wed 1007.889 secs 1007.889 NA
## 5 Thu 1011.946 secs 1011.946 NA
## 6 Fri 1132.455 secs 1132.455 NA
## 7 Sat 1434.589 secs 1434.589 NA
Average ride duration per day for member users
agg_ctripdata <- ctripdata %>%
filter(member_casual == "member") %>%
select(c(ride_duration,ride_duration_sec,d_day_of_week)) #selecting only the relevant columns
aggregate(agg_ctripdata, by = list(agg_ctripdata$d_day_of_week),FUN = mean)
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Group.1 ride_duration ride_duration_sec d_day_of_week
## 1 Sun 911.1003 secs 911.1003 NA
## 2 Mon 763.0025 secs 763.0025 NA
## 3 Tue 742.8148 secs 742.8148 NA
## 4 Wed 747.0413 secs 747.0413 NA
## 5 Thu 740.5237 secs 740.5237 NA
## 6 Fri 767.3038 secs 767.3038 NA
## 7 Sat 888.1996 secs 888.1996 NA
Average ride duration per day for casual users
agg_ctripdata <- ctripdata %>%
filter(member_casual == "casual") %>%
select(c(ride_duration,ride_duration_sec,d_day_of_week)) #selecting only the relevant columns
aggregate(agg_ctripdata, by = list(agg_ctripdata$d_day_of_week),FUN = mean)
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Group.1 ride_duration ride_duration_sec d_day_of_week
## 1 Sun 1970.070 secs 1970.070 NA
## 2 Mon 1748.751 secs 1748.751 NA
## 3 Tue 1569.394 secs 1569.394 NA
## 4 Wed 1483.454 secs 1483.454 NA
## 5 Thu 1464.066 secs 1464.066 NA
## 6 Fri 1592.963 secs 1592.963 NA
## 7 Sat 1851.188 secs 1851.188 NA
For an easy comparison of the ride duration per user type on each of the different weekday.
aggregate(ctripdata$ride_duration/60 ~ ctripdata$member_casual + ctripdata$d_day_of_week, FUN = mean)
## ctripdata$member_casual ctripdata$d_day_of_week ctripdata$ride_duration/60
## 1 casual Sun 32.83451 secs
## 2 member Sun 15.18500 secs
## 3 casual Mon 29.14585 secs
## 4 member Mon 12.71671 secs
## 5 casual Tue 26.15657 secs
## 6 member Tue 12.38025 secs
## 7 casual Wed 24.72424 secs
## 8 member Wed 12.45069 secs
## 9 casual Thu 24.40111 secs
## 10 member Thu 12.34206 secs
## 11 casual Fri 26.54938 secs
## 12 member Fri 12.78840 secs
## 13 casual Sat 30.85314 secs
## 14 member Sat 14.80333 secs