R Bike Share

Analyzing Bike Share Data

16 minute read

In this post I’m going to run through the codes used in R to answer a Bike sharing company’s (Cyclistic) question: How do annual members and casual riders use Cyclistic bikes differently?

I am tasked with the following:

1- Identify how annual cyclists and casual ones use the company’s bikes differently

2- Provide recommendations that will help converting casual riders to annual members

To answer this business question, I’ve completed the following: collect data, wrangle & combine it, cleaning & modifying, conducting decriptive analysis, data vizualization, and finally data exportation.

The public data was collected from Divvy, for a company based in Chicago. Below are the license agreement and data location links:

https://ride.divvybikes.com/data-license-agreement & https://divvy-tripdata.s3.amazonaws.com/index.html

The data used for this project was available in csv format on monthly basis from the period from July-2021 to June-2022. Each month’s data was downloaded and extracted from the zp folder, then uploaded to RStudio path. The csv files are named in this format: yyyymm-divvy-tripdata.csv

to conduct the analysis, the below packages and libraries were installed:

# installing required packages for: 

install.packages("tidyverse")     # wrangling data
install.packages("skimr")         # to provide quick descriptive analysis
install.packages("lubridate")     # wrangling data attributes
install.packages("ggplot2")       # for visualization


library(tidyverse)  # wrangle data
library(skimr)      # quick descriptive analysis
library(lubridate)  # wrangle date attributes
library(ggplot2)    # visualize data
library(readr)      # read files
library(dplyr)      # join data frames

getwd()             #displays working directory

STEP 1: Collecting data

The first step was to collect the data from the csv files and add the datasets to the working environment:

# Uploading datasets (csv files)

cyclistic_202107 <- read_csv("202107-divvy-tripdata.csv")
cyclistic_202108 <- read_csv("202108-divvy-tripdata.csv")
cyclistic_202109 <- read_csv("202109-divvy-tripdata.csv")
cyclistic_202110 <- read_csv("202110-divvy-tripdata.csv")
cyclistic_202111 <- read_csv("202111-divvy-tripdata.csv")
cyclistic_202112 <- read_csv("202112-divvy-tripdata.csv")
cyclistic_202201 <- read_csv("202201-divvy-tripdata.csv")
cyclistic_202202 <- read_csv("202202-divvy-tripdata.csv")
cyclistic_202203 <- read_csv("202203-divvy-tripdata.csv")
cyclistic_202204 <- read_csv("202204-divvy-tripdata.csv")
cyclistic_202205 <- read_csv("202205-divvy-tripdata.csv")
cyclistic_202206 <- read_csv("202206-divvy-tripdata.csv")

Then, to ensure that the datasets can be combined into one large datase, the column names and types were checked to ensure that they are matching.

# Comparing column names of each file
# the names don't have to be in order, but they must match *perfectly* to join the data


colnames(cyclistic_202107)
colnames(cyclistic_202108)
colnames(cyclistic_202109)
colnames(cyclistic_202110)
colnames(cyclistic_202111)
colnames(cyclistic_202112)
colnames(cyclistic_202201)
colnames(cyclistic_202202)
colnames(cyclistic_202203)
colnames(cyclistic_202204)
colnames(cyclistic_202205)
colnames(cyclistic_202206)

# all column names match
 >>> [1] "ride_id"            "rideable_type"      "started_at"         "ended_at"           "start_station_name"
     [6] "start_station_id"   "end_station_name"   "end_station_id"     "start_lat"          "start_lng"         
    [11] "end_lat"            "end_lng"            "member_casual"     

# inspecting data types
str(cyclistic_202107)
str(cyclistic_202108)
str(cyclistic_202109)
str(cyclistic_202110)
str(cyclistic_202111)
str(cyclistic_202112)
str(cyclistic_202201)
str(cyclistic_202202)
str(cyclistic_202203)
str(cyclistic_202204)
str(cyclistic_202205)
str(cyclistic_202206)

# all column data types match

>>> spec_tbl_df [769,204 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
     $ ride_id           : chr [1:769204] "600CFD130D0FD2A4" "F5E6B5C1682C6464" "B6EB6D27BAD771D2" "C9C320375DE1D5C6" ...
     $ rideable_type     : chr [1:769204] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
     $ started_at        : POSIXct[1:769204], format: "2022-06-30 17:27:53" "2022-06-30 18:39:52" "2022-06-30 11:49:25" "2022-06-30 11:15:25" ...
     $ ended_at          : POSIXct[1:769204], format: "2022-06-30 17:35:15" "2022-06-30 18:47:28" "2022-06-30 12:02:54" "2022-06-30 11:19:43" ...
     $ start_station_name: chr [1:769204] NA NA NA NA ...
     $ start_station_id  : chr [1:769204] NA NA NA NA ...
     $ end_station_name  : chr [1:769204] NA NA NA NA ...
     $ end_station_id    : chr [1:769204] NA NA NA NA ...
     $ start_lat         : num [1:769204] 41.9 41.9 41.9 41.8 41.9 ...
     $ start_lng         : num [1:769204] -87.6 -87.6 -87.7 -87.7 -87.6 ...
     $ end_lat           : num [1:769204] 41.9 41.9 41.9 41.8 41.9 ...
     $ end_lng           : num [1:769204] -87.6 -87.6 -87.6 -87.7 -87.6 ...
     $ member_casual     : chr [1:769204] "casual" "casual" "casual" "casual" ...
     - attr(*, "spec")=
      .. cols(
      ..   ride_id = col_character(),
      ..   rideable_type = col_character(),
      ..   started_at = col_datetime(format = ""),
      ..   ended_at = col_datetime(format = ""),
      ..   start_station_name = col_character(),
      ..   start_station_id = col_character(),
      ..   end_station_name = col_character(),
      ..   end_station_id = col_character(),
      ..   start_lat = col_double(),
      ..   start_lng = col_double(),
      ..   end_lat = col_double(),
      ..   end_lng = col_double(),
      ..   member_casual = col_character()
      .. )
     - attr(*, "problems")=<externalptr> 

STEP 2: Data wrangling and combining into a single file

to combine the data into one data frame:

# Stacking all monthly data into one big data frame

cyclistic <- bind_rows(                       cyclistic_202107,
                                              cyclistic_202108,
                                              cyclistic_202109,
                                              cyclistic_202110,
                                              cyclistic_202111,
                                              cyclistic_202112,
                                              cyclistic_202201,
                                              cyclistic_202202,
                                              cyclistic_202203,
                                              cyclistic_202204,
                                              cyclistic_202205,
                                              cyclistic_202206)

Checking that the data was combined successfully, the number of rows were compared before and after combinging the data

# to ensure that all rows have been combined the below should return 0

zero_rows = nrow(cyclistic_202107) +  
            nrow(cyclistic_202108) + 
            nrow(cyclistic_202109) + 
            nrow(cyclistic_202110) + 
            nrow(cyclistic_202111) + 
            nrow(cyclistic_202112) + 
            nrow(cyclistic_202201) + 
            nrow(cyclistic_202202) + 
            nrow(cyclistic_202203) + 
            nrow(cyclistic_202204) + 
            nrow(cyclistic_202205) + 
            nrow(cyclistic_202206) -
            nrow(cyclistic)

print(zero_rows)

  >>> [1] 0

There are some unneessar columns that can be removed without impacting the final outcome:

# longitude and latitude columns won't help answering the business question,
# so it is better to remove them and save space and processing power

cyclistic <- cyclistic %>% 
  select(-c(start_lat, start_lng, end_lat, end_lng))

STEP 3: Cleaning and adding data to prepare for analysis

To inspect a newly created table:

colnames(cyclistic)                # List of column names

  >>> [1] "ride_id"            "rideable_type"      "started_at"         "ended_at"           "start_station_name"
      [6] "start_station_id"   "end_station_name"   "end_station_id"     "member_casual"  

nrow(cyclistic)                    # How many rows are in data frame?

  >>> [1] 5900385

dim(cyclistic)                     # Dimensions of the data frame?

  >>> [1] 5900385       9

head(cyclistic)                    # See the first 6 rows of data frame.  Also tail(cyclistic)

  >>> # A tibble: 6 × 9
        ride_id        rideable_type started_at          ended_at            start_station_n… start_station_id end_station_name
        <chr>          <chr>         <dttm>              <dttm>              <chr>            <chr>            <chr>           
      1 0A1B623926EF4… docked_bike   2021-07-02 14:44:36 2021-07-02 15:19:58 Michigan Ave & … 13001            Halsted St & No…
      2 B2D5583A5A5E7… classic_bike  2021-07-07 16:57:42 2021-07-07 17:16:09 California Ave … 17660            Wood St & Hubba…
      3 6F264597DDBF4… classic_bike  2021-07-25 11:30:55 2021-07-25 11:48:45 Wabash Ave & 16… SL-012           Rush St & Hubba…
      4 379B58EAB20E8… classic_bike  2021-07-08 22:08:30 2021-07-08 22:23:32 California Ave … 17660            Carpenter St & …
      5 6615C1E4EB08E… electric_bike 2021-07-28 16:08:06 2021-07-28 16:27:09 California Ave … 17660            Elizabeth (May)…
      6 62DC2B32872F9… electric_bike 2021-07-29 17:09:08 2021-07-29 17:15:00 California Ave … 17660            Albany Ave & Bl…
      # … with 2 more variables: end_station_id <chr>, member_casual <chr>

str(cyclistic)                     # See list of columns and data types (numeric, character, etc)

  >>> tibble [5,900,385 × 9] (S3: tbl_df/tbl/data.frame)
       $ ride_id           : chr [1:5900385] "0A1B623926EF4E16" "B2D5583A5A5E76EE" "6F264597DDBF427A" "379B58EAB20E8AA5" ...
       $ rideable_type     : chr [1:5900385] "docked_bike" "classic_bike" "classic_bike" "classic_bike" ...
       $ started_at        : POSIXct[1:5900385], format: "2021-07-02 14:44:36" "2021-07-07 16:57:42" "2021-07-25 11:30:55" "2021-07-08 22:08:30" ...
       $ ended_at          : POSIXct[1:5900385], format: "2021-07-02 15:19:58" "2021-07-07 17:16:09" "2021-07-25 11:48:45" "2021-07-08 22:23:32" ...
       $ start_station_name: chr [1:5900385] "Michigan Ave & Washington St" "California Ave & Cortez St" "Wabash Ave & 16th St" "California Ave & Cortez St" ...
       $ start_station_id  : chr [1:5900385] "13001" "17660" "SL-012" "17660" ...
       $ end_station_name  : chr [1:5900385] "Halsted St & North Branch St" "Wood St & Hubbard St" "Rush St & Hubbard St" "Carpenter St & Huron St" ...
       $ end_station_id    : chr [1:5900385] "KA1504000117" "13432" "KA1503000044" "13196" ...
       $ member_casual     : chr [1:5900385] "casual" "casual" "member" "member" ...

skim_without_charts(cyclistic)     # Statistical summary of data 

  >>> ── Data Summary ────────────────────────
                                 Values   
      Name                       cyclistic
      Number of rows             5900385  
      Number of columns          9        
      _______________________             
      Column type frequency:              
        character                7        
        POSIXct                  2        
      ________________________            
      Group variables            None     

      ── Variable type: character ─────────────────────────────────────────────────────────────────────────────────────────────
        skim_variable      n_missing complete_rate min max empty n_unique whitespace
      1 ride_id                    0         1      16  16     0  5900385          0
      2 rideable_type              0         1      11  13     0        3          0
      3 start_station_name    836018         0.858   3  64     0     1293          0
      4 start_station_id      836015         0.858   3  44     0     1157          0
      5 end_station_name      892103         0.849   9  64     0     1315          0
      6 end_station_id        892103         0.849   3  44     0     1171          0
      7 member_casual              0         1       6   6     0        2          0

      ── Variable type: POSIXct ───────────────────────────────────────────────────────────────────────────────────────────────
        skim_variable n_missing complete_rate min                 max                 median              n_unique
      1 started_at            0             1 2021-07-01 00:00:22 2022-06-30 23:59:58 2021-10-27 17:35:55  4924385
      2 ended_at              0             1 2021-07-01 00:04:51 2022-07-13 04:21:06 2021-10-27 17:49:46  4924865

Analysis of he above results, we can see that there are:

1- three unique rideable_type

2- many missing station names and id’s

3- two unique values in member_casual

4- the data can only be aggregated at the ride-level, by adding additional columns, such as day, month year, we will have more opportunity to aggregate the data

5- adding ride_length as a calculated field will be beneficial for analysis

By adding columns that list the date, month, day, and year of each ride will allow us to aggregate ride data for each month, day, or year before completing these operations we can only aggregate at the ride level

this link provides more information of data formats in R:

https://www.statmethods.net/input/dates.html

cyclistic$date              <- as.Date(cyclistic$started_at)             # output will be in yyyy-mm-dd
cyclistic$day               <- format(as.Date(cyclistic$date), "%d")     # output will be in range of 01-31
cyclistic$month             <- format(as.Date(cyclistic$date), "%m")     # output will be in range of 01-12
cyclistic$year              <- format(as.Date(cyclistic$date), "%Y")     # output will be in 4-digit year
cyclistic$day_of_week       <- format(as.Date(cyclistic$date), "%A")     # output will be unabbreviated weekday 
cyclistic$year_month        <- format(as.Date(cyclistic$date), "%Y-%m")  # output will be in yyyy-mm and will be used to aggregate by month

now, to add the ride length calculation, we use the difftime() function, this will result in a new column with values in unit of seconds, for more details, check out the below link:

https://stat.ethz.ch/R-manual/R-devel/library/base/html/difftime.html

cyclistic$ride_length <- 
  difftime(cyclistic$ended_at, cyclistic$started_at)   

To check the newly added column, let’s use str() again

>>> str(cyclistic)
    tibble [5,900,385 × 16] (S3: tbl_df/tbl/data.frame)
     $ ride_id           : chr [1:5900385] "0A1B623926EF4E16" "B2D5583A5A5E76EE" "6F264597DDBF427A" "379B58EAB20E8AA5" ...
     $ rideable_type     : chr [1:5900385] "docked_bike" "classic_bike" "classic_bike" "classic_bike" ...
     $ started_at        : POSIXct[1:5900385], format: "2021-07-02 14:44:36" "2021-07-07 16:57:42" "2021-07-25 11:30:55" "2021-07-08 22:08:30" ...
     $ ended_at          : POSIXct[1:5900385], format: "2021-07-02 15:19:58" "2021-07-07 17:16:09" "2021-07-25 11:48:45" "2021-07-08 22:23:32" ...
     $ start_station_name: chr [1:5900385] "Michigan Ave & Washington St" "California Ave & Cortez St" "Wabash Ave & 16th St" "California Ave & Cortez St" ...
     $ start_station_id  : chr [1:5900385] "13001" "17660" "SL-012" "17660" ...
     $ end_station_name  : chr [1:5900385] "Halsted St & North Branch St" "Wood St & Hubbard St" "Rush St & Hubbard St" "Carpenter St & Huron St" ...
     $ end_station_id    : chr [1:5900385] "KA1504000117" "13432" "KA1503000044" "13196" ...
     $ member_casual     : chr [1:5900385] "casual" "casual" "member" "member" ...
     $ date              : Date[1:5900385], format: "2021-07-02" "2021-07-07" "2021-07-25" "2021-07-08" ...
     $ day               : chr [1:5900385] "02" "07" "25" "08" ...
     $ month             : chr [1:5900385] "07" "07" "07" "07" ...
     $ year              : chr [1:5900385] "2021" "2021" "2021" "2021" ...
     $ day_of_week       : chr [1:5900385] "Friday" "Wednesday" "Sunday" "Thursday" ...
     $ year_month        : chr [1:5900385] "2021-07" "2021-07" "2021-07" "2021-07" ...
     $ ride_length       : 'difftime' num [1:5900385] 2122 1107 1070 902 ...
      ..- attr(*, "units")= chr "secs"

Convert “ride_length” from factor to numeric so we can run calculations on the data

is.factor(cyclistic$ride_length)  # returns FALSE
  >>> [1] FALSE

to change from value in seconds to string to numeric value

cyclistic$ride_length <- 
  as.numeric(as.character(cyclistic$ride_length))

to check if the conversion was completed correctly

is.numeric(cyclistic$ride_length) # returns TRUE
  >>> [1] TRUE

Let’s check the statistical summary of the ride_length column

skim_without_charts(cyclistic$ride_length)     # Statistical summary of data 
  >>> ── Data Summary ────────────────────────
                                 Values               
      Name                       cyclistic$ride_length
      Number of rows             5900385              
      Number of columns          1                    
      _______________________                         
      Column type frequency:                          
        numeric                  1                    
      ________________________                        
      Group variables            None                 

      ── Variable type: numeric ───────────────────────────────────────────────────────────────────────────────────────────────
        skim_variable n_missing complete_rate  mean    sd    p0 p25 p50  p75    p100
      1 data                  0             1 1217. 9313. -8245 377 670 1212 2946429

we can see that the minimum value in the ride length column is negative, some of the entries in the data frame were when bikes got taken out of docks and checked for quality, we must remove the negative data but it is best to create a new dataframe

cyclistic_v2 <- cyclistic %>% 
  filter(ride_length > 0)

STEP 4: Conducting descriptive analysis

descriptive analysis on ride_length (all values in seconds)

mean(cyclistic_v2$ride_length)      #straight average (total ride length / rides)
  >>> [1] 1217.109

median(cyclistic_v2$ride_length)    #midpoint number in the ascending array of ride lengths
  >>> [1] 670

max(cyclistic_v2$ride_length)       #longest ride
  >>> [1] 2946429

min(cyclistic_v2$ride_length)       #shortest ride
  >>> [1] 1

using summary(), the four chunks above can be condensed

summary(cyclistic_v2$ride_length)
   >>> Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
          1     377     670    1217    1212 2946429 

it seems that the maximum value is an outlier, and this should be investigated furthur.

to compare members and casual users aggregation fuction was used this link provides examples on how to aggregate:

https://www.statology.org/r-mean-by-group/

average ride length by customer type:

aggregate(cyclistic_v2$ride_length ~ cyclistic_v2$member_casual, FUN = mean)
  >>> cyclistic_v2$member_casual cyclistic_v2$ride_length
    1                     casual                 1789.431
    2                     member                  779.049

median ride length by customer type:

aggregate(cyclistic_v2$ride_length ~ cyclistic_v2$member_casual, FUN = median)
  >>> cyclistic_v2$member_casual cyclistic_v2$ride_length
    1                     casual                      891
    2                     member                      544

max ride length by customer type:

aggregate(cyclistic_v2$ride_length ~ cyclistic_v2$member_casual, FUN = max)
  >>> cyclistic_v2$member_casual cyclistic_v2$ride_length
    1                     casual                  2946429
    2                     member                    93594

min ride length by customer type:

aggregate(cyclistic_v2$ride_length ~ cyclistic_v2$member_casual, FUN = min)
  >>> cyclistic_v2$member_casual cyclistic_v2$ride_length
    1                     casual                        1
    2                     member                        1

to get the average ride time by day of week for members vs casual users, we use the same aggregation function, but add the day of week to the aggregation

aggregate(cyclistic_v2$ride_length, list(cyclistic_v2$member_casual, cyclistic_v2$day_of_week), FUN=mean)

   >>> Group.1   Group.2         x
 casual    Friday 1696.1002
 member    Friday  763.6394
 casual    Monday 1837.7099
 member    Monday  758.2602
 casual  Saturday 1934.4052
 member  Saturday  871.8730
 casual    Sunday 2067.6086
 member    Sunday  881.4162
 casual  Thursday 1634.0557
member  Thursday  747.8117
casual   Tuesday 1544.8222
member   Tuesday  731.5243
casual Wednesday 1544.6104
member Wednesday  732.5412

To organize the days from Sunday to Saturday:

cyclistic_v2$day_of_week <- 
  ordered(cyclistic_v2$day_of_week,
          levels=c(   "Sunday",
                      "Monday",
                      "Tuesday",
                      "Wednesday",
                      "Thursday",
                      "Friday",
                      "Saturday"))

Then running the previos code:

aggregate(cyclistic_v2$ride_length, list(cyclistic_v2$member_casual, cyclistic_v2$day_of_week), FUN=mean)

   >>> Group.1   Group.2         x
 casual    Sunday 2067.6086
 member    Sunday  881.4162
 casual    Monday 1837.7099
 member    Monday  758.2602
 casual   Tuesday 1544.8222
 member   Tuesday  731.5243
 casual Wednesday 1544.6104
 member Wednesday  732.5412
 casual  Thursday 1634.0557
member  Thursday  747.8117
casual    Friday 1696.1002
member    Friday  763.6394
casual  Saturday 1934.4052
member  Saturday  871.8730

we can also run this code for the median, maximum and minum ride lengths:

aggregate(cyclistic_v2$ride_length, list(cyclistic_v2$member_casual, cyclistic_v2$day_of_week), FUN=median)
  
  >>> Group.1   Group.2    x
 casual    Sunday 1035
 member    Sunday  606
 casual    Monday  895
 member    Monday  523
 casual   Tuesday  780
 member   Tuesday  515
 casual Wednesday  779
 member Wednesday  523
 casual  Thursday  792
member  Thursday  528
casual    Friday  847
member    Friday  537
casual  Saturday 1005
member  Saturday  610

aggregate(cyclistic_v2$ride_length, list(cyclistic_v2$member_casual, cyclistic_v2$day_of_week), FUN=max)

   >>> Group.1   Group.2       x
 casual    Sunday 2497750
 member    Sunday   89996
 casual    Monday 1861873
 member    Monday   89997
 casual   Tuesday 1500471
 member   Tuesday   89997
 casual Wednesday 2149238
 member Wednesday   89998
 casual  Thursday 2946429
member  Thursday   89997
casual    Friday 2321116
member    Friday   89998
casual  Saturday 2443476
member  Saturday   93594

aggregate(cyclistic_v2$ride_length, list(cyclistic_v2$member_casual, cyclistic_v2$day_of_week), FUN=min)

   >>> Group.1   Group.2 x
 casual    Sunday 1
 member    Sunday 1
 casual    Monday 1
 member    Monday 1
 casual   Tuesday 1
 member   Tuesday 1
 casual Wednesday 1
 member Wednesday 1
 casual  Thursday 1
member  Thursday 1
casual    Friday 1
member    Friday 1
casual  Saturday 1
member  Saturday 1

The statistical analysis shows that on any given day, casual riders have longer trips than annual memebers.

Let’s create new datasets that show the number of rides and average ride length per day of week and another per month from the start to end of the dataset (2021-07-01 to 2022-06-30) for each customer type:

weekday data:

summarized_data <- cyclistic_v2 %>% 
  mutate(weekday = wday(started_at, label = TRUE)) %>%    #creates weekday field using wday() True gives day name 
  group_by(member_casual, weekday) %>%                    #groups by usertype and weekday
  summarize(number_of_rides = n()							            #calculates the number of rides and average duration 
            ,average_duration = mean(ride_length)) %>% 		# calculates the average duration
  arrange(member_casual, weekday)							          	# sorts
  
 >>> `summarise()` has grouped output by 'member_casual'. You can override using the `.groups` argument.
 
 print(summarized_data)
 >>> # A tibble: 14 × 4
    # Groups:   member_casual [2]
       member_casual weekday number_of_rides average_duration
       <chr>         <ord>             <int>            <dbl>
     1 casual        Sun              467027            2068.
     2 casual        Mon              303967            1838.
     3 casual        Tue              277738            1545.
     4 casual        Wed              285450            1545.
     5 casual        Thu              325143            1634.
     6 casual        Fri              363065            1696.
     7 casual        Sat              535497            1934.
     8 member        Sun              398881             881.
     9 member        Mon              469011             758.
    10 member        Tue              518207             732.
    11 member        Wed              516507             733.
    12 member        Thu              525901             748.
    13 member        Fri              469221             764.
    14 member        Sat              444124             872.
 

monthly data:

summarized_month <- cyclistic_v2 %>%
  group_by(member_casual, year_month) %>%                      #groups by usertype and month
  summarize(number_of_rides = n()							            #calculates the number of rides and average duration 
            ,average_duration = mean(ride_length)) %>% 		# calculates the average duration
  arrange(member_casual, year_month)							                  	# sorts

>>> `summarise()` has grouped output by 'member_casual'. You can override using the `.groups` argument.

>>> print(summarized_month)
    # A tibble: 24 × 4
    # Groups:   member_casual [2]
       member_casual year_month number_of_rides average_duration
       <chr>         <chr>                <int>            <dbl>
     1 casual        2021-07             442011            1968.
     2 casual        2021-08             412608            1727.
     3 casual        2021-09             363840            1669.
     4 casual        2021-10             257203            1721.
     5 casual        2021-11             106884            1388.
     6 casual        2021-12              69729            1410.
     7 casual        2022-01              18517            1823.
     8 casual        2022-02              21414            1603.
     9 casual        2022-03              89874            1958.
    10 casual        2022-04             126398            1772.
    # … with 14 more rows

we can export these datasets for vizualization and analysis.

STEP 5: Export summarized data for further analysis

by creating a csv file, we can export the data out of R and use it on Tablaue, excel or any other data viz tool

write.csv(summarized_weekday,
          "C:/"Enter your path and file name".csv",
          row.names = TRUE)


write.csv(summarized_month,
          "C:/"Enter your path and file name".csv",
          row.names = TRUE)

Data visualization

To create data visualizations using R, for week day and monthly data, the following codes were used:

summarized_weekday %>% 
  ggplot(aes(x = weekday,                                       # x-axis is weekdays
             y = number_of_rides,                               # y-axis is number of rides
             fill = member_casual)) +                           # fill columns by customer type
  geom_col(position = "dodge") +                                # separate bar for each customer type instead of stacked
  labs(title = "Cyclistic Total Number of Rides per Customer",  # plot title
     subtitle = "From 2021-07-01 to 2022-06-30",                # plot subtitle
     y= "Number of Rides",                                      # y-axis label
     x = "Week Day") +                                          # x-axis label
  geom_text(aes(label = round(number_of_rides,-3)),             # add number of rides to the plot rounded to the nearest thousand
            position = position_dodge(width = 0.9),             # positioning and colouring functions
            vjust = -0.25, color = "black")

ggsave("Number of Rides per Customer-weekday .png")             # save plot as

Number of Rides per Customer-weekday

summarized_weekday %>% 
  ggplot(aes(x = weekday,                                           # x-axis is weekdays
             y = average_duration,                                  # y-axis is number of rides
             fill = member_casual)) +                               # fill columns by customer type
  geom_col(position = "dodge") +                                    # separate bar for each customer type instead of stacked
  labs(title = "Cyclistic Average Ride Duration per Customer Type", # plot title
       subtitle = "From 2021-07-01 to 2022-06-30",                  # plot subtitle
       y= "Average ride duration",                                  # y-axis label
       x = "Week Day") +                                            # x-axis label
  geom_text(aes(label = round(average_duration, 0)),                # add number of rides to the plot rounded to the nearest whole number
            position = position_dodge(width = 0.9),                 # positioning and colouring functions
            vjust = -0.25, color = "black")

ggsave("Average Ride Duration per Customer Type-weekday .png")             # save plot as

Average Ride Duration per Customer Type-weekday

  ggplot(data = summarized_month) +
  geom_col(mapping = aes(x = year_month,
                          y = number_of_rides,
                          fill = member_casual)) +
  labs(title = "Cyclistic Number of Rides per Customer Type", # plot title
       subtitle = "Monthly",                  # plot subtitle
       y = "Number of Rides",
       x = "year-Month") 
  
ggsave("Number of Rides per Customer Type-Monthly .png")             # save plot as

Number of Rides per Customer Type-Monthly

ggplot(data = summarized_month) +
  geom_point(mapping = aes(x = year_month,
                         y = average_duration,
                         color = member_casual,
                         size = 3)) +
  labs(title = "Cyclistic Number of Rides per Customer Type", # plot title
       subtitle = "Monthly",                  # plot subtitle
       y = "Number of Rides",
       x = "year-Month") 

ggsave("Average Ride Duration per Customer Type-Monthly .png")             # save plot as

Average Ride Duration per Customer Type-Monthly

After the analysis and visualization, these are my recomendations to convert casual riders to annual customers:

1- Provide a weekly or monthly subscription trial for casual riders and share with them a visualization on the savings they can make by joining the annual membership, instead of paying per ride.

2- The number of bike rides drops significantly during winter months and starts to go higher by the end of February, this provides a great chance for the company to sart advertising post winter to attract more customers.

3- The data shows that casual customers have a higher ride duration than annaual riders on days of the week and also on monthly basis, therefore the company can set a cap on the length of time they can use the bike brfore being charged an extra amount. This will not apply to annual members and may help in converting customers.

Qasem Al Hijji

Data Science Portfolio