Citi Bike: A Multivariate Analysis of Bicycle Rentals in NYC
Introduction:
Citi Bike is a privately owned bike share program serving a number of cities, including New York City, which is the location of this analysis. The program collects anonymous data on its users and their bike usage, and they make that information available to the public for analysis.
As with any organization, Citi Bike must know how much supply and demand there will be for their service on any day. Therefore, I was interested in creating a predictive model to provide an answer to this problem.
The Datasets:
To perform this analysis, I am combining three datasets. The first collection is from Citi Bike itself, which contains users’ basic customer information and trip histories. There are 16.3 million records over the course of 12 months in this collection, dated from January 1st, 2017 to December 31st, 2017.
The second dataset contains daily weather data obtained from the National Oceanic and Atmospheric Administration (NOAA).
The third dataset is a list of bank holidays and their corresponding dates. Bank holidays are days in which the banks are closed due to a recent holiday, and which are subsequently viewed as public holidays. Note that if a holiday (such as New Year’s Day) falls on a Sunday, the bank holiday would be the following day (Monday). See Table 1 for the full list of variables analyzed in this project. See Table 2 to view a sample of my preprocessed "training" dataset the dataset which will be used as a measure for predicting future bicycle rentals.
A fourth dataset was also obtained from Citi Bike, which includes data on rentals from January 1st, 2018 to November 30th, 2018. December 2018 data was not available at the time of this analysis. This dataset will be used as a "test" set, to assess how accurately the 2017 model will predict rentals in 2018.
Citi Bike is a privately owned bike share program serving a number of cities, including New York City, which is the location of this analysis. The program collects anonymous data on its users and their bike usage, and they make that information available to the public for analysis.
As with any organization, Citi Bike must know how much supply and demand there will be for their service on any day. Therefore, I was interested in creating a predictive model to provide an answer to this problem.
The Datasets:
To perform this analysis, I am combining three datasets. The first collection is from Citi Bike itself, which contains users’ basic customer information and trip histories. There are 16.3 million records over the course of 12 months in this collection, dated from January 1st, 2017 to December 31st, 2017.
The second dataset contains daily weather data obtained from the National Oceanic and Atmospheric Administration (NOAA).
The third dataset is a list of bank holidays and their corresponding dates. Bank holidays are days in which the banks are closed due to a recent holiday, and which are subsequently viewed as public holidays. Note that if a holiday (such as New Year’s Day) falls on a Sunday, the bank holiday would be the following day (Monday). See Table 1 for the full list of variables analyzed in this project. See Table 2 to view a sample of my preprocessed "training" dataset the dataset which will be used as a measure for predicting future bicycle rentals.
A fourth dataset was also obtained from Citi Bike, which includes data on rentals from January 1st, 2018 to November 30th, 2018. December 2018 data was not available at the time of this analysis. This dataset will be used as a "test" set, to assess how accurately the 2017 model will predict rentals in 2018.
Data Cleaning Steps:
After loading the necessary packages in R for data cleaning, analysis, and visualization (ggplot2, plyr, pastecs, ppcor, mctest, and plotrix), I read in all Citi Bike data into 12 separate tables (each table being a month’s worth of rentals). R is a case sensitive language; because some of the columns did not have standardized capitalization across tables, I had to set them all to lowercase before performing a join. I then removed all columns that were not applicable to my analysis. From there, I removed all records where an NA value occurred. I then queried the dataset to show only records where the user type was a subscriber to the program; as there were not many singleday users in the system, I did not want this difference to adversely affect my analysis (Figure 2). I then removed observations where trip durations were under 120 seconds (to account for broken bicycles). I also removed rows where trip durations were over 1 hour to account for stolen bicycles and to remove outliers (as it is not economically justifiable to rent a bicycle for over 45 minutes due to extra fees incurred) (Citi Bike, 2018). These did not occur often in the dataset, so their removal is justified (Figure 3). 
Variable Alias 
Variable Name 
Data Type 
Used in Final Analysis 
Weekday/Weekend 
isWeekend 
Categorical 
X 
Bank Holidays 
holiday 
Categorical 
X 
Precipitation (inches) 
PRCP 
Quantitative 
X 
Average Temperature (°F) 
TAVG 
Quantitative 
X 
Average Wind Speed 
AWND 
Quantitative 
 
Rentals Per Day* 
freq 
Quantitative 
X 
Table 1: Variables used used in the analyses. “Variable Name” lists how I hardcoded the “Variable Alias” in R. *Denotes that “Rentals per Day” is the dependent variable.
I then formatted my date column to match the weather dataset formatting in preparation for dataset merging, and then extracted the weekday for each record. I categorized these weekdays according to “weekday” and “weekend”. Because I wanted to analyze the differences between weekday and weekend rentals in a regression analysis, I converted these factors into integers, with “weekend” valued at 1, and "weekday" valued at 0.
I then loaded the weather data I obtained from NOAA and performed an inner join on the two datasets, merging on the “rental date” column. After the join, I created a new column, which I populated with the average temperature of the day. This was calculated by finding the mean of the low and high temperatures in each observation.
Because I was working with millions of observations, I took a simple random sample of 10,000 records without replacement to improve my computing efficiency.
I then aggregated my records based on their dates. Each record would now represent the number of rentals on a particular day.
Finally, I joined my aggregated dataset with the bank holidays dataset on the “date” columns. All records that contained a holiday were given a value of 1, and those without a holiday were given a value of 0.
I then formatted my date column to match the weather dataset formatting in preparation for dataset merging, and then extracted the weekday for each record. I categorized these weekdays according to “weekday” and “weekend”. Because I wanted to analyze the differences between weekday and weekend rentals in a regression analysis, I converted these factors into integers, with “weekend” valued at 1, and "weekday" valued at 0.
I then loaded the weather data I obtained from NOAA and performed an inner join on the two datasets, merging on the “rental date” column. After the join, I created a new column, which I populated with the average temperature of the day. This was calculated by finding the mean of the low and high temperatures in each observation.
Because I was working with millions of observations, I took a simple random sample of 10,000 records without replacement to improve my computing efficiency.
I then aggregated my records based on their dates. Each record would now represent the number of rentals on a particular day.
Finally, I joined my aggregated dataset with the bank holidays dataset on the “date” columns. All records that contained a holiday were given a value of 1, and those without a holiday were given a value of 0.
Date 
Freq 
AWND 
PRCP 
TAVG 
isWeekday 
Holiday 
20170101 
12 
5.59 
0.00 
44.0 
1 
0.0 
20170102 
8 
9.17 
0.21 
39.0 
0.0 
1 
20170103 
10 
10.74 
0.58 
41.0 
0.0 
0.0 
20170104 
33 
8.05 
0.00 
43.0 
0.0 
0.0 
20170105 
19 
7.83 
0.00 
30.5 
0.0 
0.0 
Table 2: First five records from the dataset. The unit of observation is the number of rentals (Freq) per day. Notice that the independent variable “Holiday” refers to bank holidays. In this case, if a holiday (such as New Year’s Day) falls on a Sunday, the banking system may be closed on the following weekday (such as January 2nd). For full dataset, see appendix.
Descriptive Statistics:
Value 
freq* 
AWND 
PRCP 
TAVG 
isWeekday 
holiday 
nbr.val 
361 
361 
361 
361 
361 
361 
nbr.null 
0.0 
0.0 
242 
0.0 
256 
351 
nbr.na 
0.0 
0.0 
0.0 
0.0 
0.0 
0.0 
min 
1 
1.12 
0.0 
14.5 
0.0 
0.0 
max 
57 
18.34 
3.03 
85.5 
1 
1 
range 
56 
17.22 
3.03 
71 
1 
1 
sum 
10000 
1880.14 
41.91 
20477 
105 
10 
median 
28 
4.92 
0.0 
57.5 
Value 
0.0 
mean 
27.70083102 
5.208144044 
0.116094183 
56.72299169 
0.290858726 
0.027700831 
SE.mean 
0.647898856 
0.120091085 
0.017951362 
0.878559458 
0.023936254 
0.008649582 
CI.mean 
1.274141992 
0.236168181 
0.035302708 
1.727753471 
0.047072449 
0.017010055 
var 
151.5380271 
5.206294601 
0.116332758 
278.6438866 
0.206832872 
0.02700831 
std.dev 
12.31007827 
2.281730615 
0.341075883 
16.69262971 
0.454788821 
0.164342053 
coef.var 
0.444393826 
0.438108201 
2.937923972 
0.294283309 
1.56360728 
5.932748098 
Table 3: Descriptive statistics. Nbr.val is the number of observations. Nbr.null is the number of cells that contain a zero value. Nbr.na are the number of cells that contain NA values. *Denotes dependent variable.
Value 
PRCP 
TAVG 
AWND 
isWeekend 
holiday 
PRCP 
1 

TAVG 
0.010548 
1 

AWND 
0.190215 
0.50814 
1 

isWeekend 
0.02758 
0.04937 
0.02004 
1 

holiday 
0.0159 
0.05694 
0.086437 
0.1081 
1 
Table 4: Pearson Correlation Matrix. Intervariable correlation is seen under the AWND variable when compared with TAVG and PRCP.
Value 
VIF 
Tolerance 
AWND 
1.4328 
0.6979 
PRCP 
1.0571 
0.9460 
TAVG 
1.3781 
0.7256 
isWeekday 
1.0186 
0.9818 
holiday 
1.0204 
0.9800 
Table 5: Variance inflation factor (VIF) and tolerance table. This tests for multicollinearity. Tolerance less than 0.1 or a VIF of 10 indicates a high degree of multicollinearity.
Justification for the use of Multiple Regression:
I decided to use multiple regression as my statistical technique because I wanted to predict a quantitative value (number of rentals per day) based on a combination of quantitative and categorical variables. Though my sample showed evidence of heteroscedasticity (see Figure 6), I found it could be remedied somewhat through a square root transformation of the dependent variable (see Figure 7). Similarly, the square root transformation also improved linearity between my independent variables and the dependent variables (see Figures 8 – 15). Additionally, among my independent variables I found scant evidence of multicollinearity from the tolerance and VIF techniques, and only found moderate correlation from the AWND variable as it related to the TAVG variables, as seen from the the Pearson Correlation Matrix (see Table 4). Additionally, I found my dataset to be close enough to normal and even more so after the square root transformation to rationalize utilizing the regression method (see Figures 4 and 5).
I decided to use multiple regression as my statistical technique because I wanted to predict a quantitative value (number of rentals per day) based on a combination of quantitative and categorical variables. Though my sample showed evidence of heteroscedasticity (see Figure 6), I found it could be remedied somewhat through a square root transformation of the dependent variable (see Figure 7). Similarly, the square root transformation also improved linearity between my independent variables and the dependent variables (see Figures 8 – 15). Additionally, among my independent variables I found scant evidence of multicollinearity from the tolerance and VIF techniques, and only found moderate correlation from the AWND variable as it related to the TAVG variables, as seen from the the Pearson Correlation Matrix (see Table 4). Additionally, I found my dataset to be close enough to normal and even more so after the square root transformation to rationalize utilizing the regression method (see Figures 4 and 5).
Figure 4: Left  Quantilequantile (QQ) plot as a test for normality. Note how the points do not match up with the regression line well. Figure 5: Right After root transformation, the data improves its normality, though is still skewed at the extremes. This plot was performed on the final regression equation.
Figure 6: Left Residual plot on regression using all independent variables and prior to root transformation. Notice the heteroscedasticity. Right Residual plot on regression using all independent variables. Notice how after root transformation, the residuals move closer to homoscedasticity.
Figure 7: Left Evidence against assumption of linearity when examining temperature to rentals. Rentals appear to decrease at both low and high temperature extremes. Shadowed area around regression line indicates a 95% confidence interval. Right: Linearity improves slightly after a square root transformation on the dependent variable.
Figure 8: Left  Evidence against assumption of linearity when examining wind speed to rentals. Right  Linearity improves marginally after a square root transformation is performed on the dependent variable.
Figure 9: Left  Evidence against linearity. Right  Linearity improves slightly after a square root transformation is performed on the dependent variable.
Figure 10: More rentals occur on weekdays versus weekends. Not all points are shown due to overplotting.
Figure 11: Bank holidays to rentals. Less rentals occur on bank holidays. Not all points are shown due to overplotting.
Analyses and Results:
To analyze each independent variable’s contribution to the strength of the regression line, I created several regression equations, incrementally adding variables to subsequent equations and measuring their impact on the R2 and adjusted R2 values. My final model output and interpretations is as follows:
To analyze each independent variable’s contribution to the strength of the regression line, I created several regression equations, incrementally adding variables to subsequent equations and measuring their impact on the R2 and adjusted R2 values. My final model output and interpretations is as follows:
Slope Estimate 
Std. Error 
tvalue 
Pvalue 

(Intercept) 
2.842370 
0.146923 
19.346 
< 2e16 
TAVG 
0.048033 
0.002404 
19.977 
< 2e16 
PRCP 
0.947616 
0.117355 
8.075 
1.06e14 
isWeekday 
1.073883 
0.088655 
12.113 
< 2e16 
holiday 
1.442509 
0.245367 
5.879 
9.50e09 
R2: 0.6543
Adjusted R2: 0.6504
Model Pvalue: < 2.2e16
Square root transforming the dependent variable improved the adjusted R2 values from 0.632 (not shown) to 0.6504 by reducing heteroscedasticity. This new value implies that the independent variables explain 65% of the variability in Y. The overall model pvalue is extremely statistically significant.
Adjusted R2: 0.6504
Model Pvalue: < 2.2e16
Square root transforming the dependent variable improved the adjusted R2 values from 0.632 (not shown) to 0.6504 by reducing heteroscedasticity. This new value implies that the independent variables explain 65% of the variability in Y. The overall model pvalue is extremely statistically significant.
Figure 12: Actual rentals vs predicted rentals (both square root transformed) for 2018 Citi Bike data, up to November 30th (December data has not been released prior to this analysis). Using the linear model from the previous table, I predicted the square root transformation of bicycle rentals for the year 2018. Difference between actual and predicted values is small, but is accentuated towards the end of the year.
Conclusion:
Ultimately, my final regression model does a good job at predicting bicycle rentals given the available set of independent variables and explains approximately 65% of the variability in the squareroottransformed dependent variable.
In the end, though, I did not use all my available independent variables; I decided to remove the average wind speed variable (AWND) from the final model. Though AWND is significant at the 0.01 level as seen in Table 11, it straddles the threshold, casting doubt on its fit in the model especially given that all other independent variables have extremely low pvalues. And though AWND has a low VIF and high tolerance scores which suggest low combined multicollinearity, it still shows moderatetohigh correlation with the average temperature (TAVG) variable as shown in the Pearson Correlation Matrix. Additionally, its inclusion in the dataset has only marginally increased the adjustedR2 value by less than 0.01. Therefore, I have decided to remove it from my final model.
There remain some limitations in the analysis, however. There were issues of linearity between the quantitative independent variables and the dependent variable that could not be completely resolved through the square root transformation of the dependent variable. This issue of linearity is likely due to, in part, how weather affects rentals; as temperature approaches high extremes, rentals begin to decrease instead of continuing to rise (Figure 8). As can also be seen from Figure 8, there are individuals who will continue to rent from the program system regardless of low temperatures. The addition of new variables into the analysis, such as air pollution or daily pollen measurements, could potentially have improved the predicting power of the model, as well.
Data Sources and References:
Citi Bike. (2018, December 10). Citi Bike. Retrieved from Index of Bucket "Trip Data": https://s3.amazonaws.com/tripdata/index.html
Citi Bike. (2018, December 9). Citi Bike Membership & Pricing. Retrieved from Citi Bike: https://www.citibikenyc.com/pricing
NOAA. (2018, December 10). National Centers for Environmental Information. Retrieved from National Centers for Environmental Information: https://www.ncei.noaa.gov
shivaas. (2018, December 10). CSV for all US bank holidays till 2020. First row is the header. Dates are MYSQL format. Retrieved from Git Hub: https://gist.github.com/shivaas/4758439
Ultimately, my final regression model does a good job at predicting bicycle rentals given the available set of independent variables and explains approximately 65% of the variability in the squareroottransformed dependent variable.
In the end, though, I did not use all my available independent variables; I decided to remove the average wind speed variable (AWND) from the final model. Though AWND is significant at the 0.01 level as seen in Table 11, it straddles the threshold, casting doubt on its fit in the model especially given that all other independent variables have extremely low pvalues. And though AWND has a low VIF and high tolerance scores which suggest low combined multicollinearity, it still shows moderatetohigh correlation with the average temperature (TAVG) variable as shown in the Pearson Correlation Matrix. Additionally, its inclusion in the dataset has only marginally increased the adjustedR2 value by less than 0.01. Therefore, I have decided to remove it from my final model.
There remain some limitations in the analysis, however. There were issues of linearity between the quantitative independent variables and the dependent variable that could not be completely resolved through the square root transformation of the dependent variable. This issue of linearity is likely due to, in part, how weather affects rentals; as temperature approaches high extremes, rentals begin to decrease instead of continuing to rise (Figure 8). As can also be seen from Figure 8, there are individuals who will continue to rent from the program system regardless of low temperatures. The addition of new variables into the analysis, such as air pollution or daily pollen measurements, could potentially have improved the predicting power of the model, as well.
Data Sources and References:
Citi Bike. (2018, December 10). Citi Bike. Retrieved from Index of Bucket "Trip Data": https://s3.amazonaws.com/tripdata/index.html
Citi Bike. (2018, December 9). Citi Bike Membership & Pricing. Retrieved from Citi Bike: https://www.citibikenyc.com/pricing
NOAA. (2018, December 10). National Centers for Environmental Information. Retrieved from National Centers for Environmental Information: https://www.ncei.noaa.gov
shivaas. (2018, December 10). CSV for all US bank holidays till 2020. First row is the header. Dates are MYSQL format. Retrieved from Git Hub: https://gist.github.com/shivaas/4758439