The dataset is obtained from the 2016 NYC Yellow Cab trip record data made available in Big Query on Google Cloud Platform. The data was originally published by the NYC Taxi and Limousine Commission (TLC). This report will be demostrating on the taxi trip time prediction from the following predictors : Pickup location, Drop Of Location, and Pickup Datetime.
We have done some exploratory data analysis in the previous report New York Taxi Popular Pickup Location Map.
We will continue to massage the data in a different way in order to predict the trip time of taxi ride. First, we download the data from the kaggle website, and inspect the columns.
## [1] "id" "vendor_id" "pickup_datetime"
## [4] "dropoff_datetime" "passenger_count" "pickup_longitude"
## [7] "pickup_latitude" "dropoff_longitude" "dropoff_latitude"
## [10] "store_and_fwd_flag" "trip_duration"
## [1] "id" "vendor_id" "pickup_datetime"
## [4] "passenger_count" "pickup_longitude" "pickup_latitude"
## [7] "dropoff_longitude" "dropoff_latitude" "store_and_fwd_flag"
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 2405122 128.5 3886542 207.6 3205452 171.2
## Vcells 3314474 25.3 5721718 43.7 4583724 35.0
We will select the following columns: pickup_longitude, pickup_latitude, dropoff_longitude, dropoff_latitude, pickup_datetime, dropof_datetime as the input variables, and the trip_duration will be our outcome variable.
In order to translate the geospatial data more accurately, we will be adding some new columns:
* Trip Distance: We calculate the trip distance via the distance between pickup / drop off location using the geosphere library (distGeo function), we will update the dataset with an additional column named "trip_distance" stored the distance in meters.
* Trip bearing / haversing. It's showing the trip's arc distance and direction of two geo points.
* Pick up hour : We will derive the pickup hour from the pickup_datetime column using the hour() API.
* Weekday / Weekend / holiday indicator: We will derive a factor variable trip_wkday_ind from the pickup_datetime column. T represent weekdays, F represent Weekends.
pickups <- matrix(c(allDS[,6], allDS[,7]), ncol=2)
dropoffs <- matrix(c(allDS[,8], allDS[,9]), ncol=2)
allDS$trip_distance <- distGeo(pickups, dropoffs)
summary(allDS$trip_distance)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 1232 2094 3440 3877 1240510
allDS$trip_bearing <- bearing(pickups, dropoffs)
allDS$trip_haversine <- distHaversine(pickups, dropoffs)
allDS <- allDS %>%
filter(!is.na(pickup_datetime)) %>%
mutate(pickup_datetime = as.POSIXct(strptime(pickup_datetime, format="%Y-%m-%d %H:%M:%S"))) %>%
mutate(dropoff_datetime = as.POSIXct(strptime(dropoff_datetime, format="%Y-%m-%d %H:%M:%S"))) %>%
mutate(trip_hour = hour(pickup_datetime)) %>%
mutate(trip_wkday_ind =
ifelse(weekdays(pickup_datetime) %in% c("Saturday", "Sunday"),
"F", "T")) %>%
mutate(trip_date = date(pickup_datetime)) %>%
mutate(trip_duration_computed = as.numeric(difftime(dropoff_datetime, pickup_datetime, units="secs"))) %>%
mutate(trip_month=month(trip_date)) %>%
mutate(trip_wkday=weekdays(trip_date, abbreviate=TRUE))
## Warning: package 'bindrcpp' was built under R version 3.4.2
summary(allDS$trip_duration)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.0 397.0 662.0 952.8 1075.0 86392.0 625134
summary(allDS$trip_hour)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 9.00 14.00 13.61 19.00 23.00 1
summary(allDS$trip_wkday_ind)
## Length Class Mode
## 2083774 character character
summary(allDS$trip_speed)
## Length Class Mode
## 0 NULL NULL
When dealing with the popular pickup location map , we found there are some popular pick up location that’s near suburb area like airport. Most of the pickup locations are near Mahattan. We wonder if those pickup/drop off location will affect the speed since the airport is usually off suburburn and the driving speed usually is greater. We will use kmeans to find the clustered pickup/drop off location and add two more additional column named pickup_cluster_id, dropoff_cluster_id as additional predictors.
set.seed(7553)
pickupKMS <- kmeans(cbind(allDS$pickup_longitude, allDS$pickup_latitude), centers=10)
dropKMS <- kmeans(cbind(allDS$dropoff_longitude, allDS$dropoff_latitude), centers=10)
allDS$trip_pickup_cluster <-pickupKMS$cluster
allDS$trip_dropoff_cluster <- dropKMS$cluster
pickupCenters <- as.data.frame(as.matrix(pickupKMS$centers, ncol=2))
pickupCenters$cnt <- pickupKMS$size
names(pickupCenters) <- c("lng", "lat", "cnt")
dropoffCenters <- as.data.frame(as.matrix(dropKMS$centers, ncol=2))
dropoffCenters$cnt <- dropKMS$size
names(dropoffCenters) <- c("lng", "lat", "cnt")
leaflet(pickupCenters) %>%
addTiles() %>%
setView(lng=-73.95214, lat=40.77422, zoom=10) %>%
addCircleMarkers(lng=~lng, lat=~lat,
popup=as.character(pickupCenters$cnt),
popupOptions=popupOptions(riseOnHover=TRUE))
Some month might have snow fall and resulted in long pickup time. We check to see if there is any pattern for the month v.s. average trip duration.
aggrMonthDS <- allDS %>%
filter(!is.na(trip_duration)) %>%
group_by(trip_month) %>%
summarise_at(vars(trip_duration), mean)
plot_ly(aggrMonthDS, x=~trip_month, y=~trip_duration, type="scatter", mode="markers+lines")
Additionally, we also aggregate by the trip_hour and found that the speed has been drastically reduced in traffic hour (hour 8 am to 7 pm), while the speed seems to be the highest in midnight (hour 1 to 4).
aggrHourDS <-allDS %>%
filter(!is.na(trip_duration)) %>%
group_by(trip_hour) %>%
summarise_at(vars(trip_duration), mean)
hourPlot <- plot_ly(data=aggrHourDS, x=~trip_hour, y=~trip_duration, type="scatter", mode="markers+lines")
hourPlot
Finally, we aggregate by the weekday to see if there is any pattern for weekday v.s. weekend.
weekdayDS <- allDS %>%
filter(!is.na(trip_duration)) %>%
group_by(trip_wkday) %>%
summarise_at(vars(trip_duration), mean)
weekdayPlot <- plot_ly(data=weekdayDS, x=~trip_wkday, y=~trip_duration, type="scatter",mode="markers+lines")
weekdayPlot
We will subset 70% of data as training data, 30% of data as cross-validation data.
set.seed(3456)
trainDS <-
allDS %>%
filter(!is.na(trip_duration)) %>%
filter(!is.na(trip_duration_computed))
trainIndex <- createDataPartition(trainDS$trip_distance, p = .70,
list = FALSE,
times = 1)
tripTrain <- trainDS[trainIndex, ]
tripCV <- trainDS[-trainIndex,]
dim(tripTrain)
## [1] 1021048 22
dim(tripCV)
## [1] 437590 22
We fit a linear regression model first and check the residuals. From the anova of the model, we can see the trip_distance and all other factor variables are all significant. The student test of predicted data using the linear model shows the mean of predicted data is 953, while the mean of the training data is 950. The rmlse value is around 0.9. We also plot the residuals v.s. outcome variable. We can see there seems to be a tendency that the bigger the trip_duration, our residuals are bigger.
modelFit <- lm(data=tripTrain, trip_duration_computed ~ trip_distance +
factor(trip_hour) +
factor(trip_wkday_ind) +
factor(trip_pickup_cluster) +
factor(trip_month) +1)
anova(modelFit)
## Analysis of Variance Table
##
## Response: trip_duration_computed
## Df Sum Sq Mean Sq F value
## trip_distance 1 2.2291e+11 2.2291e+11 22505.186
## factor(trip_hour) 23 1.1101e+10 4.8266e+08 48.730
## factor(trip_wkday_ind) 1 9.4628e+08 9.4628e+08 95.539
## factor(trip_pickup_cluster) 9 4.7864e+09 5.3183e+08 53.695
## factor(trip_month) 5 1.2961e+09 2.5922e+08 26.171
## Residuals 1021008 1.0113e+13 9.9047e+06
## Pr(>F)
## trip_distance < 2.2e-16 ***
## factor(trip_hour) < 2.2e-16 ***
## factor(trip_wkday_ind) < 2.2e-16 ***
## factor(trip_pickup_cluster) < 2.2e-16 ***
## factor(trip_month) < 2.2e-16 ***
## Residuals
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
predictData <- predict(modelFit, newdata=tripCV)
predictData <- as.data.frame(predictData)
t.test(predictData, tripCV$trip_duration_computed)
##
## Welch Two Sample t-test
##
## data: predictData and tripCV$trip_duration_computed
## t = 0.57382, df = 455340, p-value = 0.5661
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -6.636582 12.131181
## sample estimates:
## mean of x mean of y
## 953.1921 950.4448
rmseInidialModel <- sqrt(1/nrow(tripCV) *
(sum((log2(predictData[,1]+1) - log2(tripCV$trip_duration_computed+1))^2)))
rmseInidialModel
## [1] 0.984994
residualsCV <- resid(modelFit)
plot(x=tripTrain$trip_duration, residualsCV)
We will try split out the data between the dataset that has higher residuals and lower residuals and run different model. We want to know how do we know these rows are residuals, we will run the classification model to categorize the residual categories. We add a new residual Rank variable and update the column with the quantile with the residual for the row.
quantile(modelFit$residuals)
## 0% 25% 50% 75% 100%
## -123803.1451 -369.8356 -194.2295 46.6426 85958.9993
tripTrain$quantileRank <- 1
for (i in 1:4)
{
curRowIndex <- which(modelFit$resid <= quantile(modelFit$residuals)[[i+1]] &
modelFit$resid > quantile(modelFit$residuals)[[i]])
print(paste("Current range:",
as.character(quantile(modelFit$residuals)[[i+1]]),
as.character(quantile(modelFit$residuals)[[i]])))
print(length(curRowIndex))
tripTrain[curRowIndex, "quantileRank"] <- i
}
## [1] "Current range: -369.83559334701 -123803.145099442"
## [1] 255261
## [1] "Current range: -194.229529016227 -369.83559334701"
## [1] 255262
## [1] "Current range: 46.6426028667614 -194.229529016227"
## [1] 255262
## [1] "Current range: 85958.9993074381 46.6426028667614"
## [1] 255262
summary(tripTrain$quantileRank)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 1.75 2.50 2.50 3.25 4.00
We run the rpart classification model to find what kind of categories will result in such residual differences, we can see that due to the trip distance difference, the residual ranking also different.
set.seed(1548)
results_model <- rpart(data=tripTrain,
quantileRank ~
trip_distance +
as.factor(trip_hour) +
factor(trip_wkday) +
factor(trip_pickup_cluster) +
factor(trip_month),
method="class")
print(results_model)
## n= 1021048
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 1021048 765786 1 (0.25000000 0.25000000 0.25000000 0.25000000)
## 2) trip_distance< 1707.387 408395 242636 1 (0.40587911 0.32607647 0.18769084 0.08035358)
## 4) as.factor(trip_hour)=8,9,10,11,12,13,14,15,16,17,18,22 262584 132858 1 (0.49403619 0.26835984 0.15174192 0.08586205) *
## 5) as.factor(trip_hour)=0,1,2,3,4,5,6,7,19,20,21,23 145811 83110 2 (0.24712127 0.43001557 0.25242951 0.07043364) *
## 3) trip_distance>=1707.387 612653 390207 4 (0.14609085 0.19928736 0.29153534 0.36308645)
## 6) trip_distance< 3176.34 287969 189316 3 (0.13683417 0.26247617 0.34258201 0.25810764) *
## 7) trip_distance>=3176.34 324684 176565 4 (0.15430080 0.14324389 0.24626098 0.45619433) *
It looks like the rush hour and trip distance has some effect on residuals, we will form a categorized trip_distance_cat variable as a new feature. We use the rpart model’s decision tree criteria as the criteria when we factor trip distance.
add_rushhour_ind <- function(tripTrain)
{
tripTrain <- tripTrain %>%
mutate(trip_rush_hour = factor(ifelse(trip_hour %in% c(8:18), "R", "N"))) %>%
mutate(trip_distance_cat =
factor(case_when(
trip_distance < 1707.387 ~ 'DIST1',
trip_distance >=1707.387 & trip_distance < 3176.34 ~ 'DIST2',
trip_distance >= 3176.34 ~ 'DIST3'))) %>%
mutate(trip_cluster_cat =
paste(as.character(trip_pickup_cluster),
"_",
as.character(trip_dropoff_cluster), sep=""))
tripTrain
}
tripTrain <- add_rushhour_ind(tripTrain)
tripCV <- add_rushhour_ind(tripCV)
We are wondering if short distance ride will have different outcome than long distance ride , since usually in short distance ride in down town the speed is much slower than long distance ride. We will subset the data according to the trip_distance.
We will try to use xgb (extreme gradient boost) model and linear model, compare the rmse outcome to see which model will fit the data better. We will run through both model on all three trip distance categories.
s1 <- subset(tripTrain, trip_distance_cat == 'DIST1')
s2 <- subset(tripTrain, trip_distance_cat == 'DIST2')
s3 <- subset(tripTrain, trip_distance_cat == 'DIST3')
v1 <- subset(tripCV, trip_distance_cat == 'DIST1')
v2 <- subset(tripCV, trip_distance_cat == 'DIST2')
v3 <- subset(tripCV, trip_distance_cat == 'DIST3')
make_xgb_matrix<- function(d)
{
sparse_matrix <- sparse.model.matrix(trip_duration~
trip_distance+
trip_rush_hour+
factor(trip_pickup_cluster)+
trip_month+
factor(trip_wkday_ind), data = d)
output_vector <- d$trip_duration
ret <- xgb.DMatrix(data=sparse_matrix, label=output_vector)
ret
}
make_xgb_model_prediction <- function(s, v)
{
print("Boosting..")
sparse_matrix_s <- make_xgb_matrix(s)
output_vector <- s$trip_duration
bst <- xgboost(data = sparse_matrix_s, label = output_vector, max_depth = 4,
eta = 1, nthread = 2, nrounds = 10,
eval.metric = "rmse", objective = "reg:linear")
output_vector <- s$trip_duration
foldsCV <- createFolds(output_vector, k=7, list=TRUE, returnTrain=FALSE)
param <- list(colsample_bytree = 0.7
, booster = "gbtree"
, objective = "reg:linear"
, subsample = 0.7
, max_depth = 5
, eta = 0.037
, eval_metric = 'rmse'
, base_score = 0.012 #average
, seed = 4321)
bst <- xgb.cv(data=sparse_matrix_s,
params=param,
nrounds = 30,
folds=foldsCV,label=output_vector,
prediction=TRUE, nthread = 2,
early_stopping_rounds = 15,print_every_n = 5)
nrounds <- bst$best_iteration
print("training the xgb...")
xgb <- xgb.train(params = param
, data = sparse_matrix_s
, nrounds = nrounds
, verbose = 1
, print_every_n = 5
#, feval = amm_mae
)
sparse_matrix_v <- make_xgb_matrix(v)
print("Predict using the xgb boosting model...")
predictedxgb <- predict(xgb, sparse_matrix_v)
predictedxgb
}
make_linear_model <- function(s, v)
{
print("Linear model ...")
split_model <- lm(data=s,
trip_duration ~
trip_distance +
trip_rush_hour +
factor(trip_pickup_cluster) +
1)
print("Prediction using the linear model ...")
predictv <- predict(split_model, newdata=v)
predictv
}
distance_model <- function(s, v)
{
predictedxgb <- make_xgb_model_prediction(s, v)
xgbRmse <- rmsle(predictedxgb, v1$trip_duration)
predictedlinear <- make_linear_model(s,v)
linearRmse <- rmsle(predictedlinear, v$trip_duration)
print(paste("rmse pair:", as.character(xgbRmse), as.character(linearRmse)))
if (xgbRmse < linearRmse)
print("Choose xgb model!")
else
print("Choose linear model!")
}
Model selected for short distance category is xgb. The rmse is `0.65 v.s. 0.73 for linear model.
distance_model(s1, v1)
## [1] "Boosting.."
## [1] train-rmse:2984.521484
## [2] train-rmse:2979.577148
## [3] train-rmse:2976.620117
## [4] train-rmse:2971.056641
## [5] train-rmse:2965.007324
## [6] train-rmse:2958.332520
## [7] train-rmse:2951.788330
## [8] train-rmse:2948.844971
## [9] train-rmse:2944.350586
## [10] train-rmse:2935.891846
## [1] train-rmse:3035.542864+26.830781 test-rmse:3031.663888+163.985620
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 15 rounds.
##
## [6] train-rmse:3018.967739+26.977943 test-rmse:3017.510638+164.435194
## [11] train-rmse:3007.134452+27.399618 test-rmse:3007.754290+164.680793
## [16] train-rmse:2998.069824+27.037241 test-rmse:3000.996129+164.847286
## [21] train-rmse:2991.151298+27.229089 test-rmse:2996.365792+164.906113
## [26] train-rmse:2985.566267+27.302152 test-rmse:2993.183524+164.933369
## [30] train-rmse:2981.920968+26.928495 test-rmse:2991.368583+164.903738
## [1] "training the xgb..."
## [1] "Predict using the xgb boosting model..."
## [1] "Linear model ..."
## [1] "Prediction using the linear model ..."
## [1] "rmse pair: 0.649752416243766 0.73136677063332"
## [1] "Choose xgb model!"
Model selected for median and long distance categories are xgb. The rmse is 0.47 and 0.43 for linear model v.s. 0.85 v.s. 1.28.
distance_model(s2, v2)
## [1] "Boosting.."
## [1] train-rmse:3073.105713
## [2] train-rmse:3067.100586
## [3] train-rmse:3060.443848
## [4] train-rmse:3051.259521
## [5] train-rmse:3040.682861
## [6] train-rmse:3034.505127
## [7] train-rmse:3025.084229
## [8] train-rmse:3024.862061
## [9] train-rmse:3016.772217
## [10] train-rmse:3013.151123
## [1] train-rmse:3193.683070+19.818856 test-rmse:3191.861747+121.749300
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 15 rounds.
##
## [6] train-rmse:3156.515799+19.916089 test-rmse:3157.417515+122.722019
## [11] train-rmse:3130.253313+19.895432 test-rmse:3133.693324+123.390013
## [16] train-rmse:3110.306815+19.993208 test-rmse:3117.196498+123.853365
## [21] train-rmse:3097.014265+20.762150 test-rmse:3105.821080+124.047905
## [26] train-rmse:3086.384033+21.048207 test-rmse:3097.896380+124.196512
## [30] train-rmse:3080.140276+20.477083 test-rmse:3093.448556+124.214446
## [1] "training the xgb..."
## [1] "Predict using the xgb boosting model..."
## [1] "Linear model ..."
## [1] "Prediction using the linear model ..."
## [1] "rmse pair: 0.864157529177213 0.468970510965031"
## [1] "Choose linear model!"
distance_model(s3, v3)
## [1] "Boosting.."
## [1] train-rmse:3353.267334
## [2] train-rmse:3347.147705
## [3] train-rmse:3341.697266
## [4] train-rmse:3336.235107
## [5] train-rmse:3330.692627
## [6] train-rmse:3328.520752
## [7] train-rmse:3319.812500
## [8] train-rmse:3314.651855
## [9] train-rmse:3306.816406
## [10] train-rmse:3303.613281
## [1] train-rmse:3720.652204+20.263296 test-rmse:3719.175956+118.735096
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 15 rounds.
##
## [6] train-rmse:3611.480259+21.539782 test-rmse:3611.687395+121.134293
## [11] train-rmse:3532.828823+22.686449 test-rmse:3535.041574+122.711975
## [16] train-rmse:3475.814139+23.238142 test-rmse:3480.350760+124.084112
## [21] train-rmse:3435.555873+22.609275 test-rmse:3442.166992+125.462898
## [26] train-rmse:3407.413574+22.546615 test-rmse:3415.771205+126.055140
## [30] train-rmse:3389.891776+22.062070 test-rmse:3400.117955+126.563089
## [1] "training the xgb..."
## [1] "Predict using the xgb boosting model..."
## [1] "Linear model ..."
## [1] "Prediction using the linear model ..."
## [1] "rmse pair: 1.29667561725901 0.432325175060596"
## [1] "Choose linear model!"
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 5675175 303.1 12002346 641 12002346 641.0
## Vcells 120238662 917.4 290978907 2220 290928121 2219.7
For the test data, we will categorize the same trip_distance_cat and choose xgb model for DIST1 (short distance), while using the linear model for DIST2 and DIST3.
allDS <- add_rushhour_ind(allDS)
testDS <- allDS %>%
filter(is.na(dropoff_datetime))
trainDS <- allDS %>%
filter(!is.na(dropoff_datetime))
final_model <- function(testDS)
{
testList <- list()
trainList <- list()
predictionList <- list()
final_prediction <- data.frame()
for (i in 1:3)
{
print("A New iteration ")
trainList[[i]] <- subset(trainDS,
trip_distance_cat == paste('DIST', as.character(i), sep=""))
testList[[i]] <- subset(testDS,
trip_distance_cat == paste('DIST', as.character(i), sep=""))
## reset to 0 to avoid error
testList[[i]]$trip_duration <- 0
if (i == 1)
{
predictionList[[i]] <-
make_xgb_model_prediction(trainList[[i]], testList[[i]])
}
else
{
predictionList[[i]] <-
make_linear_model(trainList[[i]],testList[[i]])
}
# Construct a id <- prediction data frame
currResult <- as.data.frame(cbind(as.character(testList[[i]]$id), predictionList[[i]]))
final_prediction <- as.data.frame(
rbind(final_prediction, currResult))
}
names(final_prediction) <- c("id", "trip_duration")
final_prediction
}
final_prediction <- final_model(testDS)
## [1] "A New iteration "
## [1] "Boosting.."
## [1] train-rmse:2969.889893
## [2] train-rmse:2966.295654
## [3] train-rmse:2964.364258
## [4] train-rmse:2960.876465
## [5] train-rmse:2956.128174
## [6] train-rmse:2952.847168
## [7] train-rmse:2947.901123
## [8] train-rmse:2943.123291
## [9] train-rmse:2937.382812
## [10] train-rmse:2932.567383
## [1] train-rmse:3020.011928+17.067908 test-rmse:3018.615583+103.683028
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 15 rounds.
##
## [6] train-rmse:3004.289341+16.783060 test-rmse:3004.407924+104.038395
## [11] train-rmse:2993.304618+16.920295 test-rmse:2994.636754+104.248072
## [16] train-rmse:2984.909459+16.796478 test-rmse:2987.868757+104.322442
## [21] train-rmse:2978.676653+16.712486 test-rmse:2983.245920+104.324085
## [26] train-rmse:2973.899449+16.837438 test-rmse:2980.002651+104.329616
## [30] train-rmse:2971.202881+16.667993 test-rmse:2978.156843+104.324210
## [1] "training the xgb..."
## [1] "Predict using the xgb boosting model..."
## [1] "A New iteration "
## [1] "Linear model ..."
## [1] "Prediction using the linear model ..."
## [1] "A New iteration "
## [1] "Linear model ..."
## [1] "Prediction using the linear model ..."
write.table(final_prediction, file="prediction_trip_duration.csv",sep=",",
col.names=c("id", "trip_duration"), row.names=FALSE,quote=FALSE)
Finally, we plot the distribution for test dataset’s trip_duration/trip_distance (speed) v.s. training dataset’s log2 probability distribution.
speedDS <- inner_join(final_prediction, testDS, by="id") %>%
dplyr::select(id, trip_duration.x, trip_distance)
speedDSTest <- speedDS %>%
mutate(trip_speed= as.numeric(trip_duration.x) / trip_distance)
speedDSTrain <- trainDS %>%
dplyr::select(id, trip_duration, trip_distance) %>%
dplyr::mutate(trip_speed=as.numeric(trip_duration)/trip_distance)
pTest <- plot_ly(data=speedDSTest, x=~log2(trip_speed), xaxis="x1", type="histogram",histnorm = "probability")
pTrain <- plot_ly(data=speedDSTrain, x=~log2(trip_speed), xaxis="x1", type="histogram",histnorm = "probability")
# subplot
p <- subplot(pTest, pTrain, nrows = 2) %>%
layout(title = "Training v.s. Test Log Speed Probability",
xaxis = list(title = "Test Log2 Trip Speed"),
xaxis2 = list(title= "Train Log2 Trip Speed"),
showlegend=FALSE,showlegend2=FALSE)
p