Executive Summary: New York Taxi Trip Time Prediction

The dataset is obtained from the 2016 NYC Yellow Cab trip record data made available in BigQuery on Google Cloud Platform. The data was originally published by the NYC Taxi and Limousine Commission (TLC). This report demonstrates taxi trip time prediction from the following predictors: pickup location, drop-off location, and pickup datetime.

Exploratory Data Analysis

We did some exploratory data analysis in the previous report, "New York Taxi Popular Pickup Location Map."

We will continue to massage the data, this time with the goal of predicting the trip time of a taxi ride. First, we download the data from the Kaggle website and inspect the columns.

##  [1] "id"                 "vendor_id"          "pickup_datetime"   
##  [4] "dropoff_datetime"   "passenger_count"    "pickup_longitude"  
##  [7] "pickup_latitude"    "dropoff_longitude"  "dropoff_latitude"  
## [10] "store_and_fwd_flag" "trip_duration"
## [1] "id"                 "vendor_id"          "pickup_datetime"   
## [4] "passenger_count"    "pickup_longitude"   "pickup_latitude"   
## [7] "dropoff_longitude"  "dropoff_latitude"   "store_and_fwd_flag"
##           used  (Mb) gc trigger  (Mb) max used  (Mb)
## Ncells 2405122 128.5    3886542 207.6  3205452 171.2
## Vcells 3314474  25.3    5721718  43.7  4583724  35.0

We will use pickup_longitude, pickup_latitude, dropoff_longitude, dropoff_latitude, and pickup_datetime as the input variables; dropoff_datetime is used only to compute the outcome, and trip_duration will be our outcome variable.

To make better use of the geospatial and datetime data, we will add some new columns:

* Trip distance: We calculate the distance between the pickup and drop-off locations using the geosphere library (distGeo function), and add a new column named "trip_distance" that stores the distance in meters.

* Trip bearing / haversine: The bearing gives the initial direction of travel between the two geo points, and the haversine formula gives the great-circle (arc) distance between them.

* Pickup hour: We will derive the pickup hour from the pickup_datetime column using the hour() function.

* Weekday / weekend / holiday indicator: We will derive an indicator variable trip_wkday_ind from the pickup_datetime column, where "T" represents weekdays and "F" represents weekends; a sketch for folding in holidays follows the transformation code below.

pickups  <- cbind(allDS$pickup_longitude, allDS$pickup_latitude)
dropoffs <- cbind(allDS$dropoff_longitude, allDS$dropoff_latitude)
allDS$trip_distance <- distGeo(pickups, dropoffs)  # geodesic distance in meters
summary(allDS$trip_distance)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    1232    2094    3440    3877 1240510
allDS$trip_bearing <- bearing(pickups, dropoffs)          # initial direction of travel
allDS$trip_haversine <- distHaversine(pickups, dropoffs)  # great-circle distance in meters
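
To make the geosphere calls concrete, here is a small illustrative example (the coordinates are approximate and not taken from the dataset):

# Illustration only: bearing and haversine distance from Times Square to JFK.
ts  <- c(-73.9855, 40.7580)   # longitude, latitude
jfk <- c(-73.7781, 40.6413)
bearing(ts, jfk)        # initial compass bearing in degrees, roughly 127 (southeast)
distHaversine(ts, jfk)  # great-circle distance in meters, roughly 21,700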

allDS <- allDS %>% 
    filter(!is.na(pickup_datetime)) %>%
    mutate(pickup_datetime = as.POSIXct(strptime(pickup_datetime, format="%Y-%m-%d %H:%M:%S"))) %>%
    mutate(dropoff_datetime = as.POSIXct(strptime(dropoff_datetime, format="%Y-%m-%d %H:%M:%S"))) %>%
    mutate(trip_hour = hour(pickup_datetime)) %>%
    mutate(trip_wkday_ind = 
               ifelse(weekdays(pickup_datetime) %in% c("Saturday", "Sunday"), 
                      "F", "T")) %>% 
    mutate(trip_date = date(pickup_datetime)) %>%
    mutate(trip_duration_computed = as.numeric(difftime(dropoff_datetime, pickup_datetime, units="secs"))) %>%
    mutate(trip_month=month(trip_date)) %>%
    mutate(trip_wkday=weekdays(trip_date, abbreviate=TRUE))
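
The pipeline above only distinguishes weekends; the holiday part of the indicator described earlier is not implemented here. A minimal sketch of one way to fold public holidays into trip_wkday_ind (the holiday dates are illustrative, not from the original analysis):

# Illustration only: treat a few 2016 US public holidays like weekends ("F").
us_holidays_2016 <- as.Date(c("2016-01-01", "2016-05-30", "2016-07-04"))
allDS <- allDS %>%
    mutate(trip_wkday_ind = ifelse(trip_date %in% us_holidays_2016,
                                   "F", trip_wkday_ind))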
summary(allDS$trip_duration)         
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     1.0   397.0   662.0   952.8  1075.0 86392.0  625134
summary(allDS$trip_hour)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    9.00   14.00   13.61   19.00   23.00       1
summary(allDS$trip_wkday_ind)           
##    Length     Class      Mode 
##   2083774 character character

Additional predictors for pickup / drop-off locations

While building the popular pickup location map, we found some popular pickup locations near outlying areas such as the airports, even though most pickups are near Manhattan. We wonder whether those pickup/drop-off locations affect speed, since airports sit in the suburbs where driving speeds are usually higher. We will use k-means to cluster the pickup/drop-off locations and add two additional columns, trip_pickup_cluster and trip_dropoff_cluster, as predictors.

set.seed(7553)
pickupKMS <- kmeans(cbind(allDS$pickup_longitude, allDS$pickup_latitude), centers=10)
dropKMS <- kmeans(cbind(allDS$dropoff_longitude, allDS$dropoff_latitude), centers=10)
allDS$trip_pickup_cluster <-pickupKMS$cluster
allDS$trip_dropoff_cluster <- dropKMS$cluster
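
As a sanity check on the choice of k = 10, one could inspect an elbow plot of the total within-cluster sum of squares (a sketch, not part of the original analysis; a sample keeps it fast):

# Illustration only: elbow plot over candidate k on a sample of pickups.
samp <- allDS[sample(nrow(allDS), 50000), ]
wss <- sapply(2:15, function(k)
    kmeans(cbind(samp$pickup_longitude, samp$pickup_latitude),
           centers = k, nstart = 3)$tot.withinss)
plot(2:15, wss, type = "b", xlab = "k", ylab = "total within-cluster SS")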

pickupCenters <- as.data.frame(pickupKMS$centers)
pickupCenters$cnt <- pickupKMS$size
names(pickupCenters) <- c("lng", "lat", "cnt")
dropoffCenters <- as.data.frame(dropKMS$centers)
dropoffCenters$cnt <- dropKMS$size
names(dropoffCenters) <- c("lng", "lat", "cnt")


leaflet(pickupCenters) %>%
    addTiles() %>%
    setView(lng=-73.95214, lat=40.77422, zoom=10) %>% 
    addCircleMarkers(lng=~lng, lat=~lat,
               popup=as.character(pickupCenters$cnt),
               popupOptions=popupOptions(riseOnHover=TRUE))

Traffic aggregation for month, hour, and weekday

Some months might have snowfall, which could result in longer trip durations. We check whether there is any pattern between month and average trip duration.

aggrMonthDS <- allDS %>%
     filter(!is.na(trip_duration)) %>%
     group_by(trip_month) %>%
     summarise_at(vars(trip_duration), mean)
plot_ly(aggrMonthDS, x=~trip_month, y=~trip_duration, type="scatter", mode="markers+lines")

Additionally, we aggregate by trip_hour and find that average trip duration rises sharply during rush hours (8 am to 7 pm), while trips are fastest in the early morning (hours 1 to 4).

aggrHourDS <-allDS %>%
    filter(!is.na(trip_duration)) %>%
    group_by(trip_hour) %>%
    summarise_at(vars(trip_duration), mean)
hourPlot <- plot_ly(data=aggrHourDS, x=~trip_hour, y=~trip_duration, type="scatter", mode="markers+lines")
    
hourPlot

Finally, we aggregate by weekday to see whether there is any weekday vs. weekend pattern.

weekdayDS <- allDS %>%
    filter(!is.na(trip_duration)) %>%
    group_by(trip_wkday) %>%
    summarise_at(vars(trip_duration), mean)
weekdayPlot <- plot_ly(data=weekdayDS, x=~trip_wkday, y=~trip_duration, type="scatter",mode="markers+lines")
    
weekdayPlot

Preprocess Data for Modeling

We will use 70% of the data for training and hold out the remaining 30% for cross-validation.

set.seed(3456)
trainDS <- 
    allDS %>% 
    filter(!is.na(trip_duration)) %>%
    filter(!is.na(trip_duration_computed))
trainIndex <- createDataPartition(trainDS$trip_distance, p = .70, 
                                  list = FALSE, 
                                  times = 1)
tripTrain <- trainDS[trainIndex, ]
tripCV <- trainDS[-trainIndex,]
dim(tripTrain)
## [1] 1021048      22
dim(tripCV)
## [1] 437590     22

Regression Model

Initial linear model

We fit a linear regression model first and check the residuals. From the ANOVA of the model, we can see that trip_distance and all the factor variables are significant. A Student's t-test comparing the predictions against the cross-validation data shows the mean of the predictions is 953, while the mean of the actual durations is 950. The RMSLE is around 0.98. We also plot the residuals vs. the outcome variable; there is a tendency for the residuals to grow as trip_duration grows.

modelFit <- lm(data=tripTrain, trip_duration_computed ~ trip_distance +
        factor(trip_hour) +
        factor(trip_wkday_ind) +
        factor(trip_pickup_cluster) +
        factor(trip_month) +1)
anova(modelFit)
## Analysis of Variance Table
## 
## Response: trip_duration_computed
##                                  Df     Sum Sq    Mean Sq   F value
## trip_distance                     1 2.2291e+11 2.2291e+11 22505.186
## factor(trip_hour)                23 1.1101e+10 4.8266e+08    48.730
## factor(trip_wkday_ind)            1 9.4628e+08 9.4628e+08    95.539
## factor(trip_pickup_cluster)       9 4.7864e+09 5.3183e+08    53.695
## factor(trip_month)                5 1.2961e+09 2.5922e+08    26.171
## Residuals                   1021008 1.0113e+13 9.9047e+06          
##                                Pr(>F)    
## trip_distance               < 2.2e-16 ***
## factor(trip_hour)           < 2.2e-16 ***
## factor(trip_wkday_ind)      < 2.2e-16 ***
## factor(trip_pickup_cluster) < 2.2e-16 ***
## factor(trip_month)          < 2.2e-16 ***
## Residuals                                
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
predictData <- predict(modelFit, newdata=tripCV)
predictData <- as.data.frame(predictData)
t.test(predictData, tripCV$trip_duration_computed)
## 
##  Welch Two Sample t-test
## 
## data:  predictData and tripCV$trip_duration_computed
## t = 0.57382, df = 455340, p-value = 0.5661
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -6.636582 12.131181
## sample estimates:
## mean of x mean of y 
##  953.1921  950.4448
rmsleInitialModel <- sqrt(1/nrow(tripCV) *
 (sum((log2(predictData[,1]+1) - log2(tripCV$trip_duration_computed+1))^2)))  # RMSLE (log2 variant)
rmsleInitialModel
## [1] 0.984994
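
Note that the formula above uses log2, whereas the conventional RMSLE (for example, the one Kaggle uses for scoring) is defined with the natural logarithm. For reference, a sketch of the conventional form on the same predictions:

# Natural-log RMSLE on the same cross-validation predictions (a sketch).
rmsle_natural <- sqrt(mean((log1p(predictData[,1]) -
                            log1p(tripCV$trip_duration_computed))^2))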
residualsTrain <- resid(modelFit)  # residuals are on the training set
plot(x=tripTrain$trip_duration,  residualsTrain)

Residual Plot Diagnosis

We will try splitting the data into subsets with higher and lower residuals and fitting a different model to each. To understand which features make a row land in a high- or low-residual group, we will run a classification model on the residual categories. We add a new quantileRank variable that records which residual quartile each row falls into.

(q <- quantile(modelFit$residuals))
##           0%          25%          50%          75%         100% 
## -123803.1451    -369.8356    -194.2295      46.6426   85958.9993
tripTrain$quantileRank <- 1  # rows at the exact minimum stay in quartile 1
for (i in 1:4)
{
    curRowIndex <- which(modelFit$residuals <= q[[i + 1]] &
                             modelFit$residuals > q[[i]])
    print(paste("Current range:", q[[i + 1]], q[[i]]))
    print(length(curRowIndex))
    tripTrain[curRowIndex, "quantileRank"] <- i
}
## [1] "Current range: -369.83559334701 -123803.145099442"
## [1] 255261
## [1] "Current range: -194.229529016227 -369.83559334701"
## [1] 255262
## [1] "Current range: 46.6426028667614 -194.229529016227"
## [1] 255262
## [1] "Current range: 85958.9993074381 46.6426028667614"
## [1] 255262
summary(tripTrain$quantileRank)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    1.75    2.50    2.50    3.25    4.00
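
The same quartile rank can be computed more compactly with findInterval; a sketch (the handling of rows sitting exactly on a quartile boundary differs slightly from the loop above):

# Compact alternative to the loop above (illustrative sketch).
rank2 <- findInterval(modelFit$residuals, q,
                      rightmost.closed = TRUE, all.inside = TRUE)
table(rank2)  # roughly 255k rows per quartile, matching the loop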

Add rush hour and trip distance features

We run an rpart classification model to find which features drive these residual differences. From the tree below, we can see that differences in trip distance lead to different residual rankings.

set.seed(1548)
results_model <- rpart(data=tripTrain,
                          quantileRank ~ 
                           trip_distance +
                           as.factor(trip_hour) +
                           factor(trip_wkday) +
                           factor(trip_pickup_cluster) +
                           factor(trip_month),
                     method="class")
print(results_model)    
## n= 1021048 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 1021048 765786 1 (0.25000000 0.25000000 0.25000000 0.25000000)  
##   2) trip_distance< 1707.387 408395 242636 1 (0.40587911 0.32607647 0.18769084 0.08035358)  
##     4) as.factor(trip_hour)=8,9,10,11,12,13,14,15,16,17,18,22 262584 132858 1 (0.49403619 0.26835984 0.15174192 0.08586205) *
##     5) as.factor(trip_hour)=0,1,2,3,4,5,6,7,19,20,21,23 145811  83110 2 (0.24712127 0.43001557 0.25242951 0.07043364) *
##   3) trip_distance>=1707.387 612653 390207 4 (0.14609085 0.19928736 0.29153534 0.36308645)  
##     6) trip_distance< 3176.34 287969 189316 3 (0.13683417 0.26247617 0.34258201 0.25810764) *
##     7) trip_distance>=3176.34 324684 176565 4 (0.15430080 0.14324389 0.24626098 0.45619433) *

It looks like rush hour and trip distance both affect the residuals, so we will form a categorical trip_distance_cat variable as a new feature, using the rpart tree's split points as the cut points when we bin the trip distance.

add_rushhour_ind <- function(tripTrain) 
{
    tripTrain <- tripTrain %>%
        # rush-hour indicator: "R" for pickup hours 8-18, "N" otherwise
        mutate(trip_rush_hour = factor(ifelse(trip_hour %in% c(8:18), "R", "N"))) %>%
        # distance bins using the rpart split points
        mutate(trip_distance_cat = 
               factor(case_when(
                   trip_distance < 1707.387 ~ 'DIST1',
                   trip_distance >= 1707.387 & trip_distance < 3176.34 ~ 'DIST2',
                   trip_distance >= 3176.34 ~ 'DIST3'))) %>%
        # combined pickup/dropoff cluster id, e.g. "3_7"
        mutate(trip_cluster_cat = 
                   paste(as.character(trip_pickup_cluster), 
                         "_",
                         as.character(trip_dropoff_cluster), sep=""))
    tripTrain
}

tripTrain <- add_rushhour_ind(tripTrain)
tripCV <- add_rushhour_ind(tripCV)

Subset short distance and long distance data to choose different models

We wonder whether short-distance rides behave differently from long-distance ones, since short downtown trips usually move much more slowly than long-distance trips. We will subset the data according to trip_distance_cat.

We will try an XGBoost (extreme gradient boosting) model and a linear model, and compare their RMSLE to see which fits each subset better. We will run both models on all three trip distance categories.
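
The rmsle() helper called in distance_model() below presumably comes from the Metrics package; for reference, its definition is equivalent to this sketch (the formula is symmetric, so argument order does not matter):

# Reference definition equivalent to Metrics::rmsle() (a sketch).
rmsle_ref <- function(predicted, actual)
    sqrt(mean((log1p(predicted) - log1p(actual))^2))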

s1 <- subset(tripTrain, trip_distance_cat == 'DIST1')
s2 <- subset(tripTrain, trip_distance_cat == 'DIST2')
s3 <- subset(tripTrain, trip_distance_cat == 'DIST3')


v1 <- subset(tripCV, trip_distance_cat == 'DIST1')
v2 <- subset(tripCV, trip_distance_cat == 'DIST2')
v3 <- subset(tripCV, trip_distance_cat == 'DIST3')
make_xgb_matrix <- function(d)
{
    # one-hot encode the factor predictors into a sparse design matrix
    sparse_matrix <- sparse.model.matrix(trip_duration~
                                             trip_distance+
                                             trip_rush_hour+
                                             factor(trip_pickup_cluster)+
                                             trip_month+
                                             factor(trip_wkday_ind), data = d)
    output_vector <- d$trip_duration
    ret <- xgb.DMatrix(data=sparse_matrix, label=output_vector)
    ret
}
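
For intuition, sparse.model.matrix one-hot encodes each factor into sparse indicator columns; a tiny standalone illustration (toy data, not from the dataset):

# Illustration only: how sparse.model.matrix expands a factor.
toy <- data.frame(y = c(1, 2, 3),
                  dist = c(100, 200, 300),
                  hour = factor(c(0, 8, 8)))
sparse.model.matrix(y ~ dist + hour, data = toy)
# -> a 3 x 3 sparse matrix with columns (Intercept), dist, hour8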
make_xgb_model_prediction <- function(s, v)
{
    print("Boosting..")
    sparse_matrix_s <- make_xgb_matrix(s)
    output_vector <- s$trip_duration
    # quick baseline fit; replaced below by the cross-validated model
    bst <- xgboost(data = sparse_matrix_s, label = output_vector, max_depth = 4,
                   eta = 1, nthread = 2, nrounds = 10,
                   eval.metric = "rmse", objective = "reg:linear")
    foldsCV <- createFolds(output_vector, k=7, list=TRUE, returnTrain=FALSE)
    
    param <- list(colsample_bytree = 0.7
                  , booster = "gbtree"
                  , objective = "reg:linear"
                  , subsample = 0.7
                  , max_depth = 5
                  , eta = 0.037
                  , eval_metric = 'rmse'
                  , base_score = 0.012 #average
                  , seed = 4321)
    
    bst <- xgb.cv(data=sparse_matrix_s,
                  params=param, 
                  nrounds = 30,
                  folds=foldsCV,label=output_vector,
           prediction=TRUE, nthread = 2,
            early_stopping_rounds = 15,print_every_n = 5)
    
    nrounds <- bst$best_iteration
    
    
    print("training the xgb...")
    xgb <- xgb.train(params = param
                     , data = sparse_matrix_s
                     , nrounds = nrounds
                     , verbose = 1
                     , print_every_n = 5
                     #, feval = amm_mae
                    )
    
    sparse_matrix_v <- make_xgb_matrix(v)
    
    print("Predict using the xgb boosting model...")
    predictedxgb <- predict(xgb, sparse_matrix_v)
    predictedxgb
}
make_linear_model <- function(s, v)
{
    print("Linear model ...")
    split_model <- lm(data=s,
                       trip_duration ~ 
                           trip_distance +
                           trip_rush_hour +
                           factor(trip_pickup_cluster) + 
                           1)
    print("Prediction using the linear model ...")
    predictv <- predict(split_model, newdata=v)
    predictv
}
distance_model <- function(s, v)
{
    predictedxgb <- make_xgb_model_prediction(s, v)
    xgbRmse <- rmsle(predictedxgb, v$trip_duration)
    predictedlinear <- make_linear_model(s,v)
    linearRmse <- rmsle(predictedlinear, v$trip_duration)
    print(paste("rmse pair:", as.character(xgbRmse), as.character(linearRmse)))
    if (xgbRmse < linearRmse)
        print("Choose xgb model!")
    else
        print("Choose linear model!")
    
}

The model selected for the short distance category is XGBoost: its RMSLE is 0.65 vs. 0.73 for the linear model.

distance_model(s1, v1)
## [1] "Boosting.."
## [1]  train-rmse:2984.521484 
## [2]  train-rmse:2979.577148 
## [3]  train-rmse:2976.620117 
## [4]  train-rmse:2971.056641 
## [5]  train-rmse:2965.007324 
## [6]  train-rmse:2958.332520 
## [7]  train-rmse:2951.788330 
## [8]  train-rmse:2948.844971 
## [9]  train-rmse:2944.350586 
## [10] train-rmse:2935.891846 
## [1]  train-rmse:3035.542864+26.830781    test-rmse:3031.663888+163.985620 
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 15 rounds.
## 
## [6]  train-rmse:3018.967739+26.977943    test-rmse:3017.510638+164.435194 
## [11] train-rmse:3007.134452+27.399618    test-rmse:3007.754290+164.680793 
## [16] train-rmse:2998.069824+27.037241    test-rmse:3000.996129+164.847286 
## [21] train-rmse:2991.151298+27.229089    test-rmse:2996.365792+164.906113 
## [26] train-rmse:2985.566267+27.302152    test-rmse:2993.183524+164.933369 
## [30] train-rmse:2981.920968+26.928495    test-rmse:2991.368583+164.903738 
## [1] "training the xgb..."
## [1] "Predict using the xgb boosting model..."
## [1] "Linear model ..."
## [1] "Prediction using the linear model ..."
## [1] "rmse pair: 0.649752416243766 0.73136677063332"
## [1] "Choose xgb model!"

The model selected for both the medium and long distance categories is the linear model: its RMSLE is 0.47 and 0.43, respectively, vs. 0.86 and 1.30 for XGBoost.

distance_model(s2, v2)
## [1] "Boosting.."
## [1]  train-rmse:3073.105713 
## [2]  train-rmse:3067.100586 
## [3]  train-rmse:3060.443848 
## [4]  train-rmse:3051.259521 
## [5]  train-rmse:3040.682861 
## [6]  train-rmse:3034.505127 
## [7]  train-rmse:3025.084229 
## [8]  train-rmse:3024.862061 
## [9]  train-rmse:3016.772217 
## [10] train-rmse:3013.151123 
## [1]  train-rmse:3193.683070+19.818856    test-rmse:3191.861747+121.749300 
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 15 rounds.
## 
## [6]  train-rmse:3156.515799+19.916089    test-rmse:3157.417515+122.722019 
## [11] train-rmse:3130.253313+19.895432    test-rmse:3133.693324+123.390013 
## [16] train-rmse:3110.306815+19.993208    test-rmse:3117.196498+123.853365 
## [21] train-rmse:3097.014265+20.762150    test-rmse:3105.821080+124.047905 
## [26] train-rmse:3086.384033+21.048207    test-rmse:3097.896380+124.196512 
## [30] train-rmse:3080.140276+20.477083    test-rmse:3093.448556+124.214446 
## [1] "training the xgb..."
## [1] "Predict using the xgb boosting model..."
## [1] "Linear model ..."
## [1] "Prediction using the linear model ..."
## [1] "rmse pair: 0.864157529177213 0.468970510965031"
## [1] "Choose linear model!"
distance_model(s3, v3)
## [1] "Boosting.."
## [1]  train-rmse:3353.267334 
## [2]  train-rmse:3347.147705 
## [3]  train-rmse:3341.697266 
## [4]  train-rmse:3336.235107 
## [5]  train-rmse:3330.692627 
## [6]  train-rmse:3328.520752 
## [7]  train-rmse:3319.812500 
## [8]  train-rmse:3314.651855 
## [9]  train-rmse:3306.816406 
## [10] train-rmse:3303.613281 
## [1]  train-rmse:3720.652204+20.263296    test-rmse:3719.175956+118.735096 
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 15 rounds.
## 
## [6]  train-rmse:3611.480259+21.539782    test-rmse:3611.687395+121.134293 
## [11] train-rmse:3532.828823+22.686449    test-rmse:3535.041574+122.711975 
## [16] train-rmse:3475.814139+23.238142    test-rmse:3480.350760+124.084112 
## [21] train-rmse:3435.555873+22.609275    test-rmse:3442.166992+125.462898 
## [26] train-rmse:3407.413574+22.546615    test-rmse:3415.771205+126.055140 
## [30] train-rmse:3389.891776+22.062070    test-rmse:3400.117955+126.563089 
## [1] "training the xgb..."
## [1] "Predict using the xgb boosting model..."
## [1] "Linear model ..."
## [1] "Prediction using the linear model ..."
## [1] "rmse pair: 1.29667561725901 0.432325175060596"
## [1] "Choose linear model!"
##             used  (Mb) gc trigger (Mb)  max used   (Mb)
## Ncells   5675175 303.1   12002346  641  12002346  641.0
## Vcells 120238662 917.4  290978907 2220 290928121 2219.7

Final Kernel

For the test data, we derive the same trip_distance_cat categories and choose the XGBoost model for DIST1 (short distance), while using the linear model for DIST2 and DIST3.

allDS <- add_rushhour_ind(allDS)
testDS <- allDS %>%
    filter(is.na(dropoff_datetime))
trainDS <- allDS %>%
    filter(!is.na(dropoff_datetime))
final_model <- function(testDS)
{
    # note: trainDS is read from the enclosing (global) environment
    testList <- list()
    trainList <- list()
    predictionList <- list()
    final_prediction <- data.frame()
    for (i in 1:3) 
    {
        print("A New iteration ")
        trainList[[i]] <- subset(trainDS, 
                    trip_distance_cat == paste('DIST', as.character(i), sep=""))
        testList[[i]] <- subset(testDS, 
                    trip_distance_cat == paste('DIST', as.character(i), sep=""))
        ## placeholder: trip_duration is NA in the test set, but the
        ## model formula in make_xgb_matrix still references it
        testList[[i]]$trip_duration <- 0
        if (i == 1)
        {
            predictionList[[i]] <- 
                 make_xgb_model_prediction(trainList[[i]], testList[[i]])
        }
        else
        {
            predictionList[[i]]  <-
                make_linear_model(trainList[[i]],testList[[i]])
        }
        # Construct a id <- prediction data frame
        currResult <- as.data.frame(cbind(as.character(testList[[i]]$id), predictionList[[i]]))
        final_prediction <- as.data.frame(
            rbind(final_prediction, currResult))
        
    }
    names(final_prediction) <- c("id", "trip_duration")
    final_prediction
}
final_prediction <- final_model(testDS) 
## [1] "A New iteration "
## [1] "Boosting.."
## [1]  train-rmse:2969.889893 
## [2]  train-rmse:2966.295654 
## [3]  train-rmse:2964.364258 
## [4]  train-rmse:2960.876465 
## [5]  train-rmse:2956.128174 
## [6]  train-rmse:2952.847168 
## [7]  train-rmse:2947.901123 
## [8]  train-rmse:2943.123291 
## [9]  train-rmse:2937.382812 
## [10] train-rmse:2932.567383 
## [1]  train-rmse:3020.011928+17.067908    test-rmse:3018.615583+103.683028 
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 15 rounds.
## 
## [6]  train-rmse:3004.289341+16.783060    test-rmse:3004.407924+104.038395 
## [11] train-rmse:2993.304618+16.920295    test-rmse:2994.636754+104.248072 
## [16] train-rmse:2984.909459+16.796478    test-rmse:2987.868757+104.322442 
## [21] train-rmse:2978.676653+16.712486    test-rmse:2983.245920+104.324085 
## [26] train-rmse:2973.899449+16.837438    test-rmse:2980.002651+104.329616 
## [30] train-rmse:2971.202881+16.667993    test-rmse:2978.156843+104.324210 
## [1] "training the xgb..."
## [1] "Predict using the xgb boosting model..."
## [1] "A New iteration "
## [1] "Linear model ..."
## [1] "Prediction using the linear model ..."
## [1] "A New iteration "
## [1] "Linear model ..."
## [1] "Prediction using the linear model ..."
write.table(final_prediction, file="prediction_trip_duration.csv",sep=",",
          col.names=c("id", "trip_duration"), row.names=FALSE,quote=FALSE)

Finally, we plot the log2 probability distribution of trip_duration / trip_distance (seconds per meter, the inverse of speed) for the test predictions against the same distribution for the training data.

speedDS <- inner_join(final_prediction, testDS, by="id") %>%
    dplyr::select(id, trip_duration.x, trip_distance)
speedDSTest <- speedDS %>%
    mutate(trip_speed= as.numeric(trip_duration.x) / trip_distance)

speedDSTrain <- trainDS %>%
    dplyr::select(id, trip_duration, trip_distance) %>%
    dplyr::mutate(trip_speed=as.numeric(trip_duration)/trip_distance)

pTest <- plot_ly(data=speedDSTest, x=~log2(trip_speed), xaxis="x1", type="histogram",histnorm = "probability") 
pTrain <- plot_ly(data=speedDSTrain, x=~log2(trip_speed), xaxis="x1", type="histogram",histnorm = "probability")
     
# subplot
p <- subplot(pTest, pTrain, nrows = 2) %>%
  layout(title = "Training v.s. Test Log Speed Probability",
         xaxis = list(title = "Test Log2 Trip Speed"),
         xaxis2 = list(title= "Train Log2 Trip Speed"),
         showlegend=FALSE,showlegend2=FALSE)

p