Executive Summary: New York Taxi Ride Map

The dataset comes from the 2016 NYC Yellow Cab trip records made available in BigQuery on Google Cloud Platform. The data was originally published by the NYC Taxi and Limousine Commission (TLC). This report shows the popular pickup spots during two time frames of the day: morning and evening.

Exploratory Data Analysis

The training dataset is downloaded from the Kaggle website and unzipped. It contains about 1.46 million rows and 11 columns.

# train.zip is downloaded from the Kaggle competition page:
# https://www.kaggle.com/c/nyc-taxi-trip-duration/data/
if (file.exists("train.bin")) {
    # Reuse the cached binary copy written by an earlier run
    trainDS <- readRDS("train.bin")
} else {
    # First run: extract and parse the CSV, then cache it for next time
    unzip("train.zip")
    trainDS <- read.table("train.csv", header=TRUE, sep=",")
    saveRDS(trainDS, file="train.bin")
}
    
dim(trainDS)
## [1] 1458644      11
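
The loading step above keeps a binary copy of the parsed CSV so that later runs can skip the slow text parse. As a minimal sketch of that caching pattern, the helper below wraps it in a function and exercises it on a tiny temporary file. The name `load_cached` and the toy data are assumptions made for illustration; they are not part of the original analysis.

```r
# Hypothetical helper: parse a CSV once, then reuse a cached RDS copy.
load_cached <- function(csv_path, cache_path) {
  if (file.exists(cache_path)) {
    return(readRDS(cache_path))  # fast path: cache already written
  }
  data <- read.csv(csv_path, stringsAsFactors = FALSE)
  saveRDS(data, file = cache_path)
  data
}

# Toy demonstration with temporary files (not the real train.csv).
csv_path   <- tempfile(fileext = ".csv")
cache_path <- tempfile(fileext = ".rds")
write.csv(data.frame(id = c("a", "b"), trip_duration = c(455L, 663L)),
          csv_path, row.names = FALSE)

first  <- load_cached(csv_path, cache_path)  # parses the CSV, writes the cache
second <- load_cached(csv_path, cache_path)  # read back from the cache
all.equal(first, second)
```

One design note: the cache is never invalidated here, so if the CSV changes the cache file has to be deleted by hand.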

The available columns are shown below. We will use pickup_datetime, pickup_longitude, and pickup_latitude to show the popular taxi pickup locations during the morning (7 am to 9 am) and the evening (5 pm to 7 pm).

names(trainDS)
##  [1] "id"                 "vendor_id"          "pickup_datetime"   
##  [4] "dropoff_datetime"   "passenger_count"    "pickup_longitude"  
##  [7] "pickup_latitude"    "dropoff_longitude"  "dropoff_latitude"  
## [10] "store_and_fwd_flag" "trip_duration"

Preprocessing data

We first preprocess the data to categorize the pickup time frame of each taxi ride. A new column named “pickup_timeframe” is added to the dataset. Pickups between 7 am and 9 am are assigned to “morning”, while pickups between 5 pm and 7 pm are assigned to “evening”. Because the comparison is made on the hour value, each window runs through the end of its last hour (09:59 and 19:59). We also convert pickup_longitude and pickup_latitude to numeric for easier processing later. To speed up the later analysis, we keep only the morning and evening rows.

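library(dplyr)      # select, mutate, case_when, filter, %>%
library(lubridate)  # hour

finalDS <- trainDS %>%
    select(id, pickup_datetime, pickup_longitude, pickup_latitude) %>%
    # Parse the pickup timestamp so the hour can be extracted
    mutate(pt = as.POSIXct(strptime(pickup_datetime, format="%Y-%m-%d %H:%M:%S"))) %>%
    # Label each ride by the hour of its pickup
    mutate(pickup_timeframe=case_when(
                            hour(pt) >= 7 & hour(pt) <= 9 ~ "morning",
                            hour(pt) >= 17 & hour(pt) <= 19 ~ "evening",
                            TRUE ~ "other")) %>%
    mutate(pickup_timeframe=as.factor(pickup_timeframe)) %>%
    # Coerce coordinates to numeric (via character, in case they were read as factors)
    mutate(pickup_longitude=as.numeric(as.character(pickup_longitude))) %>%
    mutate(pickup_latitude=as.numeric(as.character(pickup_latitude))) %>%
    # Keep only the two time frames of interest
    filter(pickup_timeframe %in% c("morning", "evening"))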
dim(finalDS)
## [1] 447707      6
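
To make the window boundaries concrete, here is a minimal base-R sketch of the same hour rules applied to a few made-up timestamps. The helper name `classify_timeframe`, the toy timestamps, and the fixed `tz = "UTC"` are assumptions for illustration only; the pipeline above uses `case_when()` with lubridate's `hour()` and the session's default time zone.

```r
# Hypothetical helper mirroring the hour rules used in the pipeline above.
classify_timeframe <- function(timestamps) {
  parsed <- as.POSIXlt(timestamps, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
  hr <- parsed$hour
  ifelse(hr >= 7 & hr <= 9, "morning",
         ifelse(hr >= 17 & hr <= 19, "evening", "other"))
}

# Made-up timestamps chosen to sit on either side of each boundary.
toy <- c("2016-03-14 06:59:59", "2016-03-14 07:00:00",
         "2016-03-14 09:59:59", "2016-03-14 10:00:00",
         "2016-03-14 17:24:55", "2016-03-14 19:59:59",
         "2016-03-14 20:00:00")

labels <- classify_timeframe(toy)
data.frame(pickup_datetime = toy, pickup_timeframe = labels)
```

The `<= 9` and `<= 19` comparisons keep pickups up to 09:59:59 and 19:59:59, so each window spans three full clock hours. If strictly 7 am to 9 am were intended, the upper bounds would need to be `< 9` and `< 19`, which would change the row count reported above.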