Tyler Gordon
September 24, 2015
Assignment description: http://eeyore.ucdavis.edu/stat141/Hws/assignment1.html
setwd("~tgordon/Documents/Statistics/141/1/")
load("vehicles.rda")
library(pander) # For printing in markdown
# panderOptions('table.style', 'rmarkdown')
panderOptions('table.style', 'multiline')
panderOptions('table.split.table', Inf)
panderOptions('table.alignment.default', "left")
panderOptions('table.alignment.rownames', "left")
library(RColorBrewer)
pal <- brewer.pal(5, "Set2") # Colors for later plots
pal7 <- brewer.pal(7, "Dark2")
library(maps)
library(lattice)
dim(vposts)
The dataset has 34677 observations.
The variables in the dataset have the following types:
vartypes <- sapply(vposts, class)
vartypes
mean(is.na(vposts$price), na.rm = F) # proportion of NA values
Fewer than 10% of vehicles have no recorded value for price and are omitted.
mean(vposts$price, na.rm = T)
median(vposts$price, na.rm = T)
The mean vehicle price is 49449.9, and the median price is 6700. The large discrepancy between the two suggests skew in the distribution of prices, potentially due to outliers.
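As a small illustration of why a large gap between the mean and the median points to outliers, the sketch below uses made-up prices (not values from `vposts`) and adds a single extreme entry. The mean jumps by several orders of magnitude while the median moves only slightly.

```r
# Hypothetical prices for illustration; not taken from vposts
prices <- c(2500, 4000, 6700, 9000, 15000)
mean(prices)
median(prices)

# One mistyped price dominates the mean but barely shifts the median
prices.bad <- c(prices, 600030000)
mean(prices.bad)
median(prices.bad)
```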
set.caption("Deciles of Vehicle Prices")
pander(quantile(vposts$price, na.rm = T, probs = seq(0.1,1,0.1)))
hist(vposts$price, main = "Histogram of Vehicle Prices", xlab = "Prices ($)")
abline(v = mean(vposts$price, na.rm = T), col = "blue", lty = "dashed", lwd = 2)
abline(v = median(vposts$price, na.rm = T), col = "orange", lty = "dotted", lwd = 2)
abline(v = as.vector(quantile(vposts$price, na.rm = T, probs = seq(0.1,1,0.1))), col = "green",
lty = "dashed", lwd = 2)
legend("topright", legend=c("Mean","Median","Deciles"),
lty=c("dashed","dotted","dashed"), col=c("blue","orange","green"))
rug(vposts$price, col = "red")
The three outliers visible in the histogram are so large relative to the other values that the mean, median, and all ten deciles are plotted in the same location, a clear indication of a potential problem with these observations.
Because dealing with these observations is the subject of question 8, and is necessary to produce a useful answer to this question, question 8 is included here and this question is continued afterwards.
head(sort(vposts$price, decreasing = T))
The highest price, which occurs twice, is $600030000. The top three values are each an order of magnitude above the next highest value, which is suspicious.
The two observations with the highest price are examined first:
set.caption("The two observations of highest price")
t(vposts[which(vposts$price == max(vposts$price, na.rm = T)),][c("body", "price")])
From the description in the body, it is clear that the price is not meant to be $600030000.
Because it is not known what price the cars have sold or will be sold for,
if they have been or will be sold at all, the values must be either
removed or replaced with an estimate. I replace both with $18000,
the average of the given low and high values, under the unsupported
assumption that it is more likely that the cars will be sold for an
intermediate value than one of the extremes.
vposts[which(vposts$price == max(vposts$price, na.rm = T)),]$price = 18000
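The midpoint used above can be reproduced from the concatenated value. This is a minimal sketch that assumes, consistent with the $18000 midpoint, that the listed price is 6000 and 30000 pasted together. The helper `range.midpoint` and its `split.at` argument (the number of digits in the first price) are hypothetical; `split.at` has to be chosen by reading the post, since it cannot be inferred from the number alone.

```r
# Hypothetical helper: split a concatenated price such as 600030000
# into its two parts and return their midpoint.
range.midpoint <- function(price, split.at) {
  digits <- sprintf("%.0f", price)  # plain digits, no scientific notation
  first  <- as.numeric(substr(digits, 1, split.at))
  second <- as.numeric(substr(digits, split.at + 1, nchar(digits)))
  (first + second) / 2
}

range.midpoint(600030000, split.at = 4)  # parts: 6000 and 30000
```

The order of the two parts does not matter for the midpoint, so the same helper applies whether the poster wrote the low or the high price first.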
The new maximum value is $30002500:
head(sort(vposts$price, decreasing = T))
set.caption("The new observation of highest price")
t(vposts[which(vposts$price == max(vposts$price, na.rm = T)),][c("body", "price")])
The phrases "willing to trade for old school or
truck??????????????????" and "new tires no bends no cracks" strongly
suggest that $30002500 is not the actual value. Unlike the
previous observations, however, the description does not clarify the
intended price of the vehicle. Based on the precedent of the previous
observations, I assume that the poster meant to indicate a range between
$2500 and $3000 and replace the current value 30002500 with 2750,
the average of the assumed range.
vposts[which(vposts$price == max(vposts$price, na.rm = T)),]$price = 2750
The new maximum value is 9999999:
head(sort(vposts$price, decreasing = T))
set.caption("The new observation of highest price")
t(vposts[which(vposts$price == max(vposts$price, na.rm = T)),]
[c("header", "body", "description", "location", "price")])
This observation is clearly either a joke or an unorthodox attempt at advertising. It is possible that the poster intended to use the incredibly high listed price and the absurdly low price in the description to attract offers, but phrases like "Selling my car for some lunch money", "Comes with complimentary Oboe", and "NEW USED CAR, GOOD BAD CONDITION", and especially the listing of the location as "(EVERYWHERE)" lead me to believe that this post is only a joke. As a result, I believe it is better to remove this observation entirely rather than use an estimate for a vehicle that is not likely to be sold at all.
# Remove the joke Honda observation
vposts <- vposts[ - which(vposts$price == max(vposts$price, na.rm = T)),]
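One caveat with the `- which(...)` idiom used above: if no row matches, `which()` returns an empty vector and the negative index then selects zero rows instead of all of them. That cannot happen here as long as at least one price is present, because the maximum always matches itself, but a logical index is a safer general pattern. The sketch uses a hypothetical toy data frame `df`.

```r
# Toy data frame for illustration only
df <- data.frame(price = c(100, 200, NA, 9999999))

# Negative indexing with an empty which() silently drops every row
n.after.empty <- nrow(df[-which(df$price > 1e9), , drop = FALSE])
n.after.empty

# Logical indexing removes only the maximum and keeps NA rows
keep <- is.na(df$price) | df$price != max(df$price, na.rm = TRUE)
df.clean <- df[keep, , drop = FALSE]
nrow(df.clean)
```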
mean(vposts$price, na.rm = T)
median(vposts$price, na.rm = T)
The mean vehicle price, previously 49449.9, is now 9894.919, and the median price, previously 6700, is still 6700.
set.caption("Updated Deciles of Vehicle Prices")
pander(quantile(vposts$price, na.rm = T, probs = seq(0.1,1,0.1)))
With the largest erroneous price values corrected, a slightly better view of the distribution is available:
boxplot(vposts$price,horizontal = T)
abline(v = as.vector(quantile(vposts$price, na.rm = T, probs = seq(0.1,1,0.1))),
col = "green", lty = "dotted")
abline(v = mean(vposts$price, na.rm = T), col = "blue", lty = "dashed")
abline(v = median(vposts$price, na.rm = T), col = "orange", lty = "dashed")
legend("topright", legend=c("Mean","Median","Deciles"), lty=c("dashed","dashed","dotted"),
col=c("blue","orange","green"))
The types of vehicles that are included in the dataset and the proportions of each type are as follows:
sort(prop.table(table(vposts$type)), decreasing = TRUE) == sort(table(vposts$type) / length(vposts$type), decreasing = TRUE)
type.props <- sort(table(vposts$type, useNA = "always") / length(vposts$type), decreasing = TRUE)
names(type.props)[is.na(names(type.props))] <- "NA" # Use string for plots
barp <- barplot(type.props, horiz=F, cex.names = 0.75, ylim = c(0, 0.6), las = 2, srt = 90)
title(main="Proportions of Vehicle Types", ylab="Proportion")
text(x=barp, y=type.props, pos=3, labels = round(type.props,3))
round(type.props, 3)
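The two ways of computing proportions compared in the code above are not interchangeable when values are missing: `prop.table(table(x))` divides by the number of non-missing values, while `table(x) / length(x)` divides by the number of all observations. A minimal sketch on a made-up vector (the name `type.toy` is hypothetical):

```r
# Toy vector with one missing value, for illustration only
type.toy <- factor(c("sedan", "sedan", "SUV", NA))

# Denominator excludes NA: proportions among vehicles with a listed type
props.listed <- prop.table(table(type.toy))
props.listed

# Denominator includes NA: proportions among all posts
props.all <- table(type.toy, useNA = "always") / length(type.toy)
props.all
```

The second form is the one that supports statements such as "nearly half of the observations are missing a value".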
Nearly half of the observations are missing a value for this
variable. Some vehicles may be difficult to classify, but these can be
represented with the "other" category. Instead, the high proportion of
missing values may be a result of the users of the website not feeling
the need to specify the category of their vehicle when they have already
listed the specific model elsewhere in their posting, as seen in the
other variables `title`, `body`, and `header`.
To test this idea, it might be useful to examine these three other
variables to see if more popular vehicles are more likely to have the
type value omitted, because the posters may assume that users searching
for a car on the website will recognize the types of popular vehicles.
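One way to start testing this idea is sketched below on a hypothetical toy data frame. It uses the number of posts per `maker` as a rough stand-in for popularity, which is an assumption; the text proposes examining `title`, `body`, and `header` instead.

```r
# Toy data for illustration only; column names mirror vposts
toy <- data.frame(
  maker = c("honda", "honda", "honda", "honda", "saab", "saab"),
  type  = c(NA, NA, "sedan", NA, "sedan", "wagon"),
  stringsAsFactors = FALSE
)

# Number of posts per maker and share of those posts with a missing type
n.posts      <- table(toy$maker)
prop.missing <- tapply(is.na(toy$type), toy$maker, mean)

data.frame(posts = as.vector(n.posts),
           prop.missing = as.vector(prop.missing),
           row.names = names(n.posts))
```

If the idea holds, makers with more posts would tend to show a higher share of missing `type` values.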
invisible(sapply(levels(vposts$transmission), function(transm.type)
{
trans.subs <- subset(vposts, transmission == transm.type)
fuel.type.table <- table(trans.subs$fuel, trans.subs$type, useNA = "always")
fuel.type.prop.table <- apply(fuel.type.table, 2, function(a.col){
a.col / sum(a.col)
})
barplot(fuel.type.prop.table, ylab="Fuel Type",
main=paste("Transmission: ", transm.type), col=pal, cex.axis=0.5, las=2)
legend("bottomright", legend = levels(trans.subs$fuel), fill = pal, xpd = TRUE)
}))
The plots show that the "other" fuel type is much more popular in vehicles of the "other" transmission type, that "electric" and "hybrid" are less popular fuel types for manual hatchbacks than automatic ones, and that gas is by far the most popular fuel type overall.
length(levels(vposts$city))
7 cities are represented in the dataset.
city.byOwner.table <- table(vposts$city, vposts$byOwner)
colnames(city.byOwner.table) <- c("By Dealer", "By Owner")
city.byOwner.props <- t(apply(city.byOwner.table, 1, function(a.row) a.row / sum(a.row)))
dotplot(city.byOwner.table, horizontal = FALSE, auto.key=TRUE,
main="Number of Vehicle Posts", ylab="Frequency", xlab="City")
dotplot(city.byOwner.props, horizontal = FALSE, auto.key=TRUE,
main="Proportion of Vehicle Posts", ylab="Proportion", xlab="City")
The proportion of posts by dealer is greater than or equal to the
proportion by owner in each city. The differences in proportion are
minimal, however, with the greatest being approximately 0.02.
As stated in class, the near equivalence of these proportions is due to
the method of data collection, so the potential inference is limited.
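The size of the largest gap can be computed directly rather than read off the dot plot. A minimal sketch with hypothetical counts (not the real city totals), using `prop.table()` with `margin = 1` for row-wise proportions:

```r
# Hypothetical counts for illustration; not the real city totals
counts <- matrix(c(2500, 2450,
                   2480, 2480),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("cityA", "cityB"),
                                 c("By Dealer", "By Owner")))

props <- prop.table(counts, margin = 1)  # each row sums to 1
gap   <- props[, "By Dealer"] - props[, "By Owner"]
max(abs(gap))                            # largest dealer/owner difference
```

Applied to `city.byOwner.table`, the same two lines would give the figure quoted above.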
vp.split.city <- split(vposts, vposts$city)
# Creates a matrix of the top 3 vehicle makers for each city
top3.makers <- sapply(vp.split.city, function(city.df){
city.split.byOwner <- split(city.df, city.df$byOwner)
city.byOwner <- city.split.byOwner$"TRUE"
city.byDealer <- city.split.byOwner$"FALSE"
table.byOwner <- sort(table(city.byOwner$maker), decreasing = TRUE)
table.byDealer <- sort(table(city.byDealer$maker), decreasing = TRUE)
# Save only the names of the top 3 makers
top3.byOwner <- names(head(table.byOwner, 3))
top3.byDealer <- names(head(table.byDealer, 3))
top3 <- c(top3.byOwner, top3.byDealer)
names(top3) <- c(rep("By Owner", 3), rep("By Dealer", 3))
return(top3)
})
set.caption("The three most popular makers (most popular first)")
top3.makers
Overall, the top makers are fairly consistent between vehicles sold by owner and by dealer: in each city at least 2 of the top 3 makers are the same, and 6 of the 7 cities have the same top maker.
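The overlap claims can be checked programmatically. The sketch below uses a made-up matrix `toy.top3` with the same layout as `top3.makers` (rows 1 to 3 by owner, rows 4 to 6 by dealer, one column per city); the city and maker values are hypothetical.

```r
# Toy matrix shaped like top3.makers; values are made up
toy.top3 <- cbind(
  cityA = c("ford", "toyota", "honda", "ford", "chevrolet", "toyota"),
  cityB = c("honda", "toyota", "ford", "toyota", "honda", "nissan")
)

# Number of makers shared between the owner and dealer top 3, per city
shared <- apply(toy.top3, 2, function(m) length(intersect(m[1:3], m[4:6])))
shared

# Whether the top maker is the same for owners and dealers, per city
same.top <- toy.top3[1, ] == toy.top3[4, ]
same.top
```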
set.caption("Summary of Vehicle Years")
summary(vposts$year)
The summary of the vehicle years suggests some problems with the minimum and maximum observations.
old.observation <- vposts[vposts$year == min(vposts$year, na.rm = T),
c("title", "header", "year")]
set.caption("The observation of oldest year")
old.observation
The value entered for year is most likely to mean 2004 because of the "04", so I changed it to 2004.
vposts[vposts$year == min(vposts$year, na.rm = T),]$year <- 2004
boxplot(vposts$year, horizontal = T, main="Vehicle Years")
Once again, the minimum value, 1900, appears suspicious, with no observations having years between 1901 and 1920.
old.observation <- vposts[vposts$year == min(vposts$year, na.rm = T),
c("body", "description", "price", "year")]
set.caption("The new observations of oldest year")
# Don't show the duplicates of "posted64811", to save space
old.observation[c("posted30411", "posted64811"),]
(Duplicates of the observation "posted64811" are omitted.)
These observations are not even cars: the first is a literal set of wheels, and the others are offers to buy cars. As a result, I removed them.
vposts <- vposts[ - which(vposts$year == min(vposts$year, na.rm = T)),]
The equally suspicious maximum value, 2022, is examined next.
new.observation <- vposts[vposts$year == max(vposts$year, na.rm = T), c("title", "year")]
set.caption("The observation of newest year")
new.observation
The source of this observation, https://newyork.craigslist.org/que/ctd/5218261938.html, contains an image of the vehicle that shows it closely resembles the 2010 model of the same car as seen on Wikipedia here: https://en.wikipedia.org/wiki/File:2010_Honda_Odyssey_EX_--_12-03-2009.jpg.
| Image of vehicle from post | Image of 2010 model from Wikipedia |
|---|---|
| (image not available) | (image not available) |
From this similarity, it seems likely that the vehicle is actually
from 2011 and that the value 2022 was a mistake, possibly due to the
adjacency of 1 and 2 on keyboards, so I changed the value to 2011.
vposts[vposts$year == max(vposts$year, na.rm = T),]$year <- 2011
summary(vposts$year)
The variable `year` now contains only reasonable values.
The boxplots show that the distribution of age differs between vehicles sold by owners and vehicles sold by dealers in every city represented in the dataset. In particular, the median year of vehicles sold by dealers is higher than the median year of vehicles sold by owners, and there are more older vehicles for sale by owners. This may be because dealers are less likely to buy older cars for resale, since most older cars are less attractive to customers and less lucrative.
vp.split.city <- split(vposts, vposts$city)
# invisible to suppress useless printing
invisible(sapply(names(vp.split.city), function(a.city){
# Subset each data frame in the list by byOwner
city.df <- vp.split.city[[a.city]]
city.byOwner <- subset(city.df, byOwner == "TRUE")
city.byDealer <- subset(city.df, byOwner == "FALSE")
boxplot(city.byOwner$year, city.byDealer$year, horizontal = TRUE,
names = c("By Owner", "By Dealer"), cex=0.7)
title(main=paste("Years of Vehicles in", a.city))
}))
map("state")
with(vposts, points(long, lat, pch=".", col=pal7[factor(vposts$city)]))
legend("bottomright", legend=levels(vposts$city), col = pal7, fill=pal7, xpd=T, cex = 0.7)
Though only seven cities are included in the dataset, the locations of the individual vehicles span nearly the entire country. This may be attributable in part to Craigslist not offering individual sections for every city, causing posts from nearby smaller cities to be included. A potential explanation for the inclusion of posts in locations relatively far from the associated city is that some users might post in the section for a larger but more distant city because it attracts more users, especially if they do not receive any offers after posting in the section for a closer but less popular city. However, this idea does not account for the posts in areas like Florida and Texas, which would almost certainly contain large cities with popular sections on Craigslist that are much closer to the poster. Instead, these posts may reflect the locations of dealers that reside in a city other than the city that contains the dealership.
fuel.props <- sort(table(vposts$fuel, useNA = "always"), decreasing = T) /
length(vposts$fuel)
names(fuel.props)[is.na(names(fuel.props))] <- "NA" # Use string for display
barp <- barplot(fuel.props, cex.names = 1, xlab = "Fuel Type",
main = "Proportion of Vehicles with Fuel Type",
ylab = "Proportion", ylim = c(0,1))
text(x=barp, y=fuel.props, pos=3, labels = round(fuel.props,3))
The plot shows that the vast majority of posted vehicles use gas.
drive.props <- sort(table(vposts$drive, useNA = "always"), decreasing = T) /
length(vposts$drive)
names(drive.props)[is.na(names(drive.props))] <- "NA" # Use string for plots
barp <- barplot(drive.props, cex.names = 1, xlab = "Drive Type",
main = "Proportion of Vehicles with Drive Type",
ylab = "Proportion", ylim = c(0,1))
text(x=barp, y=drive.props, pos=3, labels = round(drive.props,3))
Nearly half of the observations have no value for drive type, so the most popular drive type is not clear.
trans.props <- sort(table(vposts$transmission, useNA = "always"), decreasing = T) /
length(vposts$transmission)
names(trans.props)[is.na(names(trans.props))] <- "NA" # Use string for plots
barp <- barplot(trans.props, cex.names = 1, xlab = "Transmission Type",
main = "Proportion of Vehicles with Transmission Type",
ylab = "Proportion", ylim = c(0,1))
text(x=barp, y=trans.props, pos=3, labels = round(trans.props,3))
The plot shows that automatic transmission is the most popular.
type.props <- sort(table(vposts$type, useNA = "always"), decreasing = T) /
length(vposts$type)
names(type.props)[is.na(names(type.props))] <- "NA" # Use string for plots
barp <- barplot(type.props, cex.names = 1, main = "Proportions of Vehicle Types",
ylab = "Proportion", ylim = c(0,1), las=2)
text(x=barp, y=type.props, pos=3, labels = round(type.props,2))
Though nearly half the values are missing, sedans and SUVs are clearly the most popular vehicle types among vehicles with a listed type.
summary(vposts$odometer)
obs.max.odometer <- vposts[ !is.na(vposts$odometer) & vposts$odometer == max(vposts$odometer,
na.rm = TRUE), c("title", "odometer", "year")]
set.caption("The observation of maximum odometer value")
t(obs.max.odometer)
The maximum odometer value is unreasonably large, and is also just the digits in the sequence they appear on a keyboard, so I replaced the value with NA.
vposts[ !is.na(vposts$odometer) & vposts$odometer == max(vposts$odometer, na.rm = TRUE),
"odometer"] <- NA
obs.max.odometer <- vposts[ !is.na(vposts$odometer) & vposts$odometer == max(vposts$odometer,
na.rm = TRUE), c("title", "odometer", "year")]
set.caption("The new observation of maximum odometer value")
t(obs.max.odometer)
vposts[ !is.na(vposts$odometer) & vposts$odometer == max(vposts$odometer, na.rm = TRUE),
"odometer"] <- NA
obs.max.odometer <- vposts[ !is.na(vposts$odometer) & vposts$odometer == max(vposts$odometer,
na.rm = TRUE), c("title", "odometer", "year")]
set.caption("The new observations of maximum odometer value")
t(obs.max.odometer)
vposts[ !is.na(vposts$odometer) & vposts$odometer == max(vposts$odometer, na.rm = TRUE),
"odometer"] <- NA
obs.max.odometer <- vposts[ !is.na(vposts$odometer) & vposts$odometer == max(vposts$odometer,
na.rm = TRUE), c("title", "description", "odometer")]
set.caption("The new observation of maximum odometer value")
t(obs.max.odometer)
According to https://www.yahoo.com/autos/s/the-first-car-to-3-million-miles-.html?nf=1, a car with one of the highest known odometer values in 2011 was at about 3000000 miles. That vehicle was from 1966, only two years after 1964. It seems highly unlikely that a vehicle from nearly the same year would have three times as many miles, so I changed the value to NA.
vposts[ !is.na(vposts$odometer) & vposts$odometer == max(vposts$odometer, na.rm = TRUE),
"odometer"] <- NA
boxplot(vposts$odometer, horizontal = TRUE, xlab="Odometer")
The boxplot shows that all the remaining odometer values are no greater than 3000000.
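The repeated "set the current maximum to NA" steps above could be wrapped in a small helper to avoid retyping the index expression. This is only a sketch: `drop.max` is a hypothetical name, the values are made up, and in the analysis above each maximum was inspected by hand before being removed, which the helper does not replace.

```r
# Hypothetical helper: set every occurrence of the current maximum to NA
drop.max <- function(x) {
  x[!is.na(x) & x == max(x, na.rm = TRUE)] <- NA
  x
}

odo <- c(52000, 1234567890, 99999999, 130000, NA)  # made-up values
odo <- drop.max(drop.max(odo))                     # remove the two largest
max(odo, na.rm = TRUE)
```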
with(vposts, smoothScatter(year, odometer))
The plot shows that while a few vehicles with about a million miles are from the 50s and 60s, most are actually from the years around 2000. Because it is rare for a vehicle to last so many miles, this may indicate some problems with the values.
quantile(vposts$year, seq(0.1, 1, 0.1))
Because only 10% of the vehicles were made in 1996 or earlier, I define these vehicles to be old.
old.vehicles <- vposts[ !is.na(vposts$year) & vposts$year <= 1996, ]
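The 1996 cutoff is read from the first decile printed above. A minimal sketch of deriving such a cutoff programmatically instead of hard-coding it, on made-up years; note `na.rm = TRUE`, which `quantile()` needs whenever a value is missing.

```r
# Made-up years for illustration only
years <- c(1985, 1996, 2001, 2003, 2005, 2006, 2008, 2010, 2012, 2014, NA)

cutoff <- quantile(years, probs = 0.1, na.rm = TRUE)  # 10th percentile
is.old <- !is.na(years) & years <= cutoff             # flag old vehicles
sum(is.old)
```

Because the default quantile method interpolates between data points, the cutoff need not be an observed year, and ties at the cutoff can put slightly more than 10% of vehicles in the "old" group.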
old.vehicle.maker.props <- sort(table(old.vehicles$maker, useNA = "always") /
length(old.vehicles$maker), decreasing = TRUE)
head(old.vehicle.maker.props, 10)
Chevrolet and Ford are the only makers with more than 10% of the old vehicles, together accounting for more than a third of the vehicles.
boxplot(old.vehicles$price, horizontal = TRUE, main="Prices of Old Vehicles", xlab="Price ($)")
summary(old.vehicles$price)
The distribution of the prices of these vehicles is skewed, with 75% of them priced at 6500 or below, while some vehicles have much higher prices.
One important variable might be whether a vehicle has actually been sold. This data would be useful to determine how the vehicles that are actually sold differ, if at all, from all the vehicles that are posted. This variable could not be deduced from the other variables, and it may not even be possible to obtain it from the site, if vehicles that are sold simply have their posts closed and made inaccessible.
clean.conditions <- vposts$condition
conditions <- as.vector(sort(table(clean.conditions, useNA = "always") /
length(clean.conditions), decreasing = T))
names(conditions) <- names(sort(table(clean.conditions, useNA = "always"),
decreasing = T))
names(conditions)[is.na(names(conditions))] <- "NA"
barplot(conditions, las=2, cex.names = 0.5, cex.axis = 0.7)
title(main="Vehicle Conditions", ylab="Proportion with Condition")
Many of the conditions were used to describe very few vehicles, often only 1. Because these conditions will not lead to useful conclusions, only the more common conditions will be considered.
top.conditions <- conditions[conditions >= .001]
# Subset with only the top conditions
vp.top.conditions <- vposts[vposts$condition %in% names(top.conditions), ]
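If `condition` is stored as a factor (an assumption; the text does not say), the subset above still carries the rare levels as empty categories, and they would appear as blank rows in a lattice plot. A minimal sketch of filtering by proportion and then dropping unused levels, on a made-up factor with a threshold chosen for the toy data:

```r
# Toy factor with one rare level, for illustration only
cond <- factor(c(rep("good", 6), rep("excellent", 3), "needs tlc"))

props  <- table(cond) / length(cond)
common <- names(props)[props >= 0.2]  # threshold chosen for the toy data

# Keep only common conditions, then discard the now-empty levels
cond.top <- droplevels(cond[cond %in% common])
levels(cond.top)
```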
bwplot(condition ~ odometer, data = vp.top.conditions,
main="Vehicle Condition and Odometer", xlab = "Odometer", ylab = "Condition")
The boxplots show that vehicles with conditions like "excellent", "new", or "like new" tend to have lower odometer values than vehicles with conditions like "fair" or "good". The "certified" condition shows the least range of odometer values, likely due to having comparatively few observations. Curiously, the "used" condition seems to have relatively low odometer values, with a distribution similar to "like new".