Thank you for reading my thread. Experts, please pardon me if this question seems too silly, unusual, or too lengthy to answer. Your valuable time is highly appreciated, as it would guide me in solving parts (a) and (b) of my problem.
Attached is a description of the data set:
A real estate agent is trying to understand the nature of housing stock and
home prices in and around a medium sized town in upstate New York. She
has collected data from a random sample of 1047 homes sold in the last 12
months. Data was collected on the following variables, and is available in the
attached houseprices.csv file.
Price: the sale price of the house in $
Living Area: in sq. ft.
Bathrooms: number of bathrooms in the house (powder rooms with
no tub or shower area are considered 0.5 baths)
Bedrooms: the number of bedrooms
Lot Size: size of the property on which the house sits (in acres)
Age: age of the house in years
Fireplace: whether or not the house has a fireplace (Yes = 1, No = 0)
1. Prepare a brief report summarizing the home values (prices) in this area.
Use both graphical and numerical summaries. Your report should briefly
describe what those summaries tell you, and anything of particular
2. Does the normal model provide a good description of the prices? Use a
Normal Quantile plot to frame your response.
3. Irrespective of your response to Q2, assume that Price ~ N(164K, (68K)^2).
A. Calculate the following probabilities: P(Price > 92.8K) and P(Price <
255.5K). Do these numbers agree with what you see in the data?
B. Once again, assuming the above normal distribution, what
percentage of houses should have a value less than 232K? Does that
agree with the data?
C. Based on the theoretical model, what price would you expect for a
house that is exactly at the 3rd quartile (75th percentile)?
How does that compare to the actual value?
4. Create a histogram and boxplot for the Living Area variable. What does
the histogram tell you that the boxplot does not, and vice-versa? Is the
distribution symmetric? Check the skewness measure to see if it is
consistent with your observation.
5. Create a new column in the dataset by taking the logarithm of the Living
Area variable. Is the normal distribution a better fit for this variable or the
original (Living Area) variable? Why do you think this is the case?
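
For question 3, the theoretical side can be checked with a few lines of Python. This is only a rough sketch, assuming the stated model Price ~ N(164K, (68K)^2) and working in thousands of dollars. The helper `empirical_share_below` and its `prices` argument are hypothetical names of mine; `prices` would be the Price column loaded from houseprices.csv, which is not reproduced here.

```python
from statistics import NormalDist

# Assumed model from question 3, in thousands of dollars.
price_model = NormalDist(mu=164.0, sigma=68.0)

# 3A: tail probabilities under the normal model.
p_above_92_8 = 1.0 - price_model.cdf(92.8)   # P(Price > 92.8K)
p_below_255_5 = price_model.cdf(255.5)       # P(Price < 255.5K)

# 3B: share of houses expected to be below 232K (one sd above the mean).
p_below_232 = price_model.cdf(232.0)

# 3C: theoretical 3rd quartile (75th percentile) of the price.
third_quartile = price_model.inv_cdf(0.75)


def empirical_share_below(prices, threshold):
    """Fraction of observed prices strictly below threshold.

    Hypothetical helper: `prices` is any list of sale prices in the
    same units as `threshold`.
    """
    return sum(1 for p in prices if p < threshold) / len(prices)


print(f"P(Price > 92.8K)  = {p_above_92_8:.4f}")
print(f"P(Price < 255.5K) = {p_below_255_5:.4f}")
print(f"P(Price < 232K)   = {p_below_232:.4f}")
print(f"75th percentile   = {third_quartile:.1f}K")
```

Comparing each printed value with the matching fraction from `empirical_share_below` (and the theoretical quartile with the sample quartile) is one way to judge whether the normal model agrees with the data.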
1. Create the 90%, 95%, and 99% confidence intervals for the average home
price and explain what these mean. How do the margins of error for these
three confidence intervals compare? Does that make sense? Before
creating the confidence intervals, be sure to check the conditions
necessary to create confidence intervals (and briefly describe this in your
2. Your friend has asked you to provide an estimate for the 95th percentile of
home prices in this market. Which (if any) of the above confidence
intervals can you use to give an answer? Describe briefly.
3. The sample data given to you all come from home sales within the past 12
months. Suppose you had sample data of the same size each year going
back several years, and calculated the average sale price for each year.
What kind of distribution do you expect to see for these averages and
why? (Include the parameters of the distribution in your response,
assuming that the house prices don't change, i.e. go up or down, over
time. Clearly, this is not a great assumption but make it anyway.)
4. The architecture changed significantly in this geographical area about
30 years ago. So any houses aged more than 30 years are considered
old houses. What proportion of the houses in the sample is old?
Provide the 95% and 99% confidence intervals for the proportion of
old houses in this area, and interpret them. Once again, make sure
that the necessary conditions are satisfied before creating confidence
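
For the confidence interval questions in this second list, here is a minimal sketch of how the intervals could be computed. It assumes the usual large-sample (z-based) formulas, which seem reasonable for n = 1047 but are my assumption and not something the assignment states. The function names and the demo numbers are placeholders, not values from houseprices.csv.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def z_critical(level):
    """Two-sided standard normal critical value, e.g. about 1.96 for 0.95."""
    return NormalDist().inv_cdf(0.5 + level / 2.0)


def mean_confidence_interval(values, level):
    """Large-sample interval: sample mean +/- z * s / sqrt(n)."""
    margin = z_critical(level) * stdev(values) / sqrt(len(values))
    center = mean(values)
    return center - margin, center + margin


def proportion_confidence_interval(successes, n, level):
    """Large-sample interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    p_hat = successes / n
    margin = z_critical(level) * sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - margin, p_hat + margin


# Placeholder inputs only, to show the calls; replace with the real data.
demo_prices = [120.0, 150.0, 165.0, 180.0, 210.0, 140.0, 175.0, 195.0]
for level in (0.90, 0.95, 0.99):
    low, high = mean_confidence_interval(demo_prices, level)
    print(f"{level:.0%} CI for the mean: ({low:.1f}, {high:.1f})")

# Placeholder count of "old" houses out of a placeholder sample size.
for level in (0.95, 0.99):
    low, high = proportion_confidence_interval(40, 100, level)
    print(f"{level:.0%} CI for the proportion: ({low:.3f}, {high:.3f})")
```

The margin of error grows with the confidence level because the critical value grows. For a sample as small as the placeholder list, a t critical value would be more appropriate than z.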