Dataset Overview
The data used in this report was collected as part of the U.S. Census
Bureau’s American Community Survey (ACS), an ongoing survey that
collects social, economic, and housing information. This information is
used by communities, businesses, and the state and federal government to
make decisions about the coming year. The 5-year estimates (2019-2023)
were chosen for this report as they are the most reliable dataset
available for county-level research, especially when it comes to smaller
or rural communities, where 1-year estimates may be less reliable due to
large sampling variability. We combined two datasets, one with
information on median household income and another with the highest
level of educational attainment, into a single merged dataset in which
each row represents one Wisconsin county.
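The merge step described above can be sketched in base R as follows. This is a minimal illustration rather than the exact code used for this report: the data frames `income` and `education`, the column names `county` and `median_income`, and all values shown are hypothetical placeholders (only `pct_bachelors` appears in our model output).

```r
# Hypothetical stand-ins for the two ACS extracts. The real files, column
# names, and values are not reproduced here; these numbers are made up.
income <- data.frame(
  county = c("Adams", "Ashland", "Barron"),
  median_income = c(55000, 52000, 61000)
)
education <- data.frame(
  county = c("Barron", "Adams", "Ashland"),
  pct_bachelors = c(21.5, 14.2, 23.8)
)

# Inner join on county name: one row per county found in both tables.
counties <- merge(income, education, by = "county")

# Sanity checks: the join should neither drop nor duplicate counties.
stopifnot(nrow(counties) == nrow(income))
stopifnot(!anyDuplicated(counties$county))

print(counties)
```

Checking the row count after the merge is a cheap way to catch county names that are spelled differently in the two source tables.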
Key Variables
- County (72 counties in Wisconsin)
- Median Household Income
- Educational Attainment (percentage of county residents aged 25+ with
a bachelor’s degree or higher)
Factors in Interpreting Results
Five-Year Estimate
The ACS results used in this report reflect a five-year average. This
means they smooth out short-term fluctuations in median income or
educational attainment over the five-year time frame, masking sudden
changes.
Correlation vs. Causation
This analysis identifies correlation, not causation. A strong
relationship between education and income does not necessarily mean that
higher education causes higher income. Median household income is
influenced by many factors, including:
- County Industry
- Cost of Living
- Demographics (Age, Household Composition, Race, Gender)
- Job Market
- Government Policies (Minimum Wage, Tax Policies)
- Inflation
These factors are not included in our model, but may help explain the
remaining variation.
Report Overview
In the rest of this report, we will summarize the distributions of
educational attainment and median household income across counties using
numerical and graphical methods. We will construct a scatterplot of
median household income versus the percent of adults with at least a
bachelor’s degree and overlay a fitted regression line to visualize the
trend. We will then fit a simple linear regression model and test
whether educational attainment is a statistically significant predictor
of median household income at the county level.
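The summaries and scatterplot outlined above could be produced along these lines. This is a sketch under stated assumptions: the data frame `counties` is simulated placeholder data for 72 counties, not the ACS values, and the column name `median_income` is our own label for illustration.

```r
set.seed(1)

# Simulated placeholder data for 72 counties; NOT the ACS estimates.
counties <- data.frame(pct_bachelors = runif(72, min = 10, max = 50))
counties$median_income <- 45000 + 700 * counties$pct_bachelors +
  rnorm(72, sd = 8000)

# Numerical summaries of each variable.
print(summary(counties$pct_bachelors))
print(summary(counties$median_income))

# Graphical summaries: one histogram per variable.
hist(counties$pct_bachelors,
     main = "Percent with Bachelor's Degree or Higher",
     xlab = "Percent of residents aged 25+")
hist(counties$median_income,
     main = "Median Household Income",
     xlab = "Dollars")

# Scatterplot with the fitted least-squares line overlaid.
plot(median_income ~ pct_bachelors, data = counties,
     xlab = "Percent with bachelor's degree or higher",
     ylab = "Median household income")
fit <- lm(median_income ~ pct_bachelors, data = counties)
abline(fit, col = "blue", lwd = 2)
```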
To determine whether this relationship is statistically significant, we
will test the null hypothesis against an alternative hypothesis.
The null hypothesis is that the true slope of the line of best fit is
equal to zero:
\[
H_{0}: \beta_{1} = 0
\]
The alternative hypothesis is that the true slope of the line of best
fit is not equal to zero:
\[
H_{a}: \beta_{1} \neq 0
\]
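For reference, the standard test statistic for these hypotheses is the estimated slope divided by its standard error. Under the null hypothesis and the usual regression assumptions, it follows a t distribution with n - 2 degrees of freedom. The value of 70 below assumes that all 72 counties enter the model with no missing values:
\[
t = \frac{\hat{\beta}_{1}}{\mathrm{SE}(\hat{\beta}_{1})}, \qquad \mathrm{df} = n - 2 = 72 - 2 = 70
\]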
Models and Graphing
When determining whether linear regression is appropriate for a data
set, we must first confirm that the relationship between the two
variables appears roughly linear. In the graph below, the relationship
does appear to be linear, so we may proceed with regression.

Using R’s built-in lm() function, we are able to generate a summary of
almost all of the crucial information we need to assess how the
percentage of the population with at least a bachelor’s degree relates
to the median household income of a county.
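A minimal sketch of this step is shown below. The `lm()` and `summary()` calls are standard R (from the stats package, loaded by default), but the data here are simulated placeholders, and `median_income` is an assumed column name, so the printed numbers will not match the results reported in this section.

```r
set.seed(42)

# Simulated placeholder data for 72 counties; NOT the ACS estimates.
counties <- data.frame(pct_bachelors = runif(72, min = 10, max = 50))
counties$median_income <- 45000 + 700 * counties$pct_bachelors +
  rnorm(72, sd = 8000)

# Fit the simple linear regression: income as a function of education.
fit <- lm(median_income ~ pct_bachelors, data = counties)

# The summary reports coefficient estimates, standard errors, t values,
# p-values (the Pr(>|t|) column), and R-squared.
fit_summary <- summary(fit)
print(fit_summary)
```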
A key component of this summary is the Pr(>|t|) entry in the
pct_bachelors row. This is the p-value for the test of whether the
slope relating the percent of the county population with at least a
bachelor’s degree to the median household income of the county is zero.
Our results gave us a p-value of 1.68e-12, which is well below the
standard significance level of 0.05. We therefore have sufficient
evidence to reject the null hypothesis, meaning the percentage of the
county population with a bachelor’s degree or higher is significantly
associated with the median household income.
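The decision rule described above can be written out explicitly. The sketch below pulls the p-value from the coefficient table and recomputes it by hand from the t statistic. It runs on simulated placeholder data (with `median_income` as an assumed column name), so it will not reproduce the 1.68e-12 reported here.

```r
set.seed(42)

# Simulated placeholder data for 72 counties; NOT the ACS estimates.
counties <- data.frame(pct_bachelors = runif(72, min = 10, max = 50))
counties$median_income <- 45000 + 700 * counties$pct_bachelors +
  rnorm(72, sd = 8000)
fit <- lm(median_income ~ pct_bachelors, data = counties)

# Coefficient table: one row per term, one column per statistic.
coefs <- coef(summary(fit))
p_value <- coefs["pct_bachelors", "Pr(>|t|)"]

# Compare against the significance level.
alpha <- 0.05
reject_null <- p_value < alpha

# The same p-value from the t statistic: two-sided tail area of a
# t distribution with n - 2 residual degrees of freedom.
t_stat <- coefs["pct_bachelors", "t value"]
p_manual <- 2 * pt(abs(t_stat), df = df.residual(fit), lower.tail = FALSE)

print(c(p_value = p_value, p_manual = p_manual))
print(reject_null)
```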
A limitation of this model is its R-squared value of 0.5047, which
indicates that 50.47% of the variation in median household income can
be explained by the percentage of people with at least a bachelor’s
degree, leaving about half of the variation unexplained. While this
isn’t ideal, it does not rule out the use of a linear model.
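To make the meaning of R-squared concrete, the sketch below computes it three ways: from the model summary, from the sums of squares, and as the squared correlation (the last holds for simple linear regression with an intercept). The data are simulated placeholders with an assumed `median_income` column, so the value will differ from 0.5047.

```r
set.seed(42)

# Simulated placeholder data for 72 counties; NOT the ACS estimates.
counties <- data.frame(pct_bachelors = runif(72, min = 10, max = 50))
counties$median_income <- 45000 + 700 * counties$pct_bachelors +
  rnorm(72, sd = 8000)
fit <- lm(median_income ~ pct_bachelors, data = counties)

# R-squared as reported by summary().
r_squared <- summary(fit)$r.squared

# R-squared by hand: 1 - (unexplained variation / total variation).
ss_res <- sum(residuals(fit)^2)
ss_tot <- sum((counties$median_income - mean(counties$median_income))^2)
r_squared_manual <- 1 - ss_res / ss_tot

# With a single predictor, R-squared equals the squared correlation.
r <- cor(counties$pct_bachelors, counties$median_income)

print(c(summary = r_squared, manual = r_squared_manual, r_sq = r^2))
```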
Another way to check the model’s assumptions is to examine both the
residual plot and the QQ plot. The residual plot shows whether a linear
relationship with constant variance is reasonable, and the QQ plot
shows how closely the residuals follow a normal distribution.

The residual plot above shows a relatively random scatter of residuals
around zero. However, the residuals are more tightly clustered at lower
fitted values and more spread out at higher fitted values, which can
indicate non-constant variance (heteroscedasticity).
The QQ plot shows a mostly linear trend, indicating that the residuals
are approximately normally distributed.
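The two diagnostic plots discussed above can be drawn with base R as sketched below, again on simulated placeholder data rather than the ACS values (`median_income` is an assumed column name). The comparison of residual spread in the lower and upper halves of the fitted values is our own rough check for illustration, not a formal test.

```r
set.seed(42)

# Simulated placeholder data for 72 counties; NOT the ACS estimates.
counties <- data.frame(pct_bachelors = runif(72, min = 10, max = 50))
counties$median_income <- 45000 + 700 * counties$pct_bachelors +
  rnorm(72, sd = 8000)
fit <- lm(median_income ~ pct_bachelors, data = counties)

# Draw the two plots side by side.
par(mfrow = c(1, 2))

# Residuals vs. fitted: look for curvature or a funnel shape.
plot(fitted(fit), residuals(fit),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs. Fitted")
abline(h = 0, lty = 2)

# Normal QQ plot of the residuals: points near the line suggest normality.
qqnorm(residuals(fit))
qqline(residuals(fit))

par(mfrow = c(1, 1))

# Rough spread check: residual standard deviation for counties with
# fitted values at or below the median versus above it.
lower_half <- fitted(fit) <= median(fitted(fit))
sd_low <- sd(residuals(fit)[lower_half])
sd_high <- sd(residuals(fit)[!lower_half])
print(c(sd_low = sd_low, sd_high = sd_high))
```

Calling plot(fit, which = 1:2) produces R's built-in versions of the same two plots.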
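Based on the results of these checks, we conclude that the linear
model is reasonable for this data set and that its assumptions are
approximately met, although the increasing spread in the residual plot
warrants some caution. The extremely low p-value indicates a
statistically significant relationship, the R-squared value indicates
an acceptable fit, and the QQ and residual vs. fitted plots support the
model assumptions.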