For those of you outside of California, the school system for the city of Los Angeles has a reputation of both being underfunded and also overfunded. Many people argue that we should be spending more per pupil while others cite gross incompetence of the management. A couple months ago, the teachers for LAUSD went on strike demanding more funding and smaller class sizes and management folded after a short while. Eventually LAUSD officials supposedly asked for money from the city’s emergency fund to cover the new expenditures that were agreed to during the strike and the city declined. Measure EE was created to add $6B in funding over 12 years to the school system by taxing property at various rates.
This led to some angry rhetoric on both sides; with figures being thrown around with seemingly no source. I would read through what I could to better understand the situation as I did not read the measure or follow it closely. Eventually I came across the following graph which sort of horrified me:
Not to pick sides, but I tried to explain to them why their analysis is inherently flawed for countless reasons. I wanted to then see if I could replicate it and perform proper analysis.
California actually supplies a lot of data for various departments online and is quite accessible. The issue comes in finding the exact data that the person in question used. I was ultimately unsuccessful, however I did find a nice repository of similiar information here: LAUSD Open Data.
The data sets that I used can be found in the GitLab repository that is linked at the top of this post.
After reading in the data and pairing it using the Funds.Center variable. A Cost Per Pupil was calculated using the cohort size divided by the total aggregated cost for the location. Naturally this isn’t really correct as the cohort likely only represents the population that is eligible to retire, it is simple enough for now until we can find more robust data. We were able to produce our first plot:
Yikes! That doesn’t look too good. Clearly something is wrong here, and in our quest for answers we can try to get rid of some of these problem locations by identifying our outliers.
> head(plot_dat[order(plot_dat$CostPerPupil,decreasing = T),]) CostCenterCode LocationCode CDSCode School 9 1194901 1949 1.964734e+13 Marlton School 6 1194101 1941 1.964734e+13 Benjamin Banneker Career and Transition Center 8 1194801 1948 1.964734e+13 Charles Leroy Lowman Special Education Center 2 1191001 1910 1.964733e+13 Joaquin Miller Career and Transition Center 3 1191401 1914 1.964733e+13 Joseph Pomeroy Widney Career Preparatory & Transition Center 12 1195701 1957 1.964734e+13 Ernest P Willenberg Special Education Center SchoolYear NumberInCohort NumberOfGraduates CohortGradRate Total.Cost CostPerPupil 9 2017-18 11 5 0.455 1454765 132251.36 6 2017-18 16 0 0.000 1388089 86755.56 8 2017-18 15 0 0.000 1249432 83295.47 2 2017-18 35 0 0.000 1752248 50064.23 3 2017-18 40 0 0.000 1914170 47854.25 12 2017-18 35 0 0.000 1413495 40385.57 >
Interesting! As it turns out, the Marlton School is a special school in LAUSD that serves deaf and hard of hearing students. I’m not sure about the remaining schools on the list, but I would imagine that the need for more money makes sense. Using some R magic, we’ll remove all of our outliers and plot again:
It looks like there isn’t much of a relationship here at all. Although we could probably guess, let’s see what we get for a correlation coefficient:
> cor(plot_dat2$CostPerPupil,plot_dat2$CohortGradRate)  -0.1283272
So it doesn’t look like there isn’t any sort of relationship that exists between spending and graduation rates. However, we were able to do our analysis in a reproducible manner and is not some sort of excel beast. It’s not a matter of which way you lean personally on the matter, but incorrect analysis is just as dangerous as good analysis is insightful – even though we get the same result in this instance.
In the future I’d like to take a look at why there is seemingly two clusters. Stay tuned!
# Set Working Directory # Load Graduation Data graduation_dat = read.csv("Data/CohortGradRates20172018.csv",header=T) # Load Fiscal Data fiscal_dat = read.csv("Data/Budget Transparency Detailed Report FN0 2019.csv",header=T) # Explore Which Columns we need names(fiscal_dat) # We'll use "Funds.Center" to help link our fiscal data to our graduation data which shares the same codes. First we will need to aggregate all the fiscal data for each school. fiscal_dat_agg = aggregate(as.numeric(fiscal_dat$Total.Cost), by=list(Category=fiscal_dat$Funds.Center),FUN=sum) # Rename our columns for clarity names(fiscal_dat_agg) = c("CostCenterCode","Total.Cost") # Merge our DFs plot_dat = merge(graduation_dat,fiscal_dat_agg,by="CostCenterCode") # Create Cost Per Pupil plot_dat = cbind(plot_dat,plot_dat$Total.Cost/plot_dat$NumberInCohort) names(plot_dat) = "CostPerPupil" # Plot our information plot(plot_dat$CostPerPupil,plot_dat$CohortGradRate, pch=20) # what are those crazy high values? head(plot_dat[order(plot_dat$CostPerPupil,decreasing = T),]) # Let's get rid of them for now and discuss them later bp_costperpupil = boxplot(plot_dat$CostPerPupil) plot_dat2 = plot_dat[-which(plot_dat$CostPerPupil>bp_costperpupil$stats),] # Plotting again... plot(plot_dat2$CostPerPupil,plot_dat2$CohortGradRate, pch=20, main = "Cost Per Pupil (Derived) verus Cohort Graduation Rate (Reported)", ylab = "Graduation Rate (%)", xlab = "Cost Per Pupil ($USD)" ) # Doesn't look good, but what is our Correlation Coefficient cor(plot_dat2$CostPerPupil,plot_dat2$CohortGradRate)