
Salaries in the city of Norfolk municipality

A case study of hierarchical Bayesian modeling.


Data

The data is a list of employees. We consider here the salary as the random variable of interest and department and employee status as explanatory variables.

df
| Row | Department | Position Title | Employee Status | Initial Hire Date | Date in Position | Fair Labor Standards Act (FLSA) | Base Salary |
|---|---|---|---|---|---|---|---|
| | String | String | String | String | String | String | Float64 |
| 1 | CF - MacArthur Memorial | Museum Attendant | Casual Part-time | 06/04/2007 | 06/04/2007 | Nonexempt | 12.51 |
| 2 | CF - MacArthur Memorial | Museum Attendant | Casual Part-time | 11/13/2006 | 11/13/2006 | Nonexempt | 12.51 |
| 3 | CF - MacArthur Memorial | Museum Attendant | Permanent Full-time | 09/23/2020 | 09/23/2020 | Nonexempt | 26200.0 |
| 4 | CF - MacArthur Memorial | Museum Attendant | Permanent Full-time | 01/07/2019 | 01/07/2019 | Nonexempt | 26200.0 |
| 5 | CF - MacArthur Memorial | Administrative Technician | Permanent Full-time | 09/23/2015 | 03/19/2018 | Nonexempt | 33454.1 |
| 6 | CF - MacArthur Memorial | Curator | Permanent Full-time | 11/03/2014 | 11/03/2014 | Exempt | 55193.6 |
| 7 | CF - MacArthur Memorial | Education Manager | Permanent Full-time | 10/29/2007 | 11/23/2009 | Exempt | 59534.6 |
| 8 | CF - MacArthur Memorial | Archivist | Permanent Full-time | 06/01/1994 | 05/15/1996 | Exempt | 61718.1 |
| 9 | CF-Cultural&Convention Center | Ticket Seller | Intermittent Temporary | 12/30/2004 | 12/30/2004 | Nonexempt | 11.91 |
| 10 | CF-Cultural&Convention Center | Ticket Sales Supervisor | Intermittent Temporary | 10/18/2002 | 10/18/2002 | Nonexempt | 14.59 |
| … | … | … | … | … | … | … | … |
| 4399 | Zoo-Veterinary & Wellness Camp | Veterinarian | Permanent Full-time | 09/26/2020 | 09/26/2020 | Exempt | 75000.0 |

Some of the salaries are per hour, while others are per year. Fortunately, the highest hourly rate is much lower than the lowest yearly salary, hence we can ‘fix’ the data by scaling up hourly rates by the number of work hours per year (≈ 2000).

We find the threshold separating hourly and yearly salaries as the geometric mean of the two ends of the longest interval between subsequent sorted salaries:

731.0526656814815
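
A minimal sketch of this threshold computation, assuming the gaps are measured as plain differences between neighbouring sorted values (the notebook's own cell is not reproduced here). The helper name `salary_threshold` and the toy values, copied from the table preview, are illustrative only, so the result on them differs from the value reported for the full data.

```julia
# Hypothetical helper: geometric mean of the two ends of the widest gap
# between consecutive sorted salaries.
function salary_threshold(salaries::AbstractVector{<:Real})
    s = sort(salaries)
    gaps = diff(s)                 # differences between neighbouring values
    i = argmax(gaps)               # index of the widest gap
    return sqrt(s[i] * s[i + 1])   # geometric mean of its two ends
end

# Toy values taken from the table preview, not the full data set.
raw = [12.51, 12.51, 26200.0, 26200.0, 33454.1, 55193.6, 11.91, 14.59]
threshold = salary_threshold(raw)

hours_per_year = 2000              # approximate figure used in the text
# Values below the threshold are treated as hourly rates and scaled up.
yearly = [x < threshold ? x * hours_per_year : x for x in raw]
```

On a heavy-tailed data set the widest absolute gap could also fall between two very high salaries, so it may be safer to inspect the sorted values before trusting the result.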

With hourly salaries multiplied by the number of work hours in a year, the distribution of salaries is unimodal.


The plots show that the distribution of all salaries looks more like a Gaussian bell curve on the log scale. We therefore log-transform the data and then standardize it to zero mean and unit variance.
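
The transformation can be sketched as follows. The values are toy yearly salaries based on the table preview, and the variable names are assumptions rather than the notebook's own.

```julia
using Statistics

# Toy yearly salaries (hourly rates already scaled); not the full data set.
yearly = [25020.0, 26200.0, 33454.1, 55193.6, 59534.6, 61718.1, 75000.0]

log_salary = log.(yearly)                        # log-transform
z = (log_salary .- mean(log_salary)) ./ std(log_salary)  # standardize
```

Here `std` uses the sample (n − 1) correction, so `var(z)` equals 1 up to rounding.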


For convenience, the data is transformed into three arrays:

  • salary: the target variable,

  • department: an explanatory variable,

  • status: employee status, another explanatory variable.

We collect the data in two forms: with dependence only on the department, and on both the department and the employee status.
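
One possible layout for these arrays and the two groupings is sketched below. The rows are made-up toy data, and encoding categories as integer codes is an assumption; the notebook may organize the data differently.

```julia
# Toy rows: (department, status, standardized log-salary). Values are made up.
rows = [
    ("CF - MacArthur Memorial", "Casual Part-time", -1.3),
    ("CF - MacArthur Memorial", "Permanent Full-time", -0.9),
    ("CF - MacArthur Memorial", "Permanent Full-time", 0.4),
    ("CF-Cultural&Convention Center", "Intermittent Temporary", -1.1),
    ("Zoo-Veterinary & Wellness Camp", "Permanent Full-time", 1.0),
]

salary = [r[3] for r in rows]

# Encode each category as an integer code in order of first appearance.
dept_names = unique([r[1] for r in rows])
status_names = unique([r[2] for r in rows])
department = [findfirst(==(r[1]), dept_names) for r in rows]
status = [findfirst(==(r[2]), status_names) for r in rows]

# Form 1: salaries grouped by department only.
by_dept = Dict{Int,Vector{Float64}}()
for (d, y) in zip(department, salary)
    push!(get!(by_dept, d, Float64[]), y)
end

# Form 2: salaries grouped by (department, status).
by_dept_stat = Dict{Tuple{Int,Int},Vector{Float64}}()
for (d, s, y) in zip(department, status, salary)
    push!(get!(by_dept_stat, (d, s), Float64[]), y)
end
```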


Models

We define two models, model_by_dept and model_by_dept_stat. model_by_dept is conditioned only on the department. model_by_dept_stat is conditioned on both the department and on employee status.


Model by department only

To speed up the inference, instead of conditioning on each employee individually, we collect mean, variance, and count for each department. For a normal distribution, the empirical mean m and empirical variance s² of n samples are distributed as (see Wikipedia):

$$m \sim \mathrm{Normal}\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$$

$$s^2 \sim \mathrm{Gamma}\left(\frac{n-1}{2}, \frac{2\sigma^2}{n-1}\right)$$
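
For reference, the Gamma form of the variance follows from the standard chi-squared result for normal samples. The short derivation below assumes that Normal is parametrized by mean and standard deviation and Gamma by shape and scale, which is how the two formulas are read here.

```latex
\begin{align*}
\frac{(n-1)\,s^2}{\sigma^2} &\sim \chi^2_{n-1}
    = \mathrm{Gamma}\!\left(\frac{n-1}{2},\, 2\right), \\
c \cdot \mathrm{Gamma}(k, \theta) &= \mathrm{Gamma}(k,\, c\,\theta)
    \quad \text{for } c > 0, \\
s^2 = \frac{\sigma^2}{n-1} \cdot \frac{(n-1)\,s^2}{\sigma^2}
    &\sim \mathrm{Gamma}\!\left(\frac{n-1}{2},\, \frac{2\sigma^2}{n-1}\right).
\end{align*}
```

As a sanity check, the mean of this Gamma distribution is shape times scale, which gives σ², the expected value of the unbiased sample variance.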

Conditioning on individual employees would work too but would take significantly more time: we would have ≈2000 observation points instead of just 165.
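
Collecting the per-department summaries might look like the sketch below. The `GroupStats` record, the `summarize` helper, and the toy groups are hypothetical and only illustrate the idea.

```julia
using Statistics

# Hypothetical summary record for one department.
struct GroupStats
    n::Int        # number of employees
    m::Float64    # empirical mean
    s2::Float64   # empirical variance (NaN when n < 2)
end

summarize(y) = GroupStats(length(y), mean(y), length(y) > 1 ? var(y) : NaN)

# Toy groups of standardized log-salaries keyed by department code.
groups = Dict(1 => [-1.3, -0.9, 0.4], 2 => [-1.1], 3 => [1.0, 1.0])
stats = Dict(d => summarize(y) for (d, y) in groups)
```

Each department then contributes at most two observations (mean and variance) to the model, regardless of how many employees it has.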


Conditioning on variance can cause two kinds of problems. First, with fewer than 2 samples, the variance is undefined. Second, if all salaries are the same, the variance is zero, and the density of the Gamma distribution at 0 is 0 for shape k > 1.

Therefore, we only condition on variance if there are enough samples (more than 3) and the sample values are not all identical. This constraint is an indirect consequence of our use of a LogNormal approximation of the distribution of salaries.
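
The rule can be written as a small guard. Both function names below are hypothetical. The second function is the unnormalized Gamma log-density, included only to show why a zero variance is a problem: it evaluates to negative infinity and would zero out the posterior.

```julia
# Hypothetical guard: more than 3 samples and not all values identical.
condition_on_variance(n::Integer, s2::Real) = n > 3 && s2 > 0

# Unnormalized log-density of Gamma(shape k, scale θ) at x.
gamma_logkernel(x, k, θ) = (k - 1) * log(x) - x / θ

# With n = 5 and σ = 1: k = (n - 1) / 2 = 2 and θ = 2σ² / (n - 1) = 0.5.
at_zero = gamma_logkernel(0.0, 2.0, 0.5)      # log(0) drives this to -Inf
at_positive = gamma_logkernel(0.8, 2.0, 0.5)  # finite for s² > 0
```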

model_by_dept (generic function with 1 method)