Thursday, 13 February 2014

Sane as it ever was: The historical meaning of the crisis In Psychology

Here's my first experiment in self-publishing on figshare!

Basically it means you can cite the essay as follows:

Hasselman, Fred (2014): Sane as it ever was: The historical meaning of the crisis In Psychology. figshare. http://dx.doi.org/10.6084/m9.figshare.930729

Oh… and you get this embedded iframe code too; pretty cool so far!


Saturday, 11 January 2014

Time = Money: The Morality of Being Accurate


A Post-Publication Peer-Review (3PR) of Time, Money, and Morality

Gino, F., & Mogilner, C. (online, 2013). Time, Money, and Morality. Psychological Science. DOI: 10.1177/0956797613506438


File under:

HIBAR: Had I Been A Reviewer…

3PR: Post-Publication Peer-Review (or: 3mpirical Plausibility Resuscitation)

Performed by Fred Hasselman
Contact me if you have any questions


Introduction

The Time, Money, and Morality article has been HIBAR-ed on Twitter and in the blogosphere (e.g., by Rolf Zwaan and Greg Francis), and the discussion seems to revolve around the validity of the inferences based on p-values close to 0.05 (e.g., they raise suspicions of p-hacking).

In short, the article reports 4 experiments testing 2 core postulates:

  • Postulate 1: Priming Money activates self-interest and increases unethical behaviour
  • Postulate 2: Priming Time activates self-reflection and decreases unethical behaviour

Unethical behaviour is operationalised as taking the opportunity to cheat on a task.
Priming methods vary across experiments, as do the tasks that provide an opportunity to cheat.
Experiment 1 tests the two postulates; Experiments 2-4 assess the role of self-reflection in cheating behaviour, which is operationalised differently across experiments.

Hold on to your P-curves for a moment… Back to the basics!

In this Post-Publication Peer-Review (3PR) I demonstrate that there is indeed some cause for concern about the way these results are presented and interpreted. Was it p-hacking? I don't know, and maybe I don't even care. To me this is an example of sloppy science: p-hacked or not, these results were allowed to be published by expert peers. It is more relevant to discuss the broken system of quality control, which should have picked up on at least some of the following issues:

  1. Important information is missing:
    • in general (e.g., number of subjects per condition, sample size determination)
    • selectively across experiments (e.g., participants per cell, reporting of effect sizes)
  2. The analyses used on frequency data are inappropriate
  3. Invalid or biased inferences and oddities:
    • No adjustments for multiple comparisons
    • “Marginal significance” shifts ad hoc between 0.1 > p > 0.05
    • Obvious intervening/mediator variable is omitted: Accuracy of performance
    • No explanation of (conflicting) results across experiments (e.g., variation in amount of cheating)
    • No explanation for the failure of random assignment to design levels (none of the experiments has equal-n samples)

The article under scrutiny is by no means exceptional with respect to such issues; moreover, the way frequency / proportion data are analysed in psychological science is generally awkward and most of the time completely wrong.

I will 3PR the data based on the information in the article and comment on the results:

I. Analysis of Proportion / frequency data
II. Analysis of Extent of Cheating data
III. HAPPE-ing: Hypothesising After Post-Publication Evaluation

The R code used to generate the results (and this page) is available in this Markdown file, and this post explains how to post to a WordPress blog.


I. Analysis of proportion / frequency data

Some concerns can be raised about the significant differences in the proportion of Cheating between various conditions reported in the 4 experiments.
First and foremost, no corrections for multiple comparisons are conducted; should one apply them, just 2 significant proportion differences remain:
Money vs. Time in Experiments 1 & 4. In Experiment 3, the sample difference No Mirror: Money - Time was marginally significant in the 2nd significant digit (original: p = 0.015, adjusted \( \alpha \) = 0.013, Bonferroni).

Second, no continuity correction is applied, even though these proportions are calculated from discrete counts (participants). If a continuity correction is applied, 2-3 significant differences remain, depending on the \( \alpha \)-level chosen (see the table below and the sketch that follows it):

| Exp. | Contrast | Published p | Continuity-corrected p | vs. Bonferroni \( \alpha \) |
|------|-----------------|-------------|------------------------|-----------------------------|
| 1 | Money-Time | < .001 | \( 4 \times 10^{-4} \) | < 0.0167 |
| 1 | Money-Ctrl | < .05 | 0.0894 | > 0.0167 |
| 1 | Time-Ctrl | < .05 | 0.0836 | > 0.0167 |
| 2 | Int: Money-Time | < .01 | 0.1493 | ~ 0.0125 |
| 2 | Per: Money-Time | > .05 | 1 | > 0.0125 |
| 2 | Money: Int-Per | < .03 | 0.0856 | > 0.0125 |
| 2 | Time: Int-Per | > .05 | 1 | > 0.0125 |
| 3 | Mir: Money-Time | > .05 | 0.7996 | > 0.0125 |
| 3 | NoM: Money-Time | < .003 | 0.0293 | ~ 0.0125 |
| 3 | Money: Mir-NoM | > .05 | 0.0537 | > 0.0125 |
| 3 | Time: Mir-NoM | > .05 | 1 | > 0.0125 |
| 4 | Money-Time | < .001 | \( 10^{-4} \) | < 0.0167 |
| 4 | Money-Ctrl | < .05 | 0.0522 | > 0.0167 |
| 4 | Time-Ctrl | < .05 | 0.0752 | > 0.0167 |
| | Number sig. results | 9 | 3 | Original: 4, Continuity: 2 |
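As an illustration, here is a minimal sketch of both corrections in base R. The cheater counts below are hypothetical placeholders; the real counts follow from the reported n and %Cheat per condition.

```r
# Two-proportion test with Yates' continuity correction (the default in R).
# Counts are hypothetical placeholders standing in for one Money-Time contrast.
cheaters <- c(26, 12)  # number of cheaters in the Money and Time samples
n        <- c(35, 36)  # sample size per condition
prop.test(cheaters, n, correct = TRUE)

# Bonferroni: adjust the continuity-corrected p-values from Experiment 1 ...
p.adjust(c(0.0004, 0.0894, 0.0836), method = "bonferroni")
# ... or equivalently compare the unadjusted p-values to alpha / #contrasts:
0.05 / 3  # = 0.0167
```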

This calls for a more appropriate analysis of frequency data:

  1. Log-linear analysis of observed cell frequencies
  2. Exact odds ratios of 2x2 sub-tables to test hypotheses using Effect Size CIs

(Cheating can be considered a dichotomous response, so logistic regression could also be used, see III. HAPPE-ing)

Note:
Experiments 2 & 3 do not list n per condition, so the most likely values for n are assumed (1. closest to an integer value; 2. as equal as possible; 3. adding up to the total N); a sketch of this heuristic follows the two tables below:

Experiment 2

| Prime | Assessment | N_cond | %Cheat | N_cheat = N_cond × %Cheat | Deviation |
|-------|--------------|--------|--------|---------------------------|---------------------------|
| Money | Personality | 36 | 0.2778 | 10.0008 | \( 8 \times 10^{-4} \) |
| Time | Personality | 35 | 0.2857 | 9.9995 | \( 5 \times 10^{-4} \) |
| Money | Intelligence | 38 | 0.5 | 19 | 0 |
| Time | Intelligence | 33 | 0.303 | 9.999 | \( 10 \times 10^{-4} \) |

Experiment 3

| Prime | Assessment | N_cond | %Cheat | N_cheat = N_cond × %Cheat | Deviation |
|-------|-----------|--------|--------|---------------------------|-----------|
| Money | Mirror | 31 | 0.387 | 11.997 | 0.003 |
| Time | Mirror | 28 | 0.321 | 8.988 | 0.012 |
| Money | No Mirror | 30 | 0.667 | 20.01 | 0.01 |
| Time | No Mirror | 31 | 0.355 | 11.005 | 0.005 |
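A minimal sketch of this heuristic for one cell (the candidate range for n is my assumption):

```r
# Recover the per-condition n for the Time/Personality cell of Experiment 2:
# find n for which n * %Cheat is (nearly) an integer, then keep the candidate
# that is most balanced and adds up to the total N (36 + 35 + 38 + 33 = 142).
pct <- 0.2857          # reported %Cheat for this cell
n   <- 20:60           # plausible range of cell sizes (assumption)
dev <- abs(n * pct - round(n * pct))
head(n[order(dev)])    # 21 28 35 42 ... -> 35 also satisfies criteria 2 and 3
```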

1. Log-linear analysis of observed cell frequencies

Log-linear analysis, or Poisson regression using the generalised linear model, can be used to test whether relationships exist among the variables in a multi-way contingency table. Here I analyse the number of participants in each cell of the design: the observed frequencies take the role of the dependent variable, and the levels of the design factors, such as Mediator, Prime and Cheating, are treated as the levels of independent variables (another option would have been a logistic / probit regression with Cheating as the dependent binary / proportion variable).

Two types of result are given for each experiment:

First, a table listing deviance tests for the full (saturated) model. The analysis starts with the NULL model (all frequencies are equal) in the first row. Each subsequent row lists what happens to the deviance (of the model in the previous row) when a factor is added. A significant drop in deviance means adding the factor to the model contributes to predicting the difference between expected and observed frequencies. For hints of corroboration of the hypotheses reported in the paper, significant interactions between a design factor and Cheating are necessary.

Second, a mosaic plot is displayed: a graphical representation of the conditional cell frequencies. The mosaic plot also indicates which residual frequencies (observed - expected) are significantly below (red) or above (blue) the expected frequencies (the residuals are interpretable as Z-scores). The coloured cells contribute most to a high, and possibly significant, \( \chi^2 \) value.

Note: The significance of the change in deviance can depend on the order in which factors are added to the model and is not the same as a significant beta weight in a regression model.
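For reference, deviance tables like the ones below can be produced by calls of roughly the following form. This is a minimal sketch with hypothetical placeholder counts, not the actual analysis script (which is in the Markdown file linked above):

```r
# Minimal sketch of the log-linear analysis for a 2 (Cheating) x 3 (Prime)
# design such as Experiment 1; the cell counts are hypothetical placeholders.
exp1 <- data.frame(
  Prime    = rep(c("Money", "Time", "Ctrl"), each = 2),
  Cheating = rep(c("YES", "NO"), times = 3),
  Count    = c(26, 9, 12, 24, 19, 17)
)
fit <- glm(Count ~ Cheating * Prime, family = poisson, data = exp1)
anova(fit, test = "Chisq")  # sequential Analysis of Deviance table, as below
# Mosaic plot with cells shaded by the size of their Pearson residuals:
mosaicplot(xtabs(Count ~ Prime + Cheating, data = exp1), shade = TRUE)
```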

> [1] "Experiment 1"
> Analysis of Deviance Table
> 
> Model: poisson, link: log
> 
> Response: Count
> 
> Terms added sequentially (first to last)
> 
> 
>                Df Deviance Resid. Df Resid. Dev Pr(>Chi)    
> NULL                               5       24.8             
> Cheating        1     9.33         4       15.4  0.00225 ** 
> Prime           2     0.02         2       15.4  0.98981    
> Cheating:Prime  2    15.41         0        0.0  0.00045 ***
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

[Mosaic plot for Experiment 1 (chunk loglin)]

> [1] "Experiment 2"
> Analysis of Deviance Table
> 
> Model: poisson, link: log
> 
> Response: Count
> 
> Terms added sequentially (first to last)
> 
> 
>                     Df Deviance Resid. Df Resid. Dev Pr(>Chi)    
> NULL                                    7      19.64             
> Cheating             1    13.86         6       5.78   0.0002 ***
> Prime                1     0.25         5       5.52   0.6146    
> Test                 1     0.00         4       5.52   1.0000    
> Cheating:Prime       1     1.51         3       4.02   0.2198    
> Cheating:Test        1     2.53         2       1.48   0.1114    
> Prime:Test           1     0.03         1       1.45   0.8609    
> Cheating:Prime:Test  1     1.45         0       0.00   0.2284    
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

[Mosaic plot for Experiment 2 (chunk loglin)]

> [1] "Experiment 3"
> Analysis of Deviance Table
> 
> Model: poisson, link: log
> 
> Response: Count
> 
> Terms added sequentially (first to last)
> 
> 
>                     Df Deviance Resid. Df Resid. Dev Pr(>Chi)  
> NULL                                    7      11.50           
> Cheating             1     2.14         6       9.36    0.144  
> Prime                1     0.03         5       9.32    0.855  
> Test                 1     0.03         4       9.29    0.855  
> Cheating:Prime       1     4.24         3       5.05    0.040 *
> Cheating:Test        1     2.85         2       2.21    0.092 .
> Prime:Test           1     0.50         1       1.71    0.481  
> Cheating:Prime:Test  1     1.71         0       0.00    0.191  
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

[Mosaic plot for Experiment 3 (chunk loglin)]

> [1] "Experiment 4"
> Analysis of Deviance Table
> 
> Model: poisson, link: log
> 
> Response: Count
> 
> Terms added sequentially (first to last)
> 
> 
>                Df Deviance Resid. Df Resid. Dev Pr(>Chi)    
> NULL                               5       21.3             
> Cheating        1     4.22         4       17.1  0.03996 *  
> Prime           2     0.29         2       16.8  0.86607    
> Cheating:Prime  2    16.76         0        0.0  0.00023 ***
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

[Mosaic plot for Experiment 4 (chunk loglin)]

Conclusion log-linear analysis:
This alternative, and in my opinion more appropriate, analysis agrees with the results obtained after correction for multiple comparisons and continuity:

  • The mosaic plots show that there may be some unexpected factors driving the “effects” reported in the paper:
    • In Experiments 1 & 4 it is not so much the observed frequency of people who did cheat, but rather the number of participants who did not cheat, that deviates from the expected frequencies based on the table margins.
    • The Money prime caused fewer people NOT to cheat, whereas the Time prime caused more people NOT to cheat.
  • If there is a difference in the amount of Cheating between samples, it is likely a “main effect” between the Time and Money primes (the Cheating:Prime interaction), which causes a significant drop in deviance in Experiments 1, 3 and 4.
  • Experiment 2 stands out: the observed differences in Cheating are unlikely to be due to chance, yet none of the other factors contributes to explaining the differences between expected and observed frequencies.

The point about the mosaic plots is not just semantics or methodologists' nit-picking. What it tells us is that, e.g., in mosaic plot Table 1.1, among the observed frequencies for CheatYES the cell Money does not stand out much from Time and Control relative to what may be expected by chance, whereas for CheatNO the cell Money does stand out as different.

2. Exact odds ratios of 2x2 subtables to test hypotheses using Effect Size CIs

Effect Size Confidence Intervals:
To get a clearer idea about the significance of the differences between cells, I calculate confidence intervals around the effect size associated with contingency tables. The CIs in Figure 1 below are based on the exact Odds Ratio (using the noncentral hypergeometric distribution) for a 2x2 sub-table of the full design, obtained from Fisher's Exact Test, testing against \( H_0: OR = 1 \).
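A minimal sketch of such a test for a single sub-table (the counts are hypothetical placeholders):

```r
# Exact odds ratio and CI for one 2x2 sub-table (hypothetical counts;
# rows = Prime, columns = Cheating). The confidence level is adjusted
# for 3 comparisons, as in Experiments 1 & 4.
tab <- matrix(c(26,  9,
                12, 24),
              nrow = 2, byrow = TRUE,
              dimnames = list(Prime = c("Money", "Time"),
                              Cheating = c("YES", "NO")))
ft <- fisher.test(tab, conf.level = 1 - 0.05 / 3)
ft$estimate       # conditional MLE of the odds ratio
ft$conf.int       # exact CI from the noncentral hypergeometric distribution
log(ft$estimate)  # the log odds ratio plotted in Figure 1
```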

> [1] "Figure 1. Exact log Odds Ratio's of 2x2 tables comparing frequency of Cheating between independent samples in each experiment."

[Figure 1 (chunk ChiCIs)]

Note:
Here, the confidence levels have been adjusted to account for the fact that 3 (Exp. 1 & 4) or 4 (Exp. 2 & 3) sub-tables of the full design were compared (1 - (0.05 / #tests)). The exact p-value from Fisher's Exact Test reported in the figure was multiplied by the number of comparisons in each experiment.

Conclusion Proportion data

  • If there is an effect, it exists as a “main-effect” difference between the Money and Time primed samples in Experiment 1 and 4.
  • Experiment 3 No Mirror: Money - Time is a marginal case.
  • Experiment 2 did not yield any substantial effects.
  • 4-5 of the 7 statistical inferences in the paper that are based on proportion data should be considered invalid.

II. Analysis of extent of cheating

The extent of Cheating concerns the difference between the actual accuracy (which is not provided as a result) and the accuracy reported by a participant.
Experiments 1-3 report analyses of the extent of Cheating, including means and SDs. The sample size assumptions for Experiments 2 and 3 are the same as above.

Compare Cohen's d CIs

I created CIs around the effect sizes based on the means and SDs reported for Experiments 1-3, using the R package MBESS.
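A minimal sketch of the computation for a single contrast (the means and SDs below are hypothetical placeholders, not values from the paper):

```r
# Cohen's d with an exact CI based on the noncentral t distribution (MBESS).
# Means and SDs are hypothetical placeholders for one Money-Time contrast.
library(MBESS)
m1 <- 0.62; s1 <- 0.49; n1 <- 35  # Money condition (placeholder values)
m2 <- 0.29; s2 <- 0.45; n2 <- 36  # Time condition (placeholder values)
s_pooled <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2))
d <- (m1 - m2) / s_pooled
ci.smd(smd = d, n.1 = n1, n.2 = n2, conf.level = 0.95)
```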

> [1] "Figure 2. Cohen's d with exact CIs comparing extent of Cheating between independent samples in experiment 1-3."

[Figure 2 (chunk ExtCIs)]

Conclusion Extent of Cheating

The pattern is the same as in the previous analyses:

  • Experiment 1 shows a clear effect between Money and Time samples
  • Experiment 3 No Mirror: Money - Time is again a close call

III. HAPPE-ing (Hypothesising After Post-Publication Evaluation)

Should reviewers have noticed these issues with data analysis?

Yes, they should have!

Even without re-analysing the published data as I have done here, the authors' conclusions can be questioned based on a comparison of very elementary results:

Across four experiments, using different primes and a variety of measures and tasks, we consistently
found that shifting people’s attention to time decreases dishonesty. Priming time makes people reflect
on who they are, and this self-reflection reduces their likelihood of behaving dishonestly.

The clue is to compare the results across the 4 experiments and evaluate whether it is valid to infer that the core postulates have been corroborated. The designs and materials are slightly different each time, but if the outcomes (e.g., the proportion of cheating behaviour) vary systematically with one or more of the experimental differences, there may be another variable at work here.

One result that begs explanation is the drop in the proportion of Cheating in all the samples of Experiment 2 when compared to the other experiments. What is special about its procedure and methods? Regrettably, more than one potential intervening factor changes with respect to Experiment 1.

A second odd omission in the interpretation of the results is the level of accuracy achieved by participants. In Experiments 1-3, the urge to cheat must have been smaller for a participant who had achieved 90% accuracy. Experiment 4 is somewhat different in that the cheating opportunity concerns one “bottleneck” problem that is difficult to solve, but that has to be correct in order for the other, more easily solvable problems to count towards the final reward. Here, accuracy could have the opposite effect, with less accurate participants cheating less: a participant who solved 0 or only 1 item past the “bottleneck” item might be less inclined to cheat than one who solved every problem except the “bottleneck” item.

What is mediating what?

The figure below shows the interaction between the maximal financial incentive that could be awarded and the proportion of Cheating for each prime and experimental condition (indicating whether a mediator variable was manipulated in addition to exposure to a prime). Note that the Intelligence and No Mirror conditions of Experiments 2 and 3, respectively, are considered similar to Experiments 1 and 4; that is, they reflect conditions in which Self-reflection was not induced by any means other than priming:

[Figure: proportion Cheating vs. maximal financial reward, per Prime and Mediator condition (chunk Exp2_Reward)]

This relationship can be tested in a generalised linear model, of course in full awareness that this is exploratory HAPPE-ing. I assume the samples from each experiment are independent and use the number of cheaters vs. non-cheaters as the dependent binomial variable. The model contains only those effects for which data are available (e.g., no interactions involving both Prime and Mediator).

Note:
A generalised linear mixed model (GLMM) with sample ID as a random effect gives similar results.
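A minimal sketch of that GLMM variant (the SampleID column is a hypothetical identifier I introduce for illustration; lme4 is assumed):

```r
# Hedged sketch of the GLMM mentioned in the note above. The data frame
# `reward` is assumed to hold one row per independent sample, with cheater
# counts and a (hypothetical) SampleID column identifying each sample.
library(lme4)
fit_glmm <- glmer(cbind(CheatYES, CheatNO) ~ Reward + Prime + Mediator +
                    Reward:Prime + Reward:Mediator + (1 | SampleID),
                  family = binomial, data = reward)
summary(fit_glmm)
```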

> 
> Call:
> glm(formula = cbind(CheatYES, CheatNO) ~ Reward + Prime + Mediator + 
>     Reward * Prime + Reward * Mediator, family = binomial, data = reward)
> 
> Deviance Residuals: 
>    Min      1Q  Median      3Q     Max  
> -1.153  -0.695  -0.122   0.251   1.956  
> 
> Coefficients:
>                                Estimate Std. Error z value Pr(>|z|)   
> (Intercept)                     -0.4495     0.2195   -2.05   0.0405 * 
> Reward                           0.0111     0.0219    0.51   0.6125   
> PrimeNone                        0.5868     0.3973    1.48   0.1397   
> PrimeMoney                       0.6040     0.2824    2.14   0.0325 * 
> MediatorSelf-reflection         -0.8128     0.3147   -2.58   0.0098 **
> Reward:PrimeNone                 0.0167     0.0359    0.47   0.6416   
> Reward:PrimeMoney                0.0698     0.0327    2.13   0.0329 * 
> Reward:MediatorSelf-reflection  -0.0189     0.0434   -0.44   0.6626   
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> 
> (Dispersion parameter for binomial family taken to be 1)
> 
>     Null deviance: 76.292  on 13  degrees of freedom
> Residual deviance: 11.035  on  6  degrees of freedom
> AIC: 82.48
> 
> Number of Fisher Scoring iterations: 4
> [1] "Null-model deviance test: p < 1.33525644154704e-11"

In the table above, the model Intercept corresponds to the odds of Cheating, compared to the Null model, when the predictors have the values Prime = Time, Mediator = None and Reward = 0. Compared to the overall probability of observing Cheating behaviour, it thus seems that when the Time prime is presented without an induction of Self-reflection and without a financial reward incentive, the odds of Cheating drop.

This appears to corroborate the second postulate, but note that in this analysis (just as in the previous analyses) there is no real difference between the Time prime and Prime = None; the standard errors around these parameters are quite large. A clearer picture emerges when the Intercept is defined as Prime = None, Mediator = None and Reward = 0, and the Odds Ratios (the exponentiated parameter estimates) are compared:
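A minimal sketch of how such a releveled table of Odds Ratios can be obtained ("fit" stands for the glm object fitted above):

```r
# Redefine the reference level and exponentiate the estimates to get
# odds ratios with profile-likelihood CIs.
library(MASS)  # provides the profile-likelihood confint() method for glm
reward$Prime <- relevel(reward$Prime, ref = "None")
fit_none <- update(fit)  # refit with Prime = None as the reference level
round(exp(cbind(OR = coef(fit_none), confint(fit_none))), 2)
```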

> [1] "Odds Ratios compared to Prime = None, with profile likelihood CI.95"
>                                  OR 2.5 % 97.5 %
> (Intercept)                    1.15  0.60   2.21
> Reward                         1.03  0.97   1.09
> PrimeTime                      0.56  0.25   1.21
> PrimeMoney                     1.02  0.47   2.21
> MediatorSelf-reflection        0.44  0.24   0.81
> Reward:PrimeTime               0.98  0.92   1.05
> Reward:PrimeMoney              1.05  0.98   1.14
> Reward:MediatorSelf-reflection 0.98  0.90   1.07

The odds ratios in the table above are multiplicative changes in the odds of Cheating when a predictor increases by 1 unit. So an OR < 1 decreases the odds of observing Cheating behaviour and an OR > 1 increases them. The 95% CIs are based on the profile likelihood and show that in most cases the effect covers a range both below and above 1. Only the range for the effect of Self-reflection lies entirely below 1.
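For example, at Reward = 0 with Prime = None and no mediator, the baseline odds of Cheating are 1.15 (a probability of \( 1.15/2.15 \approx 0.53 \)); adding the Self-reflection induction multiplies these odds by 0.44, giving \( 1.15 \times 0.44 \approx 0.51 \), i.e., a probability of Cheating of \( 0.51/1.51 \approx 0.34 \).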

One can interpret the modelled relationship between these variables as follows:

  • There is a weak positive association between the Maximal Financial Reward and the Probability of Cheating
  • The association changes with the value of Prime, becoming stronger when Money is primed, weaker when Time is primed
  • The induction of Self-reflection does not cause the association to change; it changes the intercept, the baseline Probability of Cheating at Reward = 0

A graphical representation of the model predictions more clearly reveals this relationship:

[Figure: GLM model predictions (chunk Exp2_GLM3)]


Conclusions, Discussion and further HAPPE-ing

  1. The significant results between Time and Money in Experiments 1 and 4 probably arise due to the increase in Probability of Cheating when there is a financial reward and Money is primed.
    • It is unlikely there are any other “real” differences in these data except for the induction of Self-reflection: Model predictions show it decreases the Probability of Cheating by the same amount for different primes
    • Note that there were no actual data points for None + Self-reflection
  2. The missing predictors in the Probability of Cheating analysis are the actual and the reported accuracy of performance (the number of correctly solved problems and the money received, respectively). These values cannot be inferred from the extent-of-cheating analyses. It seems reasonable to assume that in most experiments participants who were more accurate had less incentive to engage in Cheating.
    • This raises the question of whether the effects are driven by some sort of speed-accuracy instruction: naturally, Time = Money, but taking the time to solve the problems may lead to higher accuracy and less incentive to cheat; likewise, a focus on producing as many answers as possible may introduce errors and promote cheating.

In science there is a moral obligation to do one's best to be as accurate as possible, and usually this means it is wise to be as modest as possible about one's scientific claims. I am not an expert in this field, but the sheer number of questions that can be raised about the validity of the inferences made in this paper makes one wonder who the peers were that reached consensus about the credibility of this research, and what their area of expertise was.

I am not saying this is irrelevant or poor research; the two effects that survive the scrutiny of 3PR are certainly interesting. I am just a little worried that this paper says more about the morality of contemporary scientific publishing than about the scientific study of moral behaviour.


Some notes about this file:

  • This file was created using Markdown in RStudio: unless otherwise indicated in the code blocks (e.g., by require), only base R packages are used.
  • All the analyses are based on results reported in the publication.
  • The one true gospel of statistical inference does not exist, and more than one approach to analysing these data may be defensible.
  • Therefore: please be aware that these comments and suggestions reflect my own preferences and standards in these matters. If you feel I should change some of my preferences and/or standards, please let me know, because I review and adjust them on a regular basis.