\documentclass{chapman}
%%% copy Sweave.sty definitions
%%% keeps `sweave' from adding `\usepackage{Sweave}': DO NOT REMOVE
%\usepackage{Sweave}
\RequirePackage[T1]{fontenc}
\RequirePackage{graphicx,ae,fancyvrb}
\IfFileExists{upquote.sty}{\RequirePackage{upquote}}{}
\usepackage{relsize}
\DefineVerbatimEnvironment{Sinput}{Verbatim}{}
\DefineVerbatimEnvironment{Soutput}{Verbatim}{fontfamily=courier,
fontshape=it,
fontsize=\relsize{-1}}
\DefineVerbatimEnvironment{Scode}{Verbatim}{}
\newenvironment{Schunk}{}{}
%%% environment for raw output
\newcommand{\SchunkRaw}{\renewenvironment{Schunk}{}{}
\DefineVerbatimEnvironment{Soutput}{Verbatim}{fontfamily=courier,
fontshape=it,
fontsize=\small}
\rawSinput
}
%%% environment for labeled output
\newcommand{\nextcaption}{}
\newcommand{\SchunkLabel}{
\renewenvironment{Schunk}{\begin{figure}[ht] }{\caption{\nextcaption}
\end{figure} }
\DefineVerbatimEnvironment{Sinput}{Verbatim}{frame = topline}
\DefineVerbatimEnvironment{Soutput}{Verbatim}{frame = bottomline,
samepage = true,
fontfamily=courier,
fontshape=it,
fontsize=\relsize{-1}}
}
%%% S code with line numbers
\DefineVerbatimEnvironment{Sinput}
{Verbatim}
{
%% numbers=left
}
\newcommand{\numberSinput}{
\DefineVerbatimEnvironment{Sinput}{Verbatim}{numbers=left}
}
\newcommand{\rawSinput}{
\DefineVerbatimEnvironment{Sinput}{Verbatim}{}
}
%%% R / System symbols
\newcommand{\R}{\textsf{R}}
\newcommand{\rR}{{R}}
\renewcommand{\S}{\textsf{S}}
\newcommand{\SPLUS}{\textsf{S-PLUS}}
\newcommand{\rSPLUS}{{S-PLUS}}
\newcommand{\SPSS}{\textsf{SPSS}}
\newcommand{\EXCEL}{\textsf{Excel}}
\newcommand{\ACCESS}{\textsf{Access}}
\newcommand{\SQL}{\textsf{SQL}}
%%\newcommand{\Rpackage}[1]{\hbox{\rm\textit{#1}}}
%%\newcommand{\Robject}[1]{\hbox{\rm\texttt{#1}}}
%%\newcommand{\Rclass}[1]{\hbox{\rm\textit{#1}}}
%%\newcommand{\Rcmd}[1]{\hbox{\rm\texttt{#1}}}
\newcommand{\Rpackage}[1]{\index{#1 package@{\fontseries{b}\selectfont #1} package} {\fontseries{b}\selectfont #1}}
\newcommand{\rpackage}[1]{{\fontseries{b}\selectfont #1}}
\newcommand{\Robject}[1]{\texttt{#1}}
\newcommand{\Rclass}[1]{\index{#1 class@\textit{#1} class}\textit{#1}}
\newcommand{\Rcmd}[1]{\index{#1 function@\texttt{#1} function}\texttt{#1}}
\newcommand{\Roperator}[1]{\texttt{#1}}
\newcommand{\Rarg}[1]{\texttt{#1}}
\newcommand{\Rlevel}[1]{\texttt{#1}}
%%% other symbols
\newcommand{\file}[1]{\hbox{\rm\texttt{#1}}}
%%\newcommand{\stress}[1]{\index{#1}\textit{#1}}
\newcommand{\stress}[1]{\textit{#1}}
\newcommand{\booktitle}[1]{\textit{#1}} %%'
%%% Math symbols
\usepackage{amstext}
\usepackage{amsmath}
\newcommand{\E}{\mathsf{E}}
\newcommand{\Var}{\mathsf{Var}}
\newcommand{\Cov}{\mathsf{Cov}}
\newcommand{\Cor}{\mathsf{Cor}}
\newcommand{\x}{\mathbf{x}}
\newcommand{\y}{\mathbf{y}}
\renewcommand{\a}{\mathbf{a}}
\newcommand{\W}{\mathbf{W}}
\newcommand{\C}{\mathbf{C}}
\renewcommand{\H}{\mathbf{H}}
\newcommand{\X}{\mathbf{X}}
\newcommand{\B}{\mathbf{B}}
\newcommand{\V}{\mathbf{V}}
\newcommand{\I}{\mathbf{I}}
\newcommand{\D}{\mathbf{D}}
\newcommand{\bS}{\mathbf{S}}
\newcommand{\N}{\mathcal{N}}
\renewcommand{\L}{L}
\renewcommand{\P}{\mathsf{P}}
\newcommand{\K}{\mathbf{K}}
\newcommand{\m}{\mathbf{m}}
\newcommand{\argmin}{\operatorname{argmin}\displaylimits}
\newcommand{\argmax}{\operatorname{argmax}\displaylimits}
\newcommand{\bx}{\mathbf{x}}
\newcommand{\bbeta}{\mathbf{\beta}}
%%% links
\usepackage{hyperref}
\hypersetup{%
pdftitle = {A Handbook of Statistical Analyses Using R (3rd Edition)},
pdfsubject = {Book},
pdfauthor = {Torsten Hothorn and Brian S. Everitt},
colorlinks = {black},
linkcolor = {black},
citecolor = {black},
urlcolor = {black},
hyperindex = {true},
linktocpage = {true},
}
%%% captions & tables
%% : conflics with figure definition in chapman.cls
%%\usepackage[format=hang,margin=10pt,labelfont=bf]{caption}
%%
\usepackage{longtable}
\usepackage[figuresright]{rotating}
%%% R symbol in chapter 1
\usepackage{wrapfig}
%%% Bibliography
\usepackage[round,comma]{natbib}
\renewcommand{\refname}{References \addcontentsline{toc}{chapter}{References}}
\citeindexfalse
%%% texi2dvi complains that \newblock is undefined, hm...
\def\newblock{\hskip .11em plus .33em minus .07em}
%%% Example sections
\newcounter{exercise}[chapter]
\setcounter{exercise}{0}
\newcommand{\exercise}{\stepcounter{exercise} \item{Ex.~\arabic{chapter}.\arabic{exercise} }}
%% URLs
\newcommand{\curl}[1]{\begin{center} \url{#1} \end{center}}
%%% for manual corrections
%\renewcommand{\baselinestretch}{2}
%%% plot sizes
\setkeys{Gin}{width=0.95\textwidth}
%%% color
\usepackage{color}
%%% hyphenations
\hyphenation{drop-out}
\hyphenation{mar-gi-nal}
%%% new bidirectional quotes need
\usepackage[utf8]{inputenc}
%\usepackage{setspace}
\definecolor{sidebox_todo}{rgb}{1,1,0.2}
\newcommand{\todo}[1]{
\hspace{0pt}%
\marginpar{%
\fcolorbox{black}{sidebox_todo}{%
\parbox{\marginparwidth} {
\raggedright\sffamily\footnotesize{TODO: #1}%
}
}%
}
}
\begin{document}
%% Title page
\title{A Handbook of Statistical Analyses Using \R{} --- 3rd Edition}
\author{Torsten Hothorn and Brian S. Everitt}
\maketitle
%%\VignetteIndexEntry{Chapter Bayesian Inference}
%%\VignetteDepends{rmeta,coin}
\setcounter{chapter}{17}
\SweaveOpts{prefix.string=figures/HSAUR,eps=FALSE,keep.source=TRUE}
<>=
rm(list = ls())
s <- search()[-1]
s <- s[-match(c("package:base", "package:stats", "package:graphics", "package:grDevices",
"package:utils", "package:datasets", "package:methods", "Autoloads"), s)]
if (length(s) > 0) sapply(s, detach, character.only = TRUE)
if (!file.exists("tables")) dir.create("tables")
if (!file.exists("figures")) dir.create("figures")
set.seed(290875)
options(prompt = "R> ", continue = "+ ",
width = 63, # digits = 4,
show.signif.stars = FALSE,
SweaveHooks = list(leftpar = function()
par(mai = par("mai") * c(1, 1.05, 1, 1)),
bigleftpar = function()
par(mai = par("mai") * c(1, 1.7, 1, 1))))
HSAURpkg <- require("HSAUR3")
if (!HSAURpkg) stop("cannot load package ", sQuote("HSAUR3"))
rm(HSAURpkg)
### hm, R-2.4.0 --vanilla seems to need this
a <- Sys.setlocale("LC_ALL", "C")
###
book <- TRUE
refs <- cbind(c("AItR", "DAGD", "SI", "CI", "ANOVA", "MLR", "GLM",
"DE", "RP", "GAM", "SA", "ALDI", "ALDII", "SIMC", "MA", "PCA",
"MDS", "CA"), 1:18)
ch <- function(x) {
ch <- refs[which(refs[,1] == x),]
if (book) {
return(paste("Chapter~\\\\ref{", ch[1], "}", sep = ""))
} else {
return(paste("Chapter~", ch[2], sep = ""))
}
}
if (file.exists("deparse.R"))
source("deparse.R")
setHook(packageEvent("lattice", "attach"), function(...) {
lattice.options(default.theme =
function()
standard.theme("pdf", color = FALSE))
})
@
\pagestyle{headings}
<>=
book <- FALSE
@
\chapter[Bayesian Inference]{Incorporating Prior Knowledge via Bayesian Inference:
Smoking and Lung Cancer \label{BI}}
\section{Introduction}
\index{Smoking and lung cancer|(}
At the beginning of the 20th century, the death toll due to lung cancer was
on the rise and the search for possible causes began. For lung cancer in pit
workers, animal experiments showed that the so-called `Schneeberg lung
disease' was induced by radiation. But this could not explain the
increasing incidence of lung cancer in the general population. The
identification of possible risk factors was a challenge for epidemiology and
statistics, both disciplines being still in their infancy in the 1920s and 1930s.
The first modern controlled epidemiological study on the effect of smoking
on lung cancer was performed by Franz Hermann M\"uller as part of his
dissertation at the University of Cologne in 1939. The results were
published a year later \citep{HSAUR:Mueller1940}. M\"uller sent out
questionnaires to the relatives of people who had recently died of lung
cancer, asking about the smoking behavior and its intensity of the deceased
relative. He also sent the questionnaire to healthy controls to obtain
information about the smoking behavior in a control group, although it is
not clear how this control group was defined. The number of lung cancer
patients and healthy controls in five different groups (nonsmokers to
extreme smokers) are given in Table~\ref{BI-Smoking_Mueller1940-tab}.
<>=
data("Smoking_Mueller1940", package = "HSAUR3")
toLatex(HSAURtable(Smoking_Mueller1940),
caption = paste("Smoking and lung cancer case-control study by M\\\"uller (1940).",
"The smoking intensities were defined by the number of",
"cigarettes smoked daily:",
"1-15 (moderate), 16-25 (heavy), 26-35 (very heavy),",
"and more than 35 (extreme)."),
label = "BI-Smoking_Mueller1940-tab")
@
Four years later Erich Sch\"oninger also wrote his dissertation on the
association between smoking and lung cancer and, together with his
supervisor Eberhard Schairer at the University of Jena, published his
results on a case-control study
\citep{HSAUR:SchairerSchoeninger1944} where he assessed the smoking behavior
of lung cancer patients, patients diagnosed with other forms of cancer, and
also a healthy control group. The data are given in
Table~\ref{BI-Smoking_SchairerSchoeniger1944-tab}.
<>=
x <- as.table(Smoking_SchairerSchoeniger1944[,
c("Lung cancer", "Healthy control")])
toLatex(HSAURtable(x, xname = "Smoking_SchairerSchoeniger1944"),
caption = paste("Smoking and lung cancer case-control study by Schairer and Sch\\\"oniger (1944). Cancer other than lung cancer omitted.",
"The smoking intensities were defined by the number of",
"cigarettes smoked daily:",
"1-5 (moderate), 6-10 (medium), 11-20 (heavy),",
"and more than 20 (very heavy)."),
label = "BI-Smoking_SchairerSchoeniger1944-tab")
@
Shortly after the war, a Dutch epidemiologist reported on a case-control
study performed in Amsterdam \citep{HSAUR:Wassink1945} and found similar
results as the two German studies; see
Table~\ref{BI-Smoking_Wassink1945-tab}.
<>=
data("Smoking_Wassink1945", package = "HSAUR3")
toLatex(HSAURtable(Smoking_Wassink1945),
caption = paste("Smoking and lung cancer case-control study by Wassink (1945).",
"Smoking categories correspond to the categories used by M\\\"uller (1940)."),
label = "BI-Smoking_Wassink1945-tab")
@
In 1950 perhaps the most important, but not the first, case-control study showing
an increasing risk of developing lung cancer with the amount of tobacco
smoked, was published in Great Britain by Richard Doll and Austin Bradford Hill
\citep{HSAUR:DollHill1950}. We restrict discussion here to data obtained for males
and the data shown in Table~\ref{BI-Smoking_DollHill1950-tab} corresponds to
the most recent amount of tobacco consumed regularly by smokers before
disease onset \citep[Table~V in][]{HSAUR:DollHill1950}.
<>=
data("Smoking_DollHill1950", package = "HSAUR3")
x <- as.table(Smoking_DollHill1950[,,"Male", drop = FALSE])
toLatex(HSAURtable(x, xname = "Smoking_DollHill1950"),
caption = paste("Smoking and lung cancer case-control study (only males) by Doll and Hill (1950).",
"The labels for the smoking categories give the number of cigarettes smoked every day."),
label = "BI-Smoking_DollHill1950-tab")
@
Although the design of the studies by \cite{HSAUR:Mueller1940} and
\cite{HSAUR:SchairerSchoeninger1944}, especially the selection of their
control groups, can be criticized \citep[see][for a detailed
discussion]{HSAUR:Morabia2013} and the study by \cite{HSAUR:DollHill1950}
was larger than the older studies and more detailed information on the
smoking behavior was obtained by direct patient interviews, the information
provided by the earlier studies was not taken into account by
\cite{HSAUR:DollHill1950}. They cite \cite{HSAUR:Mueller1940} in their
introduction, but did not compare their findings with his results. It is
remarkable to see that both \cite{HSAUR:SchairerSchoeninger1944} and
\cite{HSAUR:Wassink1945} extensively made use of the report by
\cite{HSAUR:Mueller1940} and go as far as analyzing the merged data
\citep[Grafiek I, E, and F, in][]{HSAUR:Wassink1945}. In an informal way,
these authors wanted to use the already available information, in today's
terms called `prior knowledge', to make a stronger case with the new data.
Formal statistical methods to incorporate prior knowledge into data analysis
as part of the `Bayesian' way of doing statistical analyses were developed
in the second half of the last century, and we will focus on them in the
present chapter. \index{Smoking and lung cancer|)}
\section{Bayesian Inference}
\section{Analysis Using \R{}}
\subsection{One-by-one Analysis}
For the analysis of the four different case-control studies on smoking and
lung cancer, we will (retrospectively, of course) update our knowledge with
every new study. We begin with a re-analysis of the data described by
\cite{HSAUR:Mueller1940}. Using an approximate permutation test introduced in
Chapter~\ref{CI} for the hypothesis of independence of the amount of
tobacco smoked and group membership (lung cancer or healthy control), we get
<>=
library("coin")
set.seed(29)
independence_test(Smoking_Mueller1940, teststat = "quad",
distribution = approximate(100000))
@
and there is clearly a strong association between the number of cigarettes
smoked and incidence of lung cancer. Because the amount of tobacco smoked
is an ordered categorical variable, it is more appropriate to take this
information into account, for example by means of a linear association test
(see Chapter~\ref{CI}). Nonsmokers receive a score of zero, and for the
remaining groups we choose the mid-point of the intervals of daily
cigarettes smoked that were used by \cite{HSAUR:Mueller1940} to define his
groups:
<>=
ssc <- c(0, 1 + 14 / 2, 16 + 9 / 2, 26 + 9 / 2, 40)
independence_test(Smoking_Mueller1940, teststat = "quad",
scores = list(Smoking = ssc),
distribution = approximate(100000))
@
The result shows that the data are in favor of an ordered alternative. The
$p$-values obtained from approximate permutation tests are attractive
because no distributional assumptions are required, but it is hard to derive
estimates and confidence intervals for interpretable parameters from such
tests. We will therefore now switch to logistic regression models as
described in Chapter~\ref{GLM} to model the odds of lung cancer in the
different smoking groups. Before we start, let us define a small function
for computing odds (for intercept parameters) and odds ratios (for
difference parameters) and corresponding confidence intervals from a
logistic regression model:
<>=
eci <- function(model)
cbind("Odds (Ratio)" = exp(coef(model)),
exp(confint(model)))
@
We model the probability of developing lung cancer given the smoking
behavior. Because our data was obtained from case-control studies where the
groups (lung cancer patients and healthy controls) were defined first and
only after that we observed data on the smoking behavior (in a so-called
\stress{choice-based sampling}), this may seem the wrong model to start
with. However, the marginal distribution of the two groups only changes the
intercept in such a logistic model and the effects of smoking can still be
interpreted in the way we require \citep[see][for example]{HSAUR:Tutz2012}.
The formula for specifying a logistic regression model can be set up such
that the response is a matrix with two columns for each smoking group
consisting of the number of lung cancer deaths and the number of healthy
controls. Although smoking is an ordered factor, we first fit the model
with treatment contrasts, i.e., we can interpret the $\exp$ of the
regression coefficients as odds ratios between each smoking group and
nonsmokers:
<>=
smoking <- ordered(rownames(Smoking_Mueller1940),
levels = rownames(Smoking_Mueller1940))
contrasts(smoking) <- "contr.treatment"
eci(glm(Smoking_Mueller1940 ~ smoking, family = binomial()))
@
We see that all but one of the odds ratios increase with the amount of
tobacco smoked with a maximum of almost $30$ for extreme smokers (more than
$35$ cigarettes per day). The likelihood confidence intervals are rather
wide due to the limited sample size, but also the lower limit increases with
smoking.
An alternative model formulation can help to compare each smoking group with
the preceding group, the so-called split-coding \citep[for this and other
codings see][]{HSAUR:Tutz2012}:
<>=
K <- diag(nlevels(smoking) - 1)
K[lower.tri(K)] <- 1
contrasts(smoking) <- rbind(0, K)
eci(glm(Smoking_Mueller1940 ~ smoking, family = binomial()))
@
The two largest differences are between moderate smokers and nonsmokers
(\Robject{smoking1}) and between very heavy and heavy smokers
(\Robject{smoking3}). The latter group difference seems, at least judged by
the confidence interval, to be larger than expected under a model with no
effect of smoking.
For the analysis of the three remaining studies, we first perform
permutation tests for the independence of smoking and the two groups
(lung cancer and healthy controls) in males:
<>=
xSS44 <- as.table(Smoking_SchairerSchoeniger1944[,
c("Lung cancer", "Healthy control")])
ap <- approximate(100000)
pvalue(independence_test(xSS44,
teststat = "quad", distribution = ap))
pvalue(independence_test(Smoking_Wassink1945,
teststat = "quad", distribution = ap))
xDH50 <- as.table(Smoking_DollHill1950[,, "Male"])
pvalue(independence_test(xDH50,
teststat = "quad", distribution = ap))
@
All $p$-values indicate that the data are not well-described by the
independence model.
\subsection{Joint Bayesian Analysis}
For a Bayesian analysis, we first merge the data from all four studies into
one data frame. In doing so, we also merge the smoking groups in a way that
we only have three groups left: nonsmokers, moderate smokers, and heavy
smokers. These groups are chosen in a way that the number of daily
cigarettes is comparable. We first merge the heavy, very heavy, and extreme
smokers from \cite{HSAUR:Mueller1940}
<>=
(M <- rbind(Smoking_Mueller1940[1:2,],
colSums(Smoking_Mueller1940[3:5,])))
@
and proceed with the lung cancer patients and healthy controls from
\cite{HSAUR:SchairerSchoeninger1944} in the same way
<>=
SS <- Smoking_SchairerSchoeniger1944[,
c("Lung cancer", "Healthy control")]
(SS <- rbind(SS[1,], colSums(SS[2:3,]), colSums(SS[4:5,])))
@
and finally perform the same exercise for the \cite{HSAUR:Wassink1945}
and \cite{HSAUR:DollHill1950} data
<>=
(W <- rbind(Smoking_Wassink1945[1:2,],
colSums(Smoking_Wassink1945[3:4,])))
DH <- Smoking_DollHill1950[,, "Male"]
(DH <- rbind(DH[1,], colSums(DH[2:3,]), colSums(DH[4:6,])))
@
The three new groups are now called nonsmokers, moderate smokers, and
heavy smokers, and we set up a data frame that contains
the number of people in each of the possible groups for all studies:
<>=
smk <- c("Nonsmoker", "Moderate smoker", "Heavy smoker")
x <- expand.grid(Smoking = ordered(smk, levels = smk),
Diagnosis = factor(c("Lung cancer", "Control")),
Study = c("Mueller1940", "SchairerSchoeniger1944",
"Wassink1945", "DollHill1950"))
x$weights <- c(as.vector(M), as.vector(SS),
as.vector(W), as.vector(DH))
@
Before we fit logistic regression models using the data organized in such a
way, we define the contrasts for the smoking ordered factor and
expand the data in a way that each row corresponds to one person. This is necessary
because the \Rcmd{weights} argument to the \Rcmd{glm} function must not be used
to define case weights:
<>=
contrasts(x$Smoking) <- "contr.treatment"
x <- x[rep(1:nrow(x), x$weights),]
@
We now compute one logistic regression model for each study for later
comparisons:
<>=
models <- lapply(levels(x$Study), function(s)
glm(Diagnosis ~ Smoking, data = x, family = binomial(),
subset = Study == s))
names(models) <- levels(x$Study)
@
In 1939, M\"uller was hardly in the position to come up with a reasonable
prior for the odds ratios between moderate or heavy smokers and nonsmokers. So
we also use a noninformative prior and just perform the maximum likelihood analysis:
<>=
eci(models[["Mueller1940"]])
@
Four years later, the maximum likelihood results obtained for the
\cite{HSAUR:SchairerSchoeninger1944} data
<>=
eci(models[["SchairerSchoeniger1944"]])
@
could have been improved by using a normal prior for the difference in log odds
whose distribution is the distribution of the maximum likelihood estimator obtained
for M\"uller's data. At least approximately, we can compute posterior
$90\%$ credibility intervals and the posterior mode from the
Schairer and Sch\"oniger data by analyzing both data sets simultaneously.
We should, however, keep in mind that the odds of developing lung cancer
for nonsmokers is not really interesting for our analysis and that
the four studies may very well differ with respect to this intercept
parameter. Consequently, we don't want to specify a prior for the
intercept. One way to implement such a strategy is to exclude the intercept
term from the joint model while allowing a separate intercept for
each of the studies:
<>=
mM40_SS44 <- glm(Diagnosis ~ 0 + Study + Smoking, data = x,
family = binomial(),
subset = Study %in% c("Mueller1940",
"SchairerSchoeniger1944"))
eci(mM40_SS44)
@
We observe two important differences between the maximum likelihood and
Bayesian results for the Schairer and Sch\"oniger data: In the Bayesian
analysis, the estimated odds ratio for moderate smokers is closer to the
smaller value obtained from M\"uller's data and, more important, the
credibility intervals are much narrower and, one has to say, more realistic now.
An odds ratio as large as $40$ is hardly something one would expect to see in
practice.
If Wassink had been aware of Bayesian statistics, he could have used
the posterior distribution of the parameters from
our model \Robject{mM40\_SS44} as a prior distribution for analyzing his
data. The maximum likelihood results for his data
<>=
eci(models[["Wassink1945"]])
@
would have changed to
<>=
mM40_SS44_W45 <- glm(Diagnosis ~ 0 + Study + Smoking,
data = x, family = binomial(),
subset = Study %in% c("Mueller1940",
"SchairerSchoeniger1944",
"Wassink1945"))
eci(mM40_SS44_W45)
@
The rather small odds ratios obtained from the model fitted to the
Wassink data only are now closer to the estimates obtained from the
two previous studies and the variability, as given by the credibility intervals,
is much smaller.
Now, finally, the model for the Doll and Hill data reports rather large
odds ratios with wide confidence intervals:
<>=
eci(models[["DollHill1950"]])
@
With a (now rather strong) prior defined by the three earlier studies, we get
from the joint model for all four studies
<>=
m_all <- glm(Diagnosis ~ 0 + Study + Smoking, data = x,
family = binomial())
eci(m_all)
@
<>=
r <- eci(m_all)
xM <- round(r["SmokingModerate smoker", 2:3], 1)
xH <- round(r["SmokingHeavy smoker", 2:3], 1)
@
In 1950, the joint evidence based on such an analysis with an odds ratio
between $\Sexpr{xM[1]}$ and $\Sexpr{xM[2]}$ for moderate smokers and between
$\Sexpr{xH[1]}$ and $\Sexpr{xH[2]}$ for heavy smokers compared to
nonsmokers, would have made a much stronger case than any of the single
studies alone. It is interesting to see that with this strong prior for the
Doll and Hill study, we also get relatively large odds ratios when comparing
heavy to moderate smokers (see row labeled \Rcmd{Smoking2}):
<>=
K <- diag(nlevels(x$Smoking) - 1)
K[lower.tri(K)] <- 1
contrasts(x$Smoking) <- rbind(0, K)
eci(glm(Diagnosis ~ 0 + Study + Smoking, data = x,
family = binomial()))
@
\subsection{A Comparison with Meta Analysis}
One may ask how the Bayesian approach of progressively updating the estimates
considered here differs from a classical meta analysis described in
Chapter~\ref{MA}. We
first reshape the data into a form suitable for such an analysis
<>=
y <- xtabs(~ Study + Smoking + Diagnosis, data = x)
ntrtM <- margin.table(y, 1:2)[,"Moderate smoker"]
nctrl <- margin.table(y, 1:2)[,"Nonsmoker"]
ptrtM <- y[,"Moderate smoker","Lung cancer"]
pctrl <- y[,"Nonsmoker","Lung cancer"]
ntrtH <- margin.table(y, 1:2)[,"Heavy smoker"]
ptrtH <- y[,"Heavy smoker","Lung cancer"]
@
and then compute joint odds ratios and confidence intervals for moderate and
heavy smokers compared to nonsmokers:
<>=
library("rmeta")
meta.MH(ntrt = ntrtM, nctrl = nctrl,
ptrt = ptrtM, pctrl = pctrl)
meta.MH(ntrt = ntrtH, nctrl = nctrl,
ptrt = ptrtH, pctrl = pctrl)
@
For moderate smokers, the effect is a little weaker compared with the
results reported on earlier and for heavy smokers, the meta analysis
identifies a stronger effect for heavy smokers. Nevertheless, the
differences between the two rather different approaches are negligible and
the conclusions would have been the same.
\section{Summary of Findings}
We have seen that, using a Bayesian approach to incorporate prior knowledge
into a model, the odds of developing lung cancer increase with increased
amounts of smoking. Of course, our analysis here is very simplistic, because
we ignored that also pipe and cigar smokers were present in the data, we
merged the data based on a very rough assessment of the number of cigarettes
smoked per day, ignored whether or not the smokers inhaled the smoke into
their lungs, or if nonsmokers were subject to passive-smoking, as we call it
today. Most importantly, we must not misinterpret findings from
case-control studies as casual and, in fact, none of the authors cited here
did so. The debate on whether smoking, and which kind of smoking, actually
\stress{causes} lung cancer was initiated by the publications cited in this
chapter and many famous statisticians took part in the debate, for example,
Sir Ronald Fisher \citep{HSAUR:Fisher1959}, took the view that the inference
of causation was premature. In retrospect this was one issue (perhaps the
only one) where Fisher was mistaken.
\section{Final Comments}
There remain a few hard-line opponents of Bayesian inference (just a few)
who reject the method because of the use of subjective prior distributions
which, these opponents feel, have no place in scientific investigations.
And there are Bayesians who think that the only defense of using
non-Bayesian methods is incompetence.
But for an increasing number of statisticians Bayesian inference is very
attractive, because we can use the posterior distribution of the parameters
to draw conclusions from the data. Although this requires the specification
of a prior distribution, we have seen in this chapter that, using data from
previous experiments, priors can be defined in a reasonable way. It is not
absolutely necessary to rely on rather complex numerical procedures to`estimate' a posterior distribution. When we are willing to cut some
corners, we can implement simple Bayesian approaches using standard
software. We should also keep in mind that the prior can be interpreted as
a penalty on the parameters, and many penalization approaches therefore
have an (often implicit) connection to the Bayesian way of doing statistics.
Of course, just picking the prior that `works best' is dangerous and
almost surely inappropriate.
\section*{Exercises}
\begin{description}
\exercise
Produce a forest plot as introduced in Chapter~\ref{MA} for the four
smoking studies analyzed here.
\exercise
Produce a modified forest plot where one can see how the evidence
for smoking being related to lung cancer evolved between 1940 and 1950.
\exercise
Use the \Rpackage{INLA} add-on package to perform a similar analysis by
using the coefficients and their standard errors estimated from
our initial logistic regression model \texttt{m[["Mueller1940"]]} as parameters of a normal prior
for a logistic regression applied to the Schairer and Sch\"oniger data.
Compare the resulting credibility intervals for the two odds-ratios
with the approximate results obtained in this chapter.
\end{description}
%%\bibliographystyle{LaTeXBibTeX/refstyle}
%%\bibliography{LaTeXBibTeX/HSAUR}
\end{document}