Japanese Journal of Clinical Oncology
Basic Requirements For Study Reports
Methods section
Results section
Common Pitfalls In Statistics
Statistical significance
Association and causal relationship
Distribution and underlying assumptions
Categorical variables (qualitative data)
Correlation and regression
Survival analysis
Multivariate techniques
Recommended Textbooks And More Detailed Guidelines
References
A Guideline for Reporting Results of Statistical Analysis in Japanese Journal of Clinical Oncology
The paper gives guidelines to authors on the use of statistics, including statistical considerations when designing studies. Information on this article plus a list of other recommended books is available on the World Wide Web (http://wwwinfo.ncc.go.jp/jjco/) in Japanese as well as English.
During the past three years, 76% of all original articles published in the Japanese Journal of Clinical Oncology have employed some form of statistical analysis. However, despite the increasing importance of statistics in medical presentation, only a few guidelines are available for physicians (1-3). The aim of this guideline is to provide authors with practical information helpful for presenting statistical evaluations of their clinical findings.
Textbooks and journal articles describing statistical methods in more detail are listed in the last part of this guideline. However, it should always be borne in mind that the best way to design and accomplish appropriate statistical analyses is to consult an experienced biostatistician from the early stage of the study.
BASIC REQUIREMENTS FOR STUDY REPORTS
Submitted papers must include the following descriptions.
TYPE OF STUDY
The various types of clinical research are classified as shown in Table 1. This information should also be included in the abstract and preferably in the title.
ENDPOINT DEFINITION
The term `endpoint' denotes both the outcome observed for individual subjects (individual endpoint) and the measure of outcome used to evaluate the results of the study (study endpoint). It should be made clear which study endpoint represents the results and which individual endpoint is used for calculating the study endpoint. There are many possible pairs of individual and study endpoints, and the most commonly used ones are shown in Table 2.
1) When methods with more than one option, such as t test (paired or unpaired), analysis of variance, or rank correlation are used, specify which option was selected.
2) The distinction between a one-sided and two-sided test should be made.
3) For paired data, the paired t test should be used instead of the two-sample t test. Similarly, the Mann-Whitney U test and Wilcoxon test for unpaired data should be replaced by the paired (signed) Wilcoxon test, and the simple χ² test for 2 × 2 tables should be replaced by the McNemar test.
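As an illustration of the paired alternative for 2 × 2 tables, the following is a minimal sketch of the McNemar test. It assumes the uncorrected form of the statistic (no continuity correction) and a χ² reference distribution with one degree of freedom; the function name and the counts are hypothetical and are not taken from this guideline.

```python
import math

def mcnemar_test(b: int, c: int):
    """McNemar test for paired binary data (sketch, no continuity correction).

    b and c are the two discordant cell counts of the paired 2 x 2 table,
    e.g. pairs positive by method A only and pairs positive by method B only.
    """
    if b + c == 0:
        raise ValueError("no discordant pairs: the test is undefined")
    statistic = (b - c) ** 2 / (b + c)
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Hypothetical discordant counts from a paired comparison.
statistic, p_value = mcnemar_test(15, 5)
print(f"chi-square = {statistic:.2f}, P = {p_value:.3f}")
```

Only the discordant pairs enter the calculation, which is why the simple χ² test for unpaired data is not appropriate here. For small discordant counts an exact binomial version is usually preferred.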
STATISTICAL ANALYSIS SOFTWARE USED
When a software package for statistical analysis, such as SAS (Statistical Analysis System) or SPSS (Statistical Package for the Social Sciences) is used, the name and version should be given. Even in such cases, however, the specific statistical methods must still be identified.
Table 1.
Table 2.
One or a few major study endpoints chosen for assessing the primary objective(s) are described as `primary endpoint(s)' and the remaining study endpoints are `secondary endpoints'. Definitions, methods of evaluation and the timing/interval used for assessing each endpoint should be described in the Methods section.
METHODS FOR DATA COLLECTION
The methods used for collecting data, for example, a survey of original medical records, usage of a specific data collection form (case report/record form) or access to a hospital information system or database should be described.
CRITERIA FOR SUBJECT INCLUSION AND EXCLUSION
Selection criteria consist of both inclusion criteria and exclusion criteria. Inclusion criteria are important for indicating the external validity (generalizability), i.e. the extent to which the results can be applied or extrapolated, and exclusion criteria are applied so as not to include subjects/patients who might make evaluation of the study endpoints difficult or for whom participation in the study might not be safe.
METHODS OF ALLOCATION
In an intervention trial, the treatment allocation method must be described in detail, i.e. whether it is randomized or non-randomized, cross-over or sequential, and what is allocated and how.
METHODS USED TO AVOID BIAS OR TO ASSURE STUDY VALIDITY
The methods used to avoid bias or to assure the validity of the study, such as randomization, blinding (masking) or matching, should be fully explained. If evaluation is made under masking, `who was masked to what' should be clearly described. When there is a control group, how and why they were selected should be described.
STATISTICAL METHODS
All statistical methods used in a study should, in principle, be identified with a reference. However, the following very common statistical methods can be used without a reference and may appear only in the Results section.
References for the study design and statistical methods should be standard works (e.g. edited books with relevant pages stated, or review articles) as far as possible, rather than original reports of designs or statistical methods.
Notes
DESCRIPTION OF SUBJECTS
Calculation of any summary statistics leads to loss of much information from the original data. It is necessary to present not only summary statistics and testing results, but also the individual data as effectively as possible.
Required individual data are as follows:
1) Number of subjects studied
Although numerators of concerned estimates (proportions, percentages, rates) are always presented, denominators are not always clearly stated. For each calculation of estimates (such as the response rate), denominators, e.g. all entry patients, all eligible patients or patients treated with an entire course of the regimen, should be clearly stated. Any estimate without a presented denominator is unacceptable.
2) Exclusion during analysis
Any exclusion of individuals during analysis leads to loss of comparability and generalizability. The number and reason for exclusion of individuals must be stated, e.g. ineligible patients after registration, deviation from the intended study design, withdrawal from the planned treatment, loss to follow-up etc.
3) Details of treatment complications (adverse events)
It is always necessary to report carefully not only the efficacy but also the safety of a treatment. Safety parameters are the frequency and severity of adverse events, e.g. adverse drug reaction during chemotherapy or complications associated with radiotherapy and surgery. Evaluation of treatment or diagnostic methods must be based on a balance between efficacy and safety.
EXACT P VALUES
Exact P values, such as P = 0.23, are preferable to the term `N.S.' or `not significant'.
P values given in tables or figures need not be repeated in the text. P values are preferably quoted to three decimal places, e.g. P = 0.025 or P = 0.003, but when P > 0.1, it is sufficient to keep to two decimal places, e.g. P = 0.25. Small P values can be expressed as P < 0.001 or P < 0.0001. Both `p' and `P' are allowed, although the international standard is P (capital italic).
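The reporting rule above can be sketched as a small formatting helper. This is only an illustration: the function name is hypothetical, and the choice of P < 0.001 as the single lower bound is an assumption (the text equally allows P < 0.0001).

```python
def format_p(p: float) -> str:
    """Format a P value for reporting (sketch of the rule described above)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("a P value must lie between 0 and 1")
    if p < 0.001:
        # Assumed lower bound; the guideline also permits 'P < 0.0001'.
        return "P < 0.001"
    if p > 0.1:
        # Two decimal places are sufficient for large P values.
        return f"P = {p:.2f}"
    # Otherwise quote three decimal places.
    return f"P = {p:.3f}"

for value in (0.25, 0.025, 0.003, 0.0004):
    print(format_p(value))
```

The helper is meant for presentation only; unrounded values should be carried through the analysis itself.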
POINT ESTIMATES AND CONFIDENCE INTERVALS
Showing confidence intervals is always preferable to reporting P values only. Mean differences in continuous variables, proportions in categorical variables and relative risks including odds ratios and hazard ratios should preferably be accompanied by their confidence intervals.
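As a hedged illustration of reporting a point estimate with its confidence interval, the sketch below computes a 95% interval for the difference between two proportions. It assumes the simple normal (Wald) approximation, which is only reasonable for fairly large samples; the function name and the response counts are hypothetical.

```python
import math

def diff_proportions_ci(x1: int, n1: int, x2: int, n2: int, z: float = 1.96):
    """Difference between two proportions with a normal-approximation CI.

    x1/n1 and x2/n2 are the numerators and denominators of the two groups;
    z = 1.96 corresponds to a two-sided 95% interval.
    """
    p1, p2 = x1 / n1, x2 / n2
    diff = p1 - p2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return diff, diff - z * se, diff + z * se

# Hypothetical response counts: 30/50 in one arm and 20/50 in the other.
diff, lower, upper = diff_proportions_ci(30, 50, 20, 50)
print(f"difference = {diff:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

Reporting the interval shows both the size of the estimated difference and its precision, which a P value alone does not convey.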
Notes
1) It is desirable to report the observed values of the test statistics. For example, mean and SD of variables, t statistic and degrees of freedom as well as P value are preferably reported when the t test is used.
2) Means should not normally be given to more than one decimal place exceeding that of the raw data, but standard deviations or standard errors may need to be quoted to one extra decimal place. It is rarely necessary to quote percentages to more than one decimal place, e.g. 56.47%, and even one decimal place, e.g. 38.6%, is often not needed, especially when the sample size is small. For small samples, the use of percentages should be avoided, e.g. 1/1 = 100%, 2/2 = 100%. It is sufficient to quote values of summary statistics, such as t, χ² and r, to two decimal places. Note that these remarks apply only to presentation of results. Rounding up and down should not be done before or during analysis.
This section describes common pitfalls frequently encountered in the medical literature. Authors must be careful to avoid such misinterpretation of their results.
`NOT SIGNIFICANT' IS NOT `EQUIVALENT'
The most common misunderstanding of statistical principles seen in the medical literature is that of statistical significance; `statistically non-significant results' are regarded as the evidence of `equivalence'.
A significance test simply assesses the plausibility of the observed data when a `null hypothesis' (e.g. there being no difference between groups) is true. The P value is the probability that the observed data, or a more extreme outcome, would have occurred by chance if the null hypothesis were true. `Statistical significance' can lead to two interpretations: one is that the null hypothesis is true and that a rare phenomenon was observed by chance; the other is that the null hypothesis is false. Based on the latter interpretation, that is, when P is sufficiently small (P < 0.05 or P < 0.01), one rejects the null hypothesis and accepts the alternative one (i.e. `the expected difference between two groups exists').
If P is not sufficiently small (P > 0.05), the null hypothesis cannot be rejected. Such results do not mean that there is no difference, but that it cannot be determined whether or not there is an expected difference.
If a non-significant result indeed meant `no difference', then equivalence could be claimed by investigating as small a population as possible. This is logically absurd.
P VALUE AND SIGNIFICANCE LEVEL
Although the 0.05 level is a common cut-off point for statistical significance (type I error level: α), there is no reason to interpret a result with P = 0.04 differently from one with P = 0.06. It is not always necessary to stick to 0.05 for interpretation of results, and it is better to show the exact P value and the confidence intervals of an estimated parameter.
STATISTICAL SIGNIFICANCE AND CLINICAL IMPORTANCE
If a study sample is large enough, even small differences can become statistically significant. Hypothetically, a randomized clinical trial with more than 1500 enrolled patients can detect a difference of 5% or less in the 5-year surviving proportion. Even if such a difference is `statistically significant', it may not be clinically important for individual patients. On the other hand, if a study sample is small, quite a large difference which fails to reach statistical significance cannot be ignored if it is considered to be possibly clinically important. Authors should distinguish `statistical significance' from `clinical importance' and conduct clinical studies to assess clinically important outcomes.
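The dependence of statistical significance on sample size can be illustrated with a short sketch. It applies a two-sided test for two proportions, using the normal approximation with a pooled variance (an assumption made for simplicity), to the same observed difference of 55% versus 50% in a small and a large hypothetical trial. All counts are invented for illustration.

```python
import math

def two_proportion_p(x1: int, n1: int, x2: int, n2: int) -> float:
    """Two-sided P value for a difference between two proportions.

    Uses the normal approximation with a pooled standard error.
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided tail probability of the standard normal distribution.
    return math.erfc(abs(z) / math.sqrt(2))

# Same observed proportions (55% vs 50%), different sample sizes.
p_small = two_proportion_p(55, 100, 50, 100)        # 100 patients per arm
p_large = two_proportion_p(1100, 2000, 1000, 2000)  # 2000 patients per arm
print(f"small trial: P = {p_small:.3f}")
print(f"large trial: P = {p_large:.4f}")
```

The observed difference is identical in both cases; only the amount of data changes. The small trial does not reach the 0.05 level and the large one does, which says nothing by itself about whether a 5% difference matters clinically.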
MULTIPLICITY OF TESTING OR REPEATED MEASUREMENTS
The use of many significance tests in a single study, e.g. many subgroup analyses, the use of many testing methods, analysis of many variables, and serial measurements (time course) of the same variable can increase the probability of obtaining false-positive results. Care should therefore be exercised in the study design and in the interpretation of P values in such a situation.
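To see how quickly false-positive results accumulate, consider the simplified case, assumed here only for illustration, of n independent significance tests, each carried out at level α, when every null hypothesis is true. The probability of obtaining at least one `significant' result is then

```latex
\[
  P(\text{at least one false positive}) = 1 - (1 - \alpha)^{n}
\]
% With \alpha = 0.05:
%   n = 5  gives 1 - 0.95^{5}  \approx 0.23
%   n = 10 gives 1 - 0.95^{10} \approx 0.40
```

In practice, tests on subgroups or on serial measurements of the same variable are correlated, so this expression is only a rough guide to the size of the problem.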
We recommend the following three solutions to avoid multiplicity of testing.
1) Look for other simpler endpoints
The best approach for avoiding multiplicity of testing is to use one or only a few statistics instead of multiple ones wherever possible. For example, the time taken to reach a peak (Fig.1a) and the length of time above a given level (Fig.1b) are as useful as established pharmacological parameters such as the area under the curve (AUC) or the maximum drug concentration (Cmax).
2) Give a conservative interpretation and do not use the term `significant'.
Another approach is to indicate exact P values and to interpret each result conservatively without the term `significant' or `not significant'.
3) Controlling the level of significance
A simple ad hoc method is to use the Bonferroni correction. The idea behind this is that if one were conducting n significance tests, then to obtain an overall type I error rate of α, one would only declare any one of them `significant' if the P value were smaller than α/n; for example, the significance level is set at 0.01 for 5 subgroup analyses.
There are specific statistical techniques to avoid multiplicity of testing with correlation, and several refined Bonferroni-type adjustments are proposed. However, authors are advised to consult a biostatistician for such advanced methods. Appropriate selection of the method depends on the construction and structure of the hypothesis (objectives of the study) and often requires the advice of an experienced biostatistician.
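The simple Bonferroni rule described above can be sketched as follows. The function name and the five subgroup P values are hypothetical; the refined adjustments mentioned in the text are not implemented here.

```python
def bonferroni(p_values, alpha: float = 0.05):
    """Apply the simple Bonferroni correction to a list of P values.

    Returns the per-test threshold alpha/n and, for each P value,
    whether it falls below that threshold.
    """
    n = len(p_values)
    if n == 0:
        raise ValueError("at least one P value is required")
    threshold = alpha / n
    return threshold, [p < threshold for p in p_values]

# Hypothetical P values from 5 subgroup analyses.
threshold, flags = bonferroni([0.004, 0.03, 0.2, 0.012, 0.6])
print(f"per-test significance level: {threshold:.3f}")
print(flags)
```

With five tests and an overall level of 0.05, the per-test level is 0.01, so P = 0.03 and P = 0.012 would no longer be declared significant.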
CONFIRMATORY OR EXPLORATORY?
There is an important distinction to be made between confirmatory and exploratory data analysis.
Only confirmatory data analysis can give conclusive results. This should be performed following a prespecified hypothesis and analysis plan. On the other hand, exploratory data analysis should be used only to find new hypotheses which can be investigated in further studies. Most observational studies are classified as exploratory.
Many medical researchers misunderstand or confuse the difference between these two major types of analysis. The significant results of testing are entirely different according to whether they are achieved by the main hypothesis set before analysis or by a subsidiary hypothesis, such as ad hoc subgroup analysis in clinical trials. Results of subgroup analysis, when the subgroups are discovered during data analysis, should always be treated with caution until confirmed by other studies.
The weakness of exploratory data analysis is due mainly to the problem of multiplicity of testing, as mentioned above. The conclusion of any exploratory analysis should be whether or not `further evaluation is needed and worthwhile'.
To reduce the weakness of exploratory analysis, a cross-validation procedure is useful. When the data set is large, it is divided into two data sets, one being used for exploratory analysis and the other for validation of the generated hypothesis. Even when the data set is not large, it is recommended to check the consistency of the results by studying two or more randomly divided data sets.
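A random division of a data set into an exploratory half and a validation half might be sketched as below. The function name, the fixed seed and the patient identifiers are assumptions for illustration; a real study would fix the splitting rule in the analysis plan before looking at the data.

```python
import random

def split_in_two(records, seed: int = 0):
    """Randomly divide records into an exploratory set and a validation set.

    A fixed seed makes the division reproducible.
    """
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical patient identifiers.
exploratory, validation = split_in_two(range(1, 11))
print("exploratory:", sorted(exploratory))
print("validation: ", sorted(validation))
```

Hypotheses are generated on the first set only and then checked once on the second, so the validation data play no part in choosing what to test.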
Table 3.
A statistical association does not in itself provide direct evidence of a `causal relationship' among the variables concerned. In much of the medical literature, statistical association is often confused with causal relationship in the interpretation of the results. Direct causal relationships, e.g. the relationship between vitamin C deficiency and scurvy, can rarely be proved without an intervention trial (randomized trial). However, conducting intervention trials is not easy in some clinical situations and in most epidemiological fields. Causality can be established only on non-statistical grounds in observational studies. Campbell (4) proposed several points (Table 3) which strengthen the argument that the relationship is causal.
Some methods, such as the t test, depend on an assumption that the data show an approximately normal (Gaussian) distribution and that two groups have almost the same variability represented by standard deviation. If these assumptions do not hold, validity (α error) and power (1 − β error) are not guaranteed. When data show a skewed (asymmetrical) distribution or the variability is considerably different across groups, some transformation before testing or the usage of alternative `distribution free (non-parametric)' methods is recommended.
Many biomedical variables are positively skewed (often right-sided skew); examples are tumor size and serum levels of tumor markers. Empirically, logarithmic (log) transformation is often effective for obtaining an approximately normal distribution (Fig.2). After analysis, it is desirable to convert the results back to the original scale for reporting when transformation is used.
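Analysis on the log scale followed by conversion back to the original scale can be sketched as follows. The back-transformed mean of the logarithms is the geometric mean. The interval uses a normal approximation with z = 1.96, which is an assumption made for brevity (a t quantile would be more appropriate for small samples), and the marker values are hypothetical.

```python
import math
import statistics

def geometric_mean_ci(values, z: float = 1.96):
    """Geometric mean with an approximate 95% CI, computed on the log scale.

    All values must be positive. Results are returned on the original scale.
    """
    logs = [math.log(v) for v in values]
    mean_log = statistics.mean(logs)
    se_log = statistics.stdev(logs) / math.sqrt(len(logs))
    return (
        math.exp(mean_log),
        math.exp(mean_log - z * se_log),
        math.exp(mean_log + z * se_log),
    )

# Hypothetical, positively skewed tumor marker levels.
marker = [2.1, 3.4, 5.0, 8.2, 40.0]
gm, lower, upper = geometric_mean_ci(marker)
print(f"geometric mean = {gm:.1f} (approx. 95% CI {lower:.1f} to {upper:.1f})")
print(f"arithmetic mean = {statistics.mean(marker):.1f}")
```

For skewed data like these, the arithmetic mean is pulled upward by the single large value, while the geometric mean stays closer to the bulk of the observations.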
SUMMARY STATISTICS AND DISTRIBUTION
Only when the distribution is symmetric can the mean and standard deviation be used as summary statistics. Otherwise, it is recommended that the median, percentiles (e.g. the interquartile range, from the 25th to the 75th percentile) or a range covering some (95% or 90%) proportion of the data be used.
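A minimal sketch of these summary statistics for skewed data is given below, using Python's standard `statistics` module. Percentile definitions differ slightly between software packages; this example relies on the module's default method, and the function name and data are hypothetical.

```python
import statistics

def summarize_skewed(values):
    """Median and quartiles as summary statistics for skewed data."""
    # quantiles(..., n=4) returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "median": statistics.median(values),
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
    }

# Hypothetical tumor sizes (mm) with one large value.
summary = summarize_skewed([8, 10, 11, 12, 14, 15, 18, 22, 95])
print(summary)
print("mean:", statistics.mean([8, 10, 11, 12, 14, 15, 18, 22, 95]))
```

The median and quartiles are barely affected by the single large observation, whereas the mean and standard deviation would be.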
Outliers
`Outliers' are observations that are highly inconsistent with the main body of the data. Checking outliers before analysis is extremely important because outliers sometimes represent an invalid data entry, such as errors in writing medical records as well as data forms and keyboard entries. Any outliers remaining after checking for mis-entry should be carefully handled at all stages of the data analysis.
Outliers may influence estimates and testing methods, especially those based on a parametric approach, such as mean, t test, correlation and regression analysis. No outlier should be omitted from analysis purely on the basis of some `outlier' test. For highly implausible outliers, results both with and without outliers may be reported.
Categorical variables are nominal or ordered. The former include sex and blood type, and the latter include stage I, II, III, IV and CR, PR, NC, PD. For ordered categorical data, calculation of the means and standard deviation is inappropriate; instead, proportions should be reported.
Methods section
1. Observational study
1) Case series study
2) Cross-sectional study
3) Longitudinal study
a) Case-control study
b) Cohort study (prospective, retrospective, historical)
c) Nested case-control study
2. Intervention study (trial)
1) Controlled study
a) Parallel (randomized, non-randomized)
b) Sequential (self-controlled, cross-over)
c) Historical control
2) Uncontrolled study
Individual endpoint | Study endpoint
Death | Proportion surviving (`survival rate'); mortality
Duration of survival/time to death | Survival (distribution)
Recurrence | Proportion of recurrence (`recurrence rate')
Duration of relapse-free survival/time to recurrence | Relapse-free survival (distribution)
(Objective tumor) response | (Objective tumor) response rate
Toxicity | Toxicity spectrum
Dose-limiting toxicity | Maximum tolerated dose
Adverse event (complication) | Morbidity
Diagnosis of cancer | Prevalence, incidence
Test positive for diseased subjects | Sensitivity
Test negative for non-diseased subjects | Specificity
t test
simple chi-square(d) (χ²) test
Wilcoxon or Mann-Whitney U test
correlation and linear regression
All other statistical methods must be described and referred to in the Methods section.
Results section
COMMON PITFALLS IN STATISTICS
Statistical significance
1) Consistency
Other investigators and other studies of different populations led to similar conclusions.
2) Plausibility
The conclusions are biologically plausible.
3) Dose-response
A `dose-response' relation is found, such that a heavier exposure to a risk factor is associated with a greater risk of disease.
4) Temporality
The disease incidence increases or decreases following increasing or decreasing exposure.
5) Strength of the relationship
A large relative risk, odds ratio or hazard ratio may be more convincing than a small one.
Association and causal relationship
Distribution and underlying assumptions
Categorical variables (qualitative data)
Correlation and regression
Correlation analysis is a useful technique for dealing with the relationship between two continuous variables. However, it is the statistical method that is most misused in the medical literature. Authors are recommended to display a scatter diagram whenever they use `correlation', especially when significance of correlation coefficients or non-zero correlation is a major finding or the main point of discussion in the paper.
In regression analysis, it is desirable to present a fitted regression line together with a scatter diagram of the raw data.
Authors who intend to use correlation should pay attention to the following points:
1) Comparison of diagnostic methods
The correlation coefficient is inappropriate for comparing alternative methods of measuring the same variable, e.g. CEA determined by two methods, since it assesses association, not agreement. In such cases, authors should not discuss association or correlation, but whether or not the differences between values in each observation are acceptable. The `limits of agreement' should be judged from a clinical, not from a statistical, viewpoint. It is recommended that a scatter diagram of the differences (Fig. 3a) is shown as well as that of the compared variables. For calculation of calibration formulae, a specific method called functional relationship analysis (or structural relationship analysis) is recommended instead of regression analysis, because regression analysis ignores measurement error for one variable.
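The guideline does not define how the `limits of agreement' are calculated. A commonly used definition, assumed here, is the mean of the paired differences plus or minus 1.96 standard deviations of those differences. The sketch below follows that definition; the function name and the paired measurements are hypothetical.

```python
import statistics

def limits_of_agreement(method_a, method_b, z: float = 1.96):
    """Mean difference (bias) and limits of agreement for paired measurements.

    Assumes the usual definition: mean difference +/- z * SD of differences.
    """
    if len(method_a) != len(method_b):
        raise ValueError("both methods must measure the same subjects")
    diffs = [a - b for a, b in zip(method_a, method_b)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)
    return bias, bias - z * sd, bias + z * sd

# Hypothetical CEA values (ng/ml) measured by two methods in the same patients.
method_a = [2.0, 3.5, 5.1, 7.8, 12.4, 20.3]
method_b = [2.6, 4.0, 5.9, 8.1, 13.5, 21.0]
bias, lower, upper = limits_of_agreement(method_a, method_b)
print(f"bias = {bias:.2f}, limits of agreement {lower:.2f} to {upper:.2f}")
```

Two methods can be almost perfectly correlated and still differ systematically, as in this example where method B reads consistently higher. Whether the resulting limits are acceptable is a clinical judgment.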
Survival analysis
MEDIAN RATHER THAN MEAN
The calculation of `mean survival time' is inadvisable because survival data mostly include `censored' observations, and the distribution of survival time is not normal but positively skewed. The median provides a better summary statistic than the mean in survival analysis.
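To show how a median can be obtained in the presence of censored observations, the following sketch estimates the survival distribution by the Kaplan-Meier (product-limit) method, which this guideline does not name explicitly and is assumed here. The median is read off as the first time at which the estimated survival falls to 0.5 or below. Function names and follow-up data are hypothetical.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate of the survival distribution.

    times:  follow-up time for each subject
    events: True if the event (e.g. death) was observed, False if censored
    Returns a list of (time, survival probability) at each event time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all subjects whose follow-up ends at the same time t.
        while i < len(data) and data[i][0] == t:
            deaths += 1 if data[i][1] else 0
            leaving += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve

def median_survival(curve):
    """First time at which estimated survival is <= 0.5, or None if not reached."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None

# Hypothetical follow-up in months; False marks a censored observation.
times = [2, 3, 4, 5, 8, 9, 12, 15]
events = [True, True, False, True, True, False, True, False]
curve = kaplan_meier(times, events)
print(curve)
print("median survival:", median_survival(curve))
```

Censored subjects contribute to the number at risk until their last follow-up and are then removed without counting as events, which is why a simple mean of the observed times would be misleading.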
AVOID TESTING `SURVIVAL RATE'
Direct comparison of the proportions surviving (so-called `survival rate') at a certain time point, such as the 5-year survival rate, using the [chi]2 test is invalid when censored observations are included in the data. Instead, it is desirable to use testing methods that can deal with the entire survival distribution, such as the logrank test or generalized Wilcoxon test.
STARTING POINT AND `EVENT'
The starting point of the observation and the definition of `event' should be clearly mentioned. For example, `disease-free survival time was calculated from the date of randomization to the following event: 1) first documented evidence of cancer recurrence confirmed by biopsy, 2) second primary cancer, or 3) death from any cause without prior evidence of cancer recurrence or second primary cancer.'
COMPARISONS OF SURVIVAL BETWEEN `RESPONDERS' AND `NON-RESPONDERS'
Comparison of survival between responders and non-responders to a treatment such as chemotherapy or radiotherapy does not give conclusive results because of bias due to the time required to observe the `response' (5). This journal does not accept such misuse.
PROGNOSTIC FACTORS
`Prognostic factors' are those which have some prognostic contribution after adjustment for other factors. Factors affecting patient survival only in univariate analysis should not be regarded as `prognostic factors'. Multivariate techniques, some adequate form of stratified analysis or subgroup analysis are needed to investigate such prognostic contribution. However, multivariate techniques tend to give liberal (speculative) estimates because of multiplicity of testing, as mentioned above. In exploratory analysis for prognostic factors, authors should ensure they are conservative in interpreting the results.
Multivariate techniques
Recently, rapid developments of computer hardware and software have allowed easy use of powerful packaged software which supports multivariate analysis techniques. Multivariate techniques are useful methods for dealing with more than one outcome variable simultaneously, and adjustment for many confounding factors (covariates). However, most multivariate techniques have underlying assumptions and restrictions, such as the parametric model, semi-parametric model or proportionality of hazard rate. These tend to be ignored in much of the medical literature and the results are sometimes misinterpreted. Multivariate techniques really require expert help, and it is strongly recommended that authors consult a biostatistician before using them. Details of how to use multivariate techniques are beyond the scope of this guideline.
The most important issue in statistics is not how to present one's results or analysis, but how to design a study. A well designed study, poorly analyzed, can be rescued by re-analysis, whereas a poorly designed study is beyond redemption, even using sophisticated statistics (6). If you consult a biostatistician at your institute, do so at the beginning of your study, and not after completion of the data set.
We hope that this guideline will help readers who are starting to learn statistics and make it easier for authors and editors to discuss statistics from the same point of view.
RECOMMENDED TEXTBOOKS AND MORE DETAILED GUIDELINES
Information on this article and other related books is available on the World Wide Web (http://wwwinfo.ncc.go.jp/jjco/) in Japanese as well as English.
References
Copyright© Japanese Journal of Clinical Oncology, 1997.