Acquiring data in medical research: A research primer for low- and middle-income countries
Vicken Totten, Erin L Simon, Mohammad Jalili, Hendry R Sawe
Corresponding author. [email protected]
Received 2020 Feb 21; Revised 2020 Aug 8; Accepted 2020 Sep 5; Issue date 2020.
This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
Without data, no new knowledge is generated. There may be interesting speculation, new paradigms or theories, but without data gathered from the universe, as representative of the truth in the universe as possible, there will be no new knowledge. Therefore, it is important to become excellent at collecting, collating and correctly interpreting data. Pre-existing and new data sources, variables, and sampling methods are discussed. The importance of a detailed protocol and research manual is emphasized. Data collectors and data collection forms, both electronic and paper-based, are discussed. Subject privacy must be balanced against appropriate data retention.
Keywords: Data collection, Sampling, Variables, Data storage
African relevance
- To get good quality information you first need good quality data
- Data collection systematically and reproducibly gathers and measures variables to answer research questions
- Good data is the result of a well-thought-out study protocol
The International Federation for Emergency Medicine global health research primer
This paper forms part 9 of a series of ‘how to’ papers, commissioned by the International Federation for Emergency Medicine. It describes data sources, variables, sampling methods, data collection and the value of a clear data protocol. We have also included additional tips and pitfalls that are relevant to emergency medicine researchers.
Data collection is the process of systematically and reproducibly gathering and measuring variables in order to answer research questions, test hypotheses, or evaluate outcomes.
Data is not information. To get good quality information you first need good quality data; then you must curate, analyse and interpret it. Data is composed of variables. Data collection begins with determining which variables are required, followed by the selection of a sample from a certain population. After that, a data collection tool is used to collect the variables from the selected sample, which are then entered into a data spreadsheet or database. The analysis is done on the database.
Sometimes you gather data yourself. Sometimes you analyse data others collected for different purposes. Ideally, you collect a universal sample, that is, 100%. In real life, you get a limited sample. Preferably, it will be a truly random sample with enough power to answer your question. Unfortunately, you may have to settle for consecutive or convenience sampling. Ideally, your data collectors would be blinded to the outcome of interest, to prevent bias. However, real life is full of biases. Imperfect data may be better than no data; you can often get useful information from imperfect data. Remember the enemy of good is perfect.
Why is good data important?
Acquiring data is the most important step in a research study. The best design with bad data is useless, and bad design produces bad data. The most sophisticated analysis cannot be performed without data, and analysing bad data produces erroneous results. Analysis can never be better than the quality of the data on which it was run. Good data has integrity. Data integrity is paramount to learning "Truth in the Universe". Good data is as complete and as clean as you can reasonably make it. Clean data 'has integrity' when the variables capture as much relevant information as possible, and in the same way for each subject.
Some information is very hard to get. You may have to use proxy variables for what you really want to know. A proxy variable is a variable that is not in itself directly relevant, but that serves in place of an unobservable or immeasurable variable. For a variable to be a good proxy, it must have a close correlation, not necessarily linear, with the variable of interest. For example, a medication list might serve as a proxy for a specific illness.
Consequences of bad data include an inability to answer the research question; inability to replicate or validate the study; distorted findings and wasted resources; compromised knowledge and even harm to subjects.
Ensure data quality
Good data is a result of a well-thought-out study protocol, which is the written plan for the study. Good planning is the most cost-effective way to ensure data integrity. Good planning is documented by a thorough and detailed protocol, with a comprehensive procedures manual. Poorly written manuals risk incomplete or inconsistent collection of data, in other words, ‘bad data’. The manual should include rigorous, step-by-step instructions on how to administer tests or collect the data. It should cover the ‘who’ (the subject and the researcher); the ‘when’ (the timing), the ‘how’ (methods), and the ‘what’ (a complete listing of variables to be collected). There should also be an identified mechanism to document any changes in procedures that may evolve over the course of the investigation. The study design should be reproducible: so that the protocol can be followed by any other researcher. All data needs to be gathered in the same way. Test (trial-run) your manual before you start your study. If data is collected by several people, make sure there is a sufficient degree of inter-rater reliability.
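When several people collect data, inter-rater reliability can be quantified with Cohen's kappa, which corrects raw agreement for agreement expected by chance. A minimal sketch follows; the two raters' wound classifications are hypothetical.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning categorical labels."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: proportion of subjects both raters label the same.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if the raters labelled independently at their own rates.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical example: two data collectors classifying 10 wounds.
a = ["minor", "minor", "major", "major", "minor", "minor", "major", "minor", "minor", "major"]
b = ["minor", "major", "major", "major", "minor", "minor", "major", "minor", "minor", "minor"]
print(round(cohens_kappa(a, b), 2))
```

A kappa near 1 indicates near-perfect agreement; values well below that suggest the manual needs clearer instructions or the collectors need retraining.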
To get good data, your sample needs to be representative of the population. For others to apply your results, you need to characterize your population, so others can decide if your conclusions are relevant to their population (see Sampling section, below).
Data integrity demands you supervise your study, making sure it is complete and accurate. You may wish to do interim analyses. Keep copies! Keep both the raw data and the data sheets, for the length of time required by law or by Good Research Practice in your country. This will protect you from accusations of falsification of data.
In real life, you may have to deal with any number of sampling and data collection biases. Some of these biases can be measured statistically. Regardless, all the limitations you can think of should be written in your limitations section. The best design you can practically use gives you the best data you can reasonably get. Remember, “you cannot fix with statistics what you fouled up by design.”
Before you acquire your first datum, consider: Do you have a developed protocol and a research manual? Have you sought Ethics Board approval? Do you have an informed consent? Do you have a plan to protect the subject's confidentiality? Do you have a plan for data analysis? Where will you safely store and protect the data? If you have collaborators, have you established, in writing, who owns the data, and who has the right to analyse and publish it?
Types of data: qualitative vs. quantitative data
Numerical data is generally called quantitative; if in words or sentences, it is qualitative. Medical research historically has focused on quantitative methods. Generally, quantitative research is cheaper, easier to gather and easier to analyse. For purposes of this chapter, we will focus on quantitative research.
Qualitative research is about words, sentences, sounds, feelings, emotions, colours and other elements that are non-quantifiable. It requires human intellect to extract themes from the sentences, evaluate the fit of the data to the themes, and draw the implications of the themes. Primary sources for qualitative data include open-ended surveys, interviews, and public meetings. Qualitative research is more common in politics and the social sciences, and will not be further discussed here, except to refer you to other sources.
Quantitative research can include questionnaires with closed-ended questions (open-ended questions belong in qualitative research). The data is transformed into numbers and analysed with parametric and non-parametric statistical tests. In general, you will derive a mean, mode and median; you will calculate probabilities and compute correlations and regressions in order to draw conclusions.
Sources of data: primary vs secondary data
To answer a research question, there are many potential sources of data. Two main categories are primary data and secondary data. Primary data is newly collected data; it can be gathered directly from people's responses (surveys), or from their biometrics (blood pressure, weight, blood tests, etc.). It is still considered primary data if you gather data that was collected for other (medical) purposes by extracting the data from medical records. Medical records can be a rich source of data, but data extraction by hand takes a lot of time.
Secondary data already exists; it has already been published or compiled. There are extant local, regional, national and international databases such as Trauma Registries, Disease-specific Registries, Public Health Data, government statistics, and World Health Organization data. Locally, your hospital or clinic may already keep statistics on any number of topics. Combining information from disparate databases may sometimes yield interesting results. For example, in the US, the Centers for Disease Control and Prevention keeps databases of reportable diseases, accidents, causes of death and much more. The US Geological Survey reports the average elevation of American cities. Combining the two databases revealed that, even when gun ownership, drug and alcohol use were statistically controlled for, there was a linear correlation between altitude and suicide rates [2]. Reno et al. reviewed the existing medical literature (also secondary data), confirmed the correlation, and concluded that the mechanisms have yet to be elucidated [3].
Sampling

Collecting good data is often the hardest part of research. Ideally, you would want to collect 100% of the data (universal sampling, to reflect the target population); one example would be 'all elderly persons with gout'. In real life, you have access to only a subset of the target population (the accessible population), and your study will be further limited to a subset of the accessible population (the study population). Again, in the ideal world, that limited sample would be truly random and have enough power to answer your question. You can find free random number generators online. In real life, you may have to settle for consecutive or convenience sampling; of the two, consecutive sampling has less bias. Sometimes it is important to balance your groups. You may have two or three treatments (or interventions) and want an equal number of each kind. So, you create blocks of a few times the number of treatments and randomize within each block. Each time a block is filled, you are assured of the right balance of subjects. Blocks are often in groups of six, eight or 12. This is called balanced allocation.
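A minimal sketch of this balanced (block) allocation scheme; the three treatment labels and the block size of six (twice the number of treatments) are hypothetical choices.

```python
import random

def block_randomize(treatments, n_blocks, seed=None):
    """Balanced allocation: shuffle within blocks so that every
    completed block contains each treatment equally often."""
    rng = random.Random(seed)  # seed makes the allocation reproducible
    block = treatments * 2     # block size = 2 x number of treatments (here 6)
    allocation = []
    for _ in range(n_blocks):
        shuffled = block[:]
        rng.shuffle(shuffled)
        allocation.extend(shuffled)
    return allocation

# Hypothetical example: three treatments, four blocks of six subjects each.
order = block_randomize(["A", "B", "C"], n_blocks=4, seed=42)
print(order[:6])  # first completed block: two of each treatment, random order
```

Whenever enrolment stops at a block boundary, the groups are exactly balanced; within a block, the next assignment is still unpredictable.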
If you must get only a convenience sample – for example because you only have a single data gatherer and can get data only when that person is available – you should, at a minimum, try to get some simple demographics from times when the data gatherer is not available, to see if subjects at that other time are systematically different. For example, if you are looking at injuries, people who are injured when drinking on a Friday night might be systematically different from people who are injured on their way to work on a Monday morning. If you can only collect injury data in the morning, your results will be biased.
Variables

Variables are the bits of data you collect. They change from subject to subject and describe the subject numerically. Age (or year of birth); gender; ethnic group or tribe; and geographic location are commonly called simple demographic variables and should be collected and reported for most populations.
Continuous variables are quantified on a continuous scale, such as body weight. Discrete variables use a scale whose units are limited to integers (such as the number of cigarettes smoked per day). Discrete variables with many possible values can resemble continuous variables in statistical analysis and be equivalent for the purpose of designing measurements. A good general rule is to prefer continuous variables, because they contain more information and improve statistical efficiency (more study power and a smaller sample size).
Categorical variables are those not suitable for quantification. They are often measured by classifying them into categories. If there are two possible values (dead or alive), they are dichotomous. If there are more than two categories, they can be classified according to the type of information they provide (polytomous).
Research variables are either predictor (independent) or outcome (dependent) variables. The predictor variables might include such things as “Diabetes, Yes/No”, “Age over 65 — Yes/No”, and “diagnosis of hypertension” (again, Yes/No). The respective outcome might be “lower limb amputation” or “death within 10 years”. Your question might have been, “How much additional risk of amputation does a diagnosis of hypertension add in a person with diabetes?”
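As a worked illustration of the predictor/outcome question above, an odds ratio can be computed from a 2x2 table of predictor against outcome. The counts below are hypothetical, invented only to show the arithmetic.

```python
def odds_ratio(exposed_event, exposed_no_event, unexposed_event, unexposed_no_event):
    """Odds ratio from a 2x2 table of predictor (exposure) vs outcome."""
    odds_exposed = exposed_event / exposed_no_event
    odds_unexposed = unexposed_event / unexposed_no_event
    return odds_exposed / odds_unexposed

# Hypothetical 2x2 table for diabetics, outcome = lower limb amputation:
#                  amputation   no amputation
# hypertension         30            170
# no hypertension      20            280
print(round(odds_ratio(30, 170, 20, 280), 2))  # odds ratio of about 2.47
```

An odds ratio above 1 would suggest that, in this invented cohort, a hypertension diagnosis is associated with higher odds of amputation; a real analysis would also report a confidence interval.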
Before analysis, variables are coded into numbers and entered into a database. Your research manual should describe how to code all the data. The easiest variables for computers to analyse are binary, in other words "0" or "1": Yes/No, True/False, Male/Female, 65 or over / under 65, and so on. When variables are binary (male/female; alive/dead), coding them as "0" and "1" makes analysing the data much easier ("1" versus "2" makes it harder). The next easiest are ordinal integers: 1, 2, 3, etc. You might create ordinal numbers from categories (0–9; 10–19; 20–29 years of age, etc.), but to be ordinal they require an obvious sequence. Categorical variables do not have an intrinsic order: "Green", "Brown" and "Orange" are non-ordinal, categorical variables. It is possible to transform categorical variables into binary variables by making columns where only one of the answers is marked "1" (if that variable is present) and all the others are marked "0". The form of the variables and their distribution will determine the type of statistical analysis possible. Data which must be transformed or cleaned is more prone to error in the cleaning or transformation process.
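The 0/1 coding and the categorical-to-binary-columns transformation described above can be sketched as follows; the variable names and the subject record are hypothetical.

```python
def code_binary(value, one_label):
    """Code a binary variable as 1 if it matches the chosen label, else 0."""
    return 1 if value == one_label else 0

def one_hot(value, categories):
    """Turn one categorical value into a set of 0/1 indicator columns,
    exactly one of which is marked 1."""
    return {f"is_{c}": int(value == c) for c in categories}

# Hypothetical subject record coded for the database.
subject = {"sex": "female", "eye_colour": "green"}
coded = {"sex": code_binary(subject["sex"], "female")}
coded.update(one_hot(subject["eye_colour"], ["green", "brown", "orange"]))
print(coded)  # {'sex': 1, 'is_green': 1, 'is_brown': 0, 'is_orange': 0}
```

Writing these coding rules into the research manual means every coder produces identical columns, so the statistician never has to guess what "1" meant.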
There are alternative ways to get similar information. For example, if you wanted to know the HIV status of each of your subjects, you could either test each one or ask them. The tests cost more; however, they are less likely to give biased results. How you gather each variable will depend on your resources and will inform the limitations of your study.
Precision of a variable is the degree to which it is reproducible with nearly the same value each time it is measured. Precision has a very important influence on the power of a study. The more precise a measurement, the greater the statistical power of a given sample size to estimate mean values and test your hypotheses. In order to minimize random error in your data, and increase the precision of measurements, you should standardize your measurement methods; train your observers; refine any instruments you may use (such as calibrating instruments); automate instruments when possible (automated blood pressure cuff instead of manual); and repeat your measurements.
Accuracy of the variable is the degree to which it actually represents what it is intended to (Truth in the Universe). This influences the validity of the study. Accuracy is impacted by systematic error (bias). The greater the error, the less accurate the variable. Three common biases are: observer bias (how the measurement is reported); instrument bias (faulty function of an instrument); and subject bias (bad reporting or recall of the measurement by the study subject).
Validity is the degree to which a measurement represents the phenomenon of interest. When validating an abstract concept, search the literature or consult with experts so you can find an already validated data collection instrument (such as a questionnaire). This allows your results to be comparable to prior studies in the same area and strengthens your study methods.
Research manual
Simple research with limited resources does not need a research manual, just a protocol. Nor is there much need if the primary investigator is the only data gatherer and analyser. However, if several persons gather data, it is important that the data be gathered the same way each time.
Prevention is the most cost-effective activity that will ensure the integrity of data collection. A detailed and comprehensive research manual will standardize data collection. Poorly written manuals are vague and ambiguous.
The research manual is based on your protocol. The manual should spell out every step of the data collection process. It should include the name of each variable and specific details about how each variable should be collected. Contingencies should be written out. For example: “If the patient does not have a left arm, the blood pressure may be taken on the right arm. If the patient has no arms, leg blood pressures may be recorded, but put an ‘*’ beside the reading.” The manual should also include every step of the coding process. The coding manual should describe the name of each variable and how it should be coded. Both the coder and the statistician will want to refer to that section. The coding section should describe how each variable will be entered into the database. Test the manual to make sure everyone understands it the same way.
Think about various ways a plan can go wrong. Write them down, with preferred solutions. There will always be unexpected changes. They should be added into the manual on a continuing basis. An on-going section where questions, problems and their solutions are all recorded will increase the integrity of your research.
Data collection methods
Before you start data collection, you need to ask yourself what data you are going to collect and how you are going to collect them. Which data, and the amount of data to be collected needs to be defined clearly. Different people (including several data collectors) should have a similar understanding of each variable and how it is measured. Otherwise, the data cannot be relied on. Furthermore, the decision to collect a piece of data needs to be justified. The amount of data collected for the study should be sufficient. A common mistake is to collect too much data without actually knowing what will be done with it. Researchers should identify essential data elements and eliminate those that may seem interesting but are not central to the study hypothesis. Collection of the latter type of data places an unnecessary burden on both the study participants and data collectors.
Different data collection approaches which are commonly used in the conduct of clinical research include questionnaire surveys, patient self-reported data, proxy/informant information, hospital and ambulatory medical records, as well as the collection and analysis of biologic samples. Each of these methods has its own advantages and disadvantages.
Surveys are conducted through administration of standardized or home-grown questionnaires, where participants are asked to respond to a set of questions as yes/no, or perhaps on a Likert type scale. Sometimes open-ended responses are elicited.
Medical records can be important sources of high-quality data and may be used either as the only source of data, or as a complement to information collected through other instruments. Unfortunately, due to the non-standardized nature of data collection, information contained in the medical records may be conflicting or of questionable accuracy. Moreover, the extent of documentation by different providers can vary significantly. These issues can make the construction or use of key study variables very difficult.
Collection of biological materials from study participants, as well as the use of various imaging modalities, is increasingly common in clinical research. These procedures need to be performed under standardized conditions, and their ethical implications should be considered.
Data collection tool
You may need to collect information on paper. If you do, it is useful to have the actual code which should be entered into the computerized database written on the forms themselves (as well as in the manual). If you have access to an electronic database such as REDCap (a web-based application developed by Vanderbilt University to capture data for clinical research and create databases and projects) [4], you can enter the data directly as you get them (male; female) and the database will automatically convert the data into code. This reduces transcribing errors. Another common electronic tool is Excel, which can also be used to manipulate the data. In spite of the advantages of recording data electronically, such as directly into REDCap or Excel, there are advantages to collecting and keeping the original data on paper. Paper data collection forms can be saved for audit or quality control; paper records cannot be remotely hacked; and if the anonymous electronic database is compromised or corrupted, you can re-create it from the paper records.
Data collectors
Good data collectors are worth gold. If they are thorough and ethical, you will get great data. If not, your data may be unusable. Make sure they understand research ethics, the need for protection of human subjects, and the privacy of data. Ideally, your data collectors would be blinded to the outcome of interest, to prevent bias. It is acceptable to blind data collectors to the research question, but they need to understand that collecting every variable the same way for each subject is essential to data integrity.
Data gatherers should be trained in advance of collecting any data. They need to understand informed consent and have the time to explain a study to the satisfaction of the subjects. The importance of conducting a dry run in an attempt to anticipate and address issues that can arise during data collection cannot be over-stated. It would even be worthwhile to pilot the research manual, to learn if everyone understands it the same way.
Data storage
Data collection, done right, protects the confidentiality of the subject as well as the data. Data must also be stored safely and securely. It is reasonable to back up your data in a different, secure location. You do not want to go to all the trouble of creating a protocol and collecting your data, only to lose it or have no way to analyse it!
There are many reasons to keep your data safe and secure. Obviously, you do not want to lose your data. You may wish to use the data again. For example, you may wish to combine it with other data for a different study. An additional reason is that you do not want your subjects to risk a ‘loss of privacy’. Still another reason is that institutions and governments may require you to store data for a specified number of years. Know how long you must keep your data. Keep it in a locked cabinet in a secure room, or behind an institutional firewall.
Furthermore, if you keep a cipher, that is, a connector between a subject and their study number, keep that cipher separate from the research data. That way, even if someone learns that subject 302 has an embarrassing condition, they will not know who subject 302 really is.
These days, almost everyone has access to computers and programs, locally or ‘in the cloud’. For statistical analysis, you will need to have your data in electronic form. If you started with paper, consider double entry (two data extractors for each record, then compare the two) for greater accuracy.
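The double-entry check mentioned above can be sketched as follows: two independently keyed copies of the same records are compared field by field, and every disagreement is flagged so the original paper form can be consulted. The record layout is hypothetical.

```python
def double_entry_discrepancies(entry_one, entry_two):
    """Compare two independently entered copies of the same records;
    return (record index, field, value one, value two) for every mismatch."""
    mismatches = []
    for i, (rec1, rec2) in enumerate(zip(entry_one, entry_two)):
        for field in rec1:
            if rec1[field] != rec2[field]:
                mismatches.append((i, field, rec1[field], rec2[field]))
    return mismatches

# Hypothetical records keyed by two different data entry staff.
copy_a = [{"id": 301, "sbp": 120}, {"id": 302, "sbp": 135}]
copy_b = [{"id": 301, "sbp": 120}, {"id": 302, "sbp": 153}]  # transposed digits
print(double_entry_discrepancies(copy_a, copy_b))  # flags record 1, field 'sbp'
```

Transposed digits, like the 135/153 above, are exactly the kind of keying error that single entry silently absorbs and double entry catches.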
Tips on this topic and pitfalls to avoid
Hazard: no research manual
- No identified mechanism to document changes in procedures that may evolve over the course of the investigation
- Vague description of data collection instruments in lieu of rigorous step-by-step instructions on administering tests
- Only a partial listing of variables to be collected
- Forgetting to put instructions on the data collection sheet about how to code the data when transferring to an electronic medium

Hazard: no assistant training
- Failure to adequately train data collectors
- Failure to do a dry run / failure to try enrolling a mock subject
- Uncertainty about when, how and who should review gathered data

Hazard: failure to understand data management
- Data should be easy to understand, and the protocol good enough that another researcher can repeat the study
- Data audit: keep raw data and collected data
- Failure to keep backups
Annotated bibliography
RCR Data Acquisition and Management. This online book is pretty comprehensive. http://ccnmtl.columbia.edu/projects/rcr/rcr_data/foundation/ (Accessed 2019 June 23)
Qualitative research – Wikipedia: en.wikipedia.org/wiki/Qualitative_research (Accessed 2019 June 23) – this is a good overview with references so you can delve deeper if you wish.
Qualitative Research: Definition, Types, Methods and Examples: https://www.questionpro.com/blog/qualitative-research-methods/ (Accessed 2019 June 23) – this is a good overview with references so you can delve deeper if you wish.
Qualitative Research Methods: A Data Collector's Field Guide: https://course.ccs.neu.edu/is4800sp12/resources/qualmethods.pdf (Accessed 2019 June 23) – another on-line resource about data collection.
Additional reading about statistical variables
Types of Variables in Statistics and Research: A List of Common and Uncommon Types of Variables. https://www.statisticshowto.datasciencecentral.com/probability-and-statistics/types-of-variables/
Research Variables: Dependent, Independent, Control, Extraneous & Moderator. https://study.com/academy/lesson/research-variables-dependent-independent-control-extraneous-moderator.html
Knatterud GL, Rockhold FW, George SL, Barton FB, Davis CE, Fairweather WR, Honohan T, Mowery R, O'Neill R. (1998). Guidelines for quality assurance in multicenter trials: a position paper. Controlled Clinical Trials, 19:477–493.
Whitney CW, Lind BK, Wahl PW. (1998). Quality assurance and quality control in longitudinal studies. Epidemiologic Reviews, 20(1):71–80.
Additional relevant information to consider
Consider who owns the data before and after collection (this brings up questions of consent, privacy, sponsorship and data-sharing, most of which are beyond the scope of this paper).
Authors' contribution
Authors contributed as follows to the conception or design of the work; the acquisition, analysis, or interpretation of data for the work; and drafting the work or revising it critically for important intellectual content: ES contributed 70%; VT, MJ and HS contributed 10% each. All authors approved the version to be published and agreed to be accountable for all aspects of the work.
Declaration of competing interest
The authors declared no conflicts of interest.
- 1. Dudovskiy J. The ultimate guide to writing a dissertation in business studies: a step by step assistance. https://research-methodology.net/ E-book.
- 2. Brenner B, Cheng D, Clark S, Camargo CA Jr. Positive association between altitude and suicide in 2584 U.S. counties. High Alt Med Biol. 2011;12:31–35. doi: 10.1089/ham.2010.1058.
- 3. Reno E, Brown TL, Betz ME, Allen MH, Hoffecker L, Reitinger J. Suicide and high altitude: an integrative review. High Alt Med Biol. 2018;19(2):99–108. doi: 10.1089/ham.2016.0131.
- 4. Patridge EF, Bardyn TP. Research electronic data capture (REDCap). J Med Libr Assoc. 2018;106(1):142–144. doi: 10.5195/jmla.2018.319.
NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.
StatPearls [Internet]. Treasure Island (FL): StatPearls Publishing; 2024 Jan-.
Types of Variables and Commonly Used Statistical Designs
Jacob Shreffler; Martin R. Huecker.
Last Update: March 6, 2023.
- Definition/Introduction
Suitable statistical design represents a critical factor in permitting inferences from any research or scientific study. [1] Numerous statistical designs are implementable due to the advancement of software available for extensive data analysis. [1] Healthcare providers must possess some statistical knowledge to interpret new studies and provide up-to-date patient care. We present an overview of the types of variables and commonly used designs to facilitate this understanding. [2]
- Issues of Concern
Individuals who attempt to conduct research and choose an inappropriate design could select a faulty test and make flawed conclusions. This decision could lead to work being rejected for publication or (worse) lead to erroneous clinical decision-making, resulting in unsafe practice. [1] By understanding the types of variables and choosing tests that are appropriate to the data, individuals can draw appropriate conclusions and promote their work for an application. [3]
To determine which statistical design is appropriate for the data and research plan, one must first examine the scales of each measurement. [4] Multiple types of variables determine the appropriate design.
Ordinal data (also sometimes referred to as discrete) provide ranks and thus levels of degree between the measurements. [5] Likert items can serve as ordinal variables, but the Likert scale, the result of adding all the items, can be treated as a continuous variable. [6] For example, on a 20-item scale with each item ranging from 1 to 5, each item itself is an ordinal variable, whereas the sum of all items can range from 20 to 100. A general guideline for determining if a variable is ordinal vs. continuous: if the variable has more than ten options, it can be treated as a continuous variable. [7] The following examples are ordinal variables:
- Likert items
- Cancer stages
- Residency Year
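The Likert distinction above, that individual items are ordinal while the summed scale behaves like a continuous variable, can be sketched as follows; the 20 item responses are hypothetical.

```python
# Hypothetical responses: 20 Likert items, each scored 1-5 (ordinal).
items = [4, 5, 3, 4, 2, 5, 4, 3, 4, 5, 3, 4, 4, 2, 5, 4, 3, 4, 5, 4]

# Each item on its own is ordinal: ranked levels with no guarantee
# that the spacing between adjacent levels is equal.
assert all(1 <= x <= 5 for x in items)

# The summed scale ranges from 20 to 100; with that many possible values,
# it is commonly treated as continuous for analysis.
scale_score = sum(items)
print(scale_score)  # a single scale value between 20 and 100
```

This is why a t-test or regression is usually run on the scale score, while item-level comparisons call for rank-based (non-parametric) methods.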
Nominal, Categorical, Dichotomous, Binary
Other types of variables have interchangeable terms. Nominal and categorical variables describe samples in groups based on counts that fall within each category, have no quantitative relationships, and cannot be ranked. [8] Examples of these variables include:
- Service (i.e., emergency, internal medicine, psychiatry, etc.)
- Mode of Arrival (ambulance, helicopter, car)
A dichotomous or a binary variable is in the same family as nominal/categorical, but this type has only two options. Binary logistic regression, which will be discussed below, has two options for the outcome of interest/analysis. Often used as (yes/no), examples of dichotomous or binary variables would be:
- Alive (yes vs. no)
- Insurance (yes vs. no)
- Readmitted (yes vs. no)
With this overview of the types of variables provided, we present commonly used statistical designs for different scales of measurement. Importantly, before deciding on a statistical test, individuals should perform exploratory data analysis to ensure there are no issues with the data, and should consider type I and type II errors and power analysis. Furthermore, investigators should verify that the relevant statistical assumptions are met. [9] [10] For example, parametric tests, including some discussed below (t-tests, analysis of variance (ANOVA), correlation, and regression), require the data to follow a normal distribution and the variances within each group to be similar. [6] [11] After resolving any issues identified by exploratory data analysis and reducing the likelihood of committing type I and type II errors, a statistical test can be chosen. Below is a brief introduction to each of the commonly used statistical designs, with examples of each type. Table 1 applies each design to a single research focus to provide further examples.
Commonly Used Statistical Designs
Independent Samples T-test
An independent samples t-test allows a comparison of two groups of subjects on one (continuous) variable. Examples in biomedical research include comparing results of treatment vs. control group and comparing differences based on gender (male vs. female).
Example: Does adherence to the ketogenic diet (yes/no; two groups) have a differential effect on total sleep time (minutes; continuous)?
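As a hedged illustration of the example above, an independent-samples t-test can be run in Python with SciPy. All data here are simulated (the group means, standard deviations, and sample sizes are invented for demonstration):

```python
# Sketch: independent-samples t-test with SciPy on simulated total sleep
# time (minutes) for two hypothetical diet-adherence groups.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
keto = rng.normal(420, 30, size=40)      # adherent group, mean ~420 min
control = rng.normal(440, 30, size=40)   # non-adherent group, mean ~440 min

t_stat, p_value = stats.ttest_ind(keto, control)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```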
Paired T-test
A paired t-test analyzes one sample population, measuring the same variable on two different occasions; this is often useful for intervention and educational research.
Example: Does participating in a research curriculum (one group with intervention) improve resident performance on a test to measure research competence (continuous)?
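A minimal sketch of the paired design, using simulated pre/post scores for the same residents (the gain of five points is an assumption for illustration only):

```python
# Sketch: paired t-test with SciPy on simulated pre/post test scores for
# the same residents before and after a hypothetical research curriculum.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
pre = rng.normal(70, 8, size=25)               # baseline scores
post = pre + rng.normal(5, 4, size=25)         # simulated average gain of 5

t_stat, p_value = stats.ttest_rel(post, pre)   # pairs each subject with itself
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```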
One-Way Analysis of Variance (ANOVA)
Analysis of variance (ANOVA), an extension of the t-test, determines differences in a dependent variable amongst more than two groups of an independent variable. [11] ANOVA is preferable to conducting multiple t-tests because it reduces the likelihood of committing a type I error.
Example: Are there differences in length of stay in the hospital (continuous) based on the mode of arrival (car, ambulance, helicopter, three groups)?
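The three-group example above can be sketched with SciPy's one-way ANOVA; the lengths of stay below are simulated, with group means chosen arbitrarily for demonstration:

```python
# Sketch: one-way ANOVA with SciPy comparing simulated hospital length of
# stay (hours) across three hypothetical arrival modes.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
car = rng.normal(24, 6, size=30)
ambulance = rng.normal(30, 6, size=30)
helicopter = rng.normal(40, 6, size=30)

f_stat, p_value = stats.f_oneway(car, ambulance, helicopter)
print(f"F = {f_stat:.2f}, p = {p_value:.4g}")
```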
Repeated Measures ANOVA
Another procedure commonly used if the data for individuals are recurrent (repeatedly measured) is a repeated-measures ANOVA. [1] In these studies, multiple measurements of the dependent variable are collected from the study participants. [11] A within-subjects repeated measures ANOVA determines effects based on the treatment variable alone, whereas mixed ANOVAs allow both between-group effects and within-subjects to be considered.
Within-Subjects Example: How does ketamine affect mean arterial pressure (continuous variable) over time (repeated measurement)?
Mixed Example: Does mean arterial pressure (continuous) differ between males and females (two groups; mixed) on ketamine throughout a surgical procedure (over time; repeated measurement)?
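An illustrative sketch of the within-subjects case, computed by hand with NumPy since SciPy has no built-in repeated-measures ANOVA (dedicated packages such as statsmodels offer one). Rows are subjects, columns are repeated measurements; the subject and time effects below are invented, not real ketamine data:

```python
# Sketch: within-subjects (repeated-measures) one-way ANOVA by hand.
# Rows = subjects, columns = repeated measurements of a simulated
# mean arterial pressure over four time points.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_subj, n_time = 12, 4
subject_effect = rng.normal(0, 8, size=(n_subj, 1))     # between-subject shift
time_effect = np.array([0.0, 10.0, 6.0, 2.0])           # assumed drug response
data = 85 + subject_effect + time_effect + rng.normal(0, 3, size=(n_subj, n_time))

grand = data.mean()
ss_time = n_subj * ((data.mean(axis=0) - grand) ** 2).sum()   # condition effect
ss_subj = n_time * ((data.mean(axis=1) - grand) ** 2).sum()   # subject effect
ss_error = ((data - grand) ** 2).sum() - ss_time - ss_subj

df_time, df_error = n_time - 1, (n_subj - 1) * (n_time - 1)
f_stat = (ss_time / df_time) / (ss_error / df_error)
p_value = stats.f.sf(f_stat, df_time, df_error)
print(f"F({df_time}, {df_error}) = {f_stat:.2f}, p = {p_value:.4g}")
```

Partitioning out the subject effect is what distinguishes this from a between-groups ANOVA: variability due to stable subject differences is removed from the error term.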
Nonparametric Tests
Nonparametric tests, such as the Mann-Whitney U test (two groups; nonparametric t-test), the Kruskal-Wallis test (multiple groups; nonparametric ANOVA), and Spearman's rho (nonparametric correlation coefficient), can be used when data are ordinal or lack normality. [3] [5] Because they do not require normality, these tests can analyze skewed data and require fewer assumptions to be met. [11]
Example: Is there a relationship between insurance status (two groups) and cancer stage (ordinal)?
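A hedged sketch of that example with SciPy's Mann-Whitney U test; the stage distributions for the two groups are simulated and chosen only to illustrate how ranks, not means, drive the test:

```python
# Sketch: Mann-Whitney U test on simulated ordinal cancer stages (1-4)
# for two hypothetical insurance groups.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
insured = rng.choice([1, 2, 3, 4], size=50, p=[0.4, 0.3, 0.2, 0.1])
uninsured = rng.choice([1, 2, 3, 4], size=50, p=[0.1, 0.2, 0.3, 0.4])

u_stat, p_value = stats.mannwhitneyu(insured, uninsured)
print(f"U = {u_stat:.1f}, p = {p_value:.4g}")
```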
Chi-square
A chi-square test examines relationships between categorical variables by comparing the frequencies and proportions that fall within each category. [11] As with the other tests discussed, variants and extensions of the chi-square test (e.g., Fisher's exact test, McNemar's test) may be suitable depending on the variables. [8]
Example: Is there a relationship between individuals with methamphetamine in their system (yes vs. no; dichotomous) and gender (male or female; dichotomous)?
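That example reduces to a 2x2 contingency table; a sketch with SciPy follows, where the counts are invented for illustration:

```python
# Sketch: chi-square test of independence on an assumed 2x2 contingency
# table (illustrative counts, not real data).
import numpy as np
from scipy import stats

#                 meth positive  meth negative
table = np.array([[30, 70],     # male
                  [10, 90]])    # female

chi2, p_value, dof, expected = stats.chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, df = {dof}, p = {p_value:.4g}")
```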
Correlation
Correlations (used interchangeably with 'associations') signal patterns in data between variables. [1] A positive association occurs when values in one variable increase as values in another also increase. A negative association occurs when values in one variable decrease as values in another increase. A correlation coefficient, expressed as r, describes the strength of the relationship: a value of 0 means no relationship, and the relationship strengthens as r approaches 1 (positive relationship) or -1 (negative association). [5]
Example: Is there a relationship between age (continuous) and satisfaction with life survey scores (continuous)?
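A sketch of that example using Pearson's r in SciPy; the positive trend between age and satisfaction score is an assumption built into the simulated data:

```python
# Sketch: Pearson correlation between two simulated continuous variables
# (age and a hypothetical satisfaction-with-life score).
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
age = rng.uniform(20, 80, size=100)
satisfaction = 50 + 0.3 * age + rng.normal(0, 5, size=100)  # assumed trend

r, p_value = stats.pearsonr(age, satisfaction)
print(f"r = {r:.2f}, p = {p_value:.4g}")
```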
Linear Regression
Regression allows researchers to determine the degree of relationship between a dependent variable and one or more independent variables, and it results in an equation that can be used for prediction. [11] Regression methods can accommodate a large number of variables.
Example: Which admission to the hospital metrics (multiple continuous) best predict the total length of stay (minutes; continuous)?
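A minimal sketch of multiple linear regression via ordinary least squares in NumPy; the two admission metrics (triage score and age) and their coefficients are invented to generate the simulated lengths of stay:

```python
# Sketch: ordinary least squares with NumPy, predicting a simulated
# length of stay from two hypothetical admission metrics.
import numpy as np

rng = np.random.default_rng(6)
n = 200
triage = rng.uniform(1, 5, size=n)
age = rng.uniform(18, 90, size=n)
stay = 120 + 60 * triage + 2 * age + rng.normal(0, 30, size=n)  # minutes

X = np.column_stack([np.ones(n), triage, age])   # intercept + predictors
coef, *_ = np.linalg.lstsq(X, stay, rcond=None)  # solves min ||X @ coef - stay||
print("intercept, triage, age coefficients:", np.round(coef, 2))
```

The fitted coefficients recover the assumed equation, which could then be used to predict length of stay for new admissions.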
Binary Logistic Regression
This type of regression, which aims to predict an outcome, is appropriate when the dependent variable or outcome of interest is binary or dichotomous (yes/no; cured/not cured). [12]
Example: Which panel results (multiple of continuous, ordinal, categorical, dichotomous) best predict whether or not an individual will have a positive blood culture (dichotomous/binary)?
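An illustrative sketch of binary logistic regression, fit here by plain gradient descent in NumPy (a real analysis would use a statistics package); the "positive blood culture" outcome and its coefficients are simulated:

```python
# Sketch: binary logistic regression by gradient descent on a simulated
# dichotomous outcome (e.g., positive blood culture yes/no).
import numpy as np

rng = np.random.default_rng(7)
n = 500
X = np.column_stack([np.ones(n), rng.normal(0, 1, size=(n, 2))])
true_w = np.array([-0.5, 2.0, -1.5])               # assumed coefficients
prob = 1 / (1 + np.exp(-X @ true_w))               # logistic link
y = (rng.uniform(size=n) < prob).astype(float)     # dichotomous outcome

w = np.zeros(3)
for _ in range(2000):                              # gradient descent steps
    p = 1 / (1 + np.exp(-X @ w))
    w -= 0.1 * (X.T @ (p - y)) / n                 # log-likelihood gradient

print("estimated coefficients:", np.round(w, 2))
```

The estimated coefficients approach the assumed ones, and each exponentiated coefficient can be read as an odds ratio for the corresponding predictor.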
The table applies each statistical design discussed to a single research focus, providing further examples of commonly used designs (see Table. Types of Variables and Statistical Designs).
- Clinical Significance
Though numerous other statistical designs and extensions of the methods covered in this article exist, the above information provides a starting point for healthcare providers to become acquainted with variables and commonly used designs. Researchers should study the types of variables before choosing statistical tests to obtain relevant measures and valid study results. [6] Consulting a statistician is recommended to ensure that the statistical design is appropriate for the variables and that its assumptions are upheld. [1] With the variety of statistical software available, investigators must understand a priori the types of statistical tests required when designing a study. [13] All providers must interpret and scrutinize journal publications to make evidence-based clinical decisions, and this is enhanced by a limited but sound understanding of variables and commonly used study designs. [14]
- Nursing, Allied Health, and Interprofessional Team Interventions
All interprofessional healthcare team members need to be familiar with study design and the variables used in studies to accurately evaluate new data and studies as they are published and apply the latest data to patient care and drive optimal outcomes.
Types of Variables and Statistical Designs. Contributed by M Huecker, MD, and J Shreffler, PhD
Disclosure: Jacob Shreffler declares no relevant financial relationships with ineligible companies.
Disclosure: Martin Huecker declares no relevant financial relationships with ineligible companies.
This book is distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) ( http://creativecommons.org/licenses/by-nc-nd/4.0/ ), which permits others to distribute the work, provided that the article is not altered or used commercially. You are not required to obtain permission to distribute this article, provided that you credit the author and journal.
- Cite this Page Shreffler J, Huecker MR. Types of Variables and Commonly Used Statistical Designs. [Updated 2023 Mar 6]. In: StatPearls [Internet]. Treasure Island (FL): StatPearls Publishing; 2024 Jan-.