Item Response Theory Resource Center

Welcome to the Item Response Theory Resource Center. This site provides a range of resources for individuals seeking accessible information about item response theory concepts and applications. Information on this page is compiled by the Office of Assessment, Evaluation, and Research Services (OAERS) in consultation with members of the Department of Educational Research Methodology at The University of North Carolina at Greensboro.

The contents of this Item Response Theory Resource Center are organized according to the six sections listed below. You can jump to a particular section by clicking on the section name in the list below.

  1. Overview of Item Response Theory
  2. Why Use IRT?
  3. Dichotomous IRT Models
  4. Other IRT Models
  5. How to Learn More About IRT?
  6. References for IRT Resources

Overview of Item Response Theory

Item Response Theory is a measurement framework used in the design and analysis of educational and psychological assessments (achievement tests, rating scales, inventories, or other instruments) that measure mental traits.  Item response theory, or IRT for short, is based on establishing a model that specifies the probability of observing each response option to an item as a function of the target trait being measured by the assessment, which is often a knowledge, skill, or ability (e.g., math ability). In testing situations where items are scored as correct or incorrect, IRT specifies the probability of a correct response to an item as a function of ability.

[Figure 1. An item response function (IRF): the probability of correct response to an item as a function of ability.]

An example of an IRT model specifying the probability of correct response to an item is shown in Figure 1. In Figure 1, the horizontal axis represents the level of ability, which is quantified on a standardized scale that ranges from approximately -3 to 3, such that a value of zero typically reflects an average level of ability. The vertical axis represents the probability of correct response to the item. The S-shaped curve represents the probability of correct response to the item of interest at each level of ability. This curve is the IRT model for the item, and is referred to as the item response function, or IRF. The IRF is also widely referred to as the item characteristic curve (ICC) because its shape describes the psychometric characteristics of the item.

Each item has its own unique IRF, and the particular shape of the IRF reflects the item’s psychometric properties, such as difficulty and discrimination. Because each item is assigned a unique model that captures its psychometric properties, IRT can take those properties into account when developing tests, evaluating overall test properties, and estimating a respondent’s ability.

Why Use IRT?

IRT has gained popularity due to its advantages over the simpler measurement framework of Classical Test Theory (CTT). A primary advantage of IRT is that it offers a rigorous, yet flexible, framework for placing assessments composed of different items on a common scale. This is of substantial benefit when linking the scores of multiple forms of an assessment onto a single reporting scale so that the scores have the same meaning across the different forms (e.g., to ensure comparability of scores on different forms of a reading assessment administered across successive years). A related application of this advantage is computer adaptive testing, whereby each examinee is administered a set of items tailored to the examinee’s level of ability, resulting in different examinees receiving different sets of items while maintaining comparability of the final test scores. Another advantage of IRT is its capability to specify reliability separately for each examinee. Whereas reliability in CTT is summarized by a single index applied equally to all examinees regardless of ability level, IRT has the flexibility to estimate measurement precision uniquely for each examinee. This information can be very useful when different individuals are administered different items (as in computer adaptive testing), or when building test forms around cut-scores or proficiency standards, so that the forms maximize precision (minimize error) at those points on the scale. Despite its benefits, the appropriate application of IRT requires stronger (harder to meet) statistical assumptions than CTT, and typically requires larger sample sizes than those needed for CTT.

Dichotomous IRT Models

The most commonly encountered IRT models are for dichotomously scored items, whereby the item is scored according to the two outcomes of correct and incorrect. This is the typical situation, for example, for multiple-choice items that have a single correct answer and one or more incorrect distractor options. For dichotomously scored items, IRT is based on a model that specifies the probability of correct response as a function of ability. An example of an IRT model for a dichotomously scored item is shown in Figure 1. There are three widely used dichotomous IRT models, which vary with respect to the number of parameters they include to modify the shape of the IRF. These three models are referred to as the one-parameter logistic model (1PL model), the two-parameter logistic model (2PL model), and the three-parameter logistic model (3PL model). As the number of parameters in the model increases (i.e., from 1 to 2 to 3), the model becomes more flexible, and thus can provide a more realistic reflection of how the expected response to each item is related to the underlying ability. However, each model has its own practical advantages, and thus the 1PL, 2PL, and 3PL models are all widely used in applied testing contexts. Each of these three models is described below.

The 1PL Model

The 1PL model is the simplest item response theory model, and specifies an item’s IRF using only one item parameter, which reflects the difficulty of the item. The difficulty parameter is denoted by the letter b; as the value of b increases, so too does the difficulty of the item. The 1PL model specifies the probability that the response (Yᵢ) to item i is correct using the equation

$$P(Y_i = \text{correct} \mid \theta) = \frac{e^{\theta - b_i}}{1 + e^{\theta - b_i}}$$

In the above equation, θ represents ability and bᵢ represents the value of the b-parameter for item i. The value of bᵢ typically ranges between -2 and +2, but can take on more extreme values. Under the 1PL model, the value of bᵢ represents the level of ability (θ) required for a 50% chance of correct response (i.e., a .5 probability of correct response). Thus, if bᵢ = 0, then the probability of correct response equals .5 at an ability level of θ = 0.
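To make the equation concrete, here is a minimal Python sketch of the 1PL IRF; the function name and parameter values are illustrative, not drawn from any particular IRT package:

```python
import math

def irf_1pl(theta: float, b: float) -> float:
    """Probability of a correct response under the 1PL model."""
    return math.exp(theta - b) / (1.0 + math.exp(theta - b))

# An item with b = 0 gives a .5 chance of success at theta = 0, and the
# probability rises as ability moves above the item's difficulty:
print(irf_1pl(theta=0.0, b=0.0))   # 0.5
print(irf_1pl(theta=1.0, b=0.0))   # ~0.73
print(irf_1pl(theta=-1.0, b=0.0))  # ~0.27
```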

[Figure 2. IRFs for two 1PL items: Item 1 with b₁ = -1 and Item 2 with b₂ = 1.]

Figure 2 displays the IRFs for two items (Item 1 and Item 2) having different values of bᵢ; the value of bᵢ for Item 1 is -1 (i.e., b₁ = -1) and the value of bᵢ for Item 2 is 1 (i.e., b₂ = 1). Notice how the value of bᵢ specifies the horizontal location of the item’s IRF; as bᵢ increases, the IRF shifts to the right and the item becomes more difficult. In this manner, Item 2 is more difficult than Item 1, such that at any given level of ability there is a higher probability of correctly responding to Item 1 than to Item 2. Notice that the probability of correct response to Item 1 equals .5 at an ability value of -1, as would be expected given that b₁ = -1. Similarly, the probability of correct response to Item 2 equals .5 at an ability value of 1, as would be expected given that b₂ = 1.

The 1PL model shown above is equivalent to the Rasch model, proposed by the Danish mathematician Georg Rasch. It is relevant to note, however, that other forms of the 1PL model exist in which a scaling factor that is constant across all items is included in the model. In some instances, the scaling factor D = 1.7 is included in the exponent, and in other instances the scaling factor is an alternative constant value (symbolized by a) chosen to give the best fit to the data.
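For reference, the scaled form of the 1PL model places the constant in the exponent as shown below; the conventional value D = 1.7 is used because it brings the logistic curve into close agreement with the normal ogive model:

$$P(Y_i = \text{correct} \mid \theta) = \frac{e^{D(\theta - b_i)}}{1 + e^{D(\theta - b_i)}}, \qquad D = 1.7$$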

The 2PL Model

While the 1PL model has the advantage of simplicity, it lacks the flexibility to allow different items to have IRFs of different steepness (or slope). The 2PL model overcomes this limitation by including a second parameter (aᵢ) that controls the steepness of the IRF. As aᵢ increases, so too does the steepness of the IRF. Because the steepness of the IRF reflects how well the item is able to differentiate, or discriminate, between individuals having different values of θ, aᵢ is commonly referred to as the discrimination parameter. The 2PL model specifies the probability of correct response to the ith item (Yᵢ = correct) using the equation

$$P(Y_i = \text{correct} \mid \theta) = \frac{e^{a_i(\theta - b_i)}}{1 + e^{a_i(\theta - b_i)}}$$

Higher levels of item discrimination reflect a higher degree of information that the item provides about the respondent’s ability level. As a result, the value of aᵢ is an indicator of how much information the item provides about the respondent’s ability. Because the 2PL model registers how much information each item provides (via aᵢ), items are assigned different weights when estimating ability; the higher the value of aᵢ, the more weight the item receives in the estimate.
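The notion of item information can be made concrete with a short Python sketch. Under the 2PL model, the item information at a given ability level is aᵢ² multiplied by P(1 - P), where P is the probability of correct response; this is a standard result, and the item parameters below are purely illustrative:

```python
import math

def irf_2pl(theta: float, a: float, b: float) -> float:
    """Probability of a correct response under the 2PL model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_info_2pl(theta: float, a: float, b: float) -> float:
    """Item information under the 2PL model: a^2 * P * (1 - P)."""
    p = irf_2pl(theta, a, b)
    return a ** 2 * p * (1.0 - p)

# Two items like those in Figure 3 below (b = 0 for both, a = 2.5 vs. a = 1):
# the more discriminating item carries far more information near theta = b.
print(item_info_2pl(0.0, a=2.5, b=0.0))  # 1.5625
print(item_info_2pl(0.0, a=1.0, b=0.0))  # 0.25

# Summing item information across a form gives the test information at theta;
# 1 / sqrt(information) is the examinee-specific standard error mentioned
# under "Why Use IRT?" above.
items = [(2.5, -1.0), (1.0, 0.0), (1.5, 1.0)]  # (a, b) pairs
info = sum(item_info_2pl(0.0, a, b) for a, b in items)
print(1.0 / math.sqrt(info))  # standard error of ability at theta = 0
```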

[Figure 3. IRFs for two 2PL items with bᵢ = 0: Item 1 with a₁ = 2.5 and Item 2 with a₂ = 1.]

Figure 3 displays the IRFs for two items (Item 1 and Item 2) having different values of aᵢ; the value of aᵢ for Item 1 is 2.5 (i.e., a₁ = 2.5) and the value of aᵢ for Item 2 is 1 (i.e., a₂ = 1). For both items, the value of bᵢ = 0. Notice that the value of aᵢ specifies the steepness of the item’s IRF; as aᵢ increases, the IRF becomes steeper and the item becomes more discriminating. In this manner, Item 1 has a higher degree of discrimination than Item 2, and thus provides a greater amount of information concerning the respondent’s level of ability. For this reason, items with a higher value of aᵢ tend to be viewed as having more desirable psychometric properties than items with a lower value of aᵢ.

The 3PL Model

The final dichotomous IRT model is the 3PL model, which includes the parameter cᵢ to represent the possibility of guessing. The value of cᵢ reflects the lowest possible value of the item’s IRF as ability becomes very low (known as the lower asymptote of the IRF). Thus, if cᵢ = .2, then the probability of correct response for an individual with a very low level of ability would be .2. Because the value of cᵢ reflects the result of guessing behavior, it is often referred to as the pseudo-guessing parameter. The 3PL model specifies the probability of correct response to the ith item (Yᵢ = correct) using the equation

$$P(Y_i = \text{correct} \mid \theta) = c_i + (1 - c_i)\,\frac{e^{a_i(\theta - b_i)}}{1 + e^{a_i(\theta - b_i)}}$$

Note that setting the value of cᵢ equal to zero yields the equation for the 2PL model, and then further setting the value of aᵢ equal to unity yields the equation for the 1PL model. In this way, the 1PL and 2PL models can be conceptualized as constrained forms of the 3PL model.
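This nesting is easy to verify numerically. Below is a minimal Python sketch of the 3PL IRF (illustrative names and values); setting c = 0 and a = 1 reproduces the 1PL probability:

```python
import math

def irf_3pl(theta: float, a: float = 1.0, b: float = 0.0, c: float = 0.0) -> float:
    """Probability of a correct response under the 3PL model."""
    p_2pl = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return c + (1.0 - c) * p_2pl

# Setting c = 0 (and a = 1) recovers the 1PL probability:
theta, b = 0.7, 0.3
assert math.isclose(irf_3pl(theta, a=1.0, b=b, c=0.0),
                    math.exp(theta - b) / (1.0 + math.exp(theta - b)))

# With c = 0.2, the probability never falls below .2, even at very low ability:
print(irf_3pl(theta=-8.0, a=2.0, b=0.0, c=0.2))  # ~0.2 (the lower asymptote)
```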

[Figure 4. IRFs for two 3PL items with bᵢ = 0 and aᵢ = 2: Item 1 with c₁ = 0 and Item 2 with c₂ = 0.2.]

Figure 4 displays the IRFs for two items (Item 1 and Item 2) having different values of cᵢ; the value of cᵢ for Item 1 is 0 (i.e., c₁ = 0) and the value of cᵢ for Item 2 is 0.2 (i.e., c₂ = 0.2). For both items, the value of bᵢ = 0 and aᵢ = 2. The value of c₂ = 0.2 causes the lower bound of the IRF for Item 2 to be higher than that of Item 1, representing the presence of guessing on Item 2. While the value of bᵢ represents the level of ability at which the probability of correct response equals .5 under the 1PL and 2PL models, the same does not hold under the 3PL model when cᵢ > 0. This property is demonstrated in Figure 4, whereby the probability of correct response to Item 1 equals .5 at θ = b₁ (recall that c₁ = 0), but the probability of correct response to Item 2 equals .5 at θ < b₂ (recall that c₂ = 0.2).

Other IRT Models

The dichotomous models described above were the first IRT models to be popularized, owing to the widespread use of multiple-choice items that are scored dichotomously as correct or incorrect. However, many other families of IRT models exist that address different item formats and variations in the structure of the traits underlying an individual’s response to an item. This section describes some of these families of models.

Polytomous IRT Models

Dichotomous IRT models are appropriate for items that have two possible score categories, such as correct and incorrect. Polytomous IRT models are appropriate for items that have more than two score categories.  Examples of this would be a test item that allows for partial credit, such as a rated essay question for which examinees can receive zero to four points, or a survey item with multiple response levels (strongly disagree, disagree, agree, or strongly agree).
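As one concrete example, a widely used polytomous model is the graded response model, in which cumulative probabilities of reaching each score category are modeled with 2PL-style curves, and category probabilities are obtained as differences between adjacent curves. The sketch below is a minimal illustration with made-up parameter values, not the definitive formulation used by any particular program:

```python
import math

def logistic(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def grm_category_probs(theta: float, a: float, bs: list[float]) -> list[float]:
    """Category probabilities under a graded response model.

    bs holds the ordered category thresholds b_1 < ... < b_{K-1};
    the item has K = len(bs) + 1 score categories.
    """
    # Cumulative probabilities P(Y >= k), padded with boundary values 1 and 0.
    cum = [1.0] + [logistic(a * (theta - b)) for b in bs] + [0.0]
    return [cum[k] - cum[k + 1] for k in range(len(bs) + 1)]

# A 0-4 point essay item with four thresholds (illustrative values):
probs = grm_category_probs(theta=0.0, a=1.5, bs=[-1.5, -0.5, 0.5, 1.5])
print([round(p, 3) for p in probs], round(sum(probs), 3))  # sums to 1
```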

Multidimensional IRT Models

The most common item response theory models assume that a single target trait (e.g., math ability) underlies responses to the items of an assessment. These models are referred to as unidimensional models, due to the single (uni) dimension assumed to account for the differences between individuals. Multidimensional IRT (MIRT) models can be used when more than one trait is believed to account for the differences between individuals.
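In the compensatory MIRT models described by Reckase (2009), for example, the logit is a weighted sum of the traits, so a high standing on one trait can offset a low standing on another. A minimal sketch, with illustrative names and values:

```python
import math

def irf_mirt(thetas: list[float], a: list[float], d: float) -> float:
    """Compensatory multidimensional 2PL: the logit is a weighted sum of the
    traits (with discriminations a) plus an intercept d."""
    z = sum(ak * tk for ak, tk in zip(a, thetas)) + d
    return 1.0 / (1.0 + math.exp(-z))

# An item loading on two traits; strength on trait 1 offsets weakness on trait 2:
print(irf_mirt(thetas=[1.0, -0.5], a=[1.2, 0.6], d=0.0))
```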

Testlet IRT Models

A testlet is a group of items that are dependent, where the dependence is most often attributable to a common stimulus. For example, consider an assessment of reading comprehension in which the examinee is administered a series of reading passages, and each reading passage is accompanied by a group of multiple-choice items asking questions about that passage. In this instance, each group of items associated with a particular reading passage forms a testlet, because the responses to that group of items depend not only on overall reading comprehension, but also on familiarity with the content of the particular passage. In this context, the content of the passage creates a secondary dimension that impacts the probability of success on each item of the passage; high familiarity with the passage content is expected to increase the chance of success across all items of the testlet. Because this passage-content effect is assumed to influence the responses to all items of the testlet, a dependency is created between the items of the testlet even after controlling for the targeted trait of interest (reading comprehension in this example). This dependency is not accounted for by the standard dichotomous IRT models described above.

To account for testlet effects, a series of testlet models have been developed that include consideration of the targeted trait as well as the effect of the testlet (e.g., familiarity with a reading passage’s content) on the probability of correct response to each item of the testlet. Testlet IRT models are similar to the 1PL, 2PL, and 3PL dichotomous IRT models described above, with the exception that they include two latent variables determining the respondent’s success on the item: (a) the first latent variable is θ, the general trait targeted by the assessment, and the only latent variable included in the 1PL, 2PL, and 3PL models; and (b) a testlet variable that describes the examinee’s proficiency specific to the testlet to which the item belongs (e.g., the examinee’s familiarity with the testlet content). Each examinee, then, has one overall proficiency value (θ), as well as a separate testlet proficiency value for each of the testlets on the assessment. As a result, testlet IRT models are a special case of multidimensional IRT models, whereby each item of a testlet has two associated dimensions: one dimension is the overall proficiency targeted by the assessment (θ), and the second dimension is the testlet proficiency specific to the testlet to which the item belongs.
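A minimal sketch of a 2PL-style testlet model follows, in the spirit of the models described by Wainer, Bradlow, and Wang (2007); here the testlet-specific proficiency γ simply shifts the logit alongside θ. Sign conventions and parameterizations vary across presentations, and the names and values below are illustrative:

```python
import math

def irf_testlet_2pl(theta: float, gamma: float, a: float, b: float) -> float:
    """2PL-style testlet IRF: gamma is the examinee's proficiency specific to
    the testlet containing the item (e.g., familiarity with its passage)."""
    return 1.0 / (1.0 + math.exp(-a * (theta + gamma - b)))

# The same examinee (theta = 0.5) answering the same item (a = 1.2, b = 0.0)
# embedded in a familiar versus an unfamiliar passage:
print(irf_testlet_2pl(theta=0.5, gamma=0.8, a=1.2, b=0.0))   # familiar passage
print(irf_testlet_2pl(theta=0.5, gamma=-0.8, a=1.2, b=0.0))  # unfamiliar passage
```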

Diagnostic Classification Models

The initial development of IRT centered on models for which the trait targeted by the assessment was a continuous latent variable. This is the case, for example, in the 1PL, 2PL, and 3PL dichotomous IRT models shown in Figures 2-4 above. This form of IRT model is appropriate when the targeted traits have outcomes that conform to a continuum, such as an ability continuum, for which there exists a range of possible levels. In some assessment contexts, however, the targeted traits have outcomes that are more appropriately conceptualized as categories, such as mastery and non-mastery. The categories of such latent variables are often referred to as latent classes. In this type of assessment context, the purpose of the assessment is not to estimate the location of individuals along a latent continuum, but rather to identify the class (e.g., mastery or non-mastery) to which an individual belongs.

When the trait targeted by the assessment conforms to a set of categories, IRT models that are based on categorical traits, or latent classes, should be used. These latent class IRT models are commonly referred to as diagnostic classification models, or DCMs for short. In assessment contexts for which the responses to items can be appropriately explained by the mastery of a set of skills, DCMs can be used to classify each respondent according to mastery or non-mastery of each of the skills underlying the items of the assessment. The resulting profile of skill mastery is often useful in formative assessment situations to provide information concerning a student’s areas of strength and weakness.
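One of the simplest and most widely discussed DCMs is the DINA model (covered in depth by Rupp, Templin, and Henson, 2010), in which an item is answered correctly with high probability only when the respondent has mastered every skill the item requires, subject to "slip" and "guess" probabilities. A minimal sketch with illustrative values:

```python
from typing import Sequence

def dina_prob(alphas: Sequence[int], q: Sequence[int],
              slip: float, guess: float) -> float:
    """DINA model: probability of a correct response given a mastery profile.

    alphas: 0/1 mastery indicators for each skill.
    q:      0/1 Q-matrix row flagging the skills this item requires.
    """
    # eta is True only if every skill the item requires has been mastered.
    eta = all(a == 1 for a, needed in zip(alphas, q) if needed == 1)
    return 1.0 - slip if eta else guess

# An item requiring skills 1 and 3 (of three skills), with slip .1 and guess .2:
print(dina_prob(alphas=[1, 0, 1], q=[1, 0, 1], slip=0.1, guess=0.2))  # 0.9
print(dina_prob(alphas=[1, 1, 0], q=[1, 0, 1], slip=0.1, guess=0.2))  # 0.2
```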

There are a wide range of DCMs that vary in their complexity and in their assumptions concerning the nature of the latent classes and how the latent classes interact to generate responses to items. Most applications of DCMs involve more than one classification variable (e.g., mastery vs. non-mastery of more than one skill), and thus DCMs are a specialization of multidimensional IRT models for which each dimension consists of a categorical variable. Relevant references providing a broader description of DCMs and their application are provided below in the section "How to Learn More About IRT?"

How to Learn More About IRT?

Numerous resources are available for learning more about the concepts and applications of IRT. To help practitioners and researchers identify appropriate resources, this section lists relevant books and articles across a range of IRT topics. Full citations of these resources are provided at the bottom of this page under the References section. Individuals seeking more extensive training in IRT may consider obtaining a graduate degree specializing in educational measurement or psychometrics, such as the graduate degree programs offered by the Department of Educational Research Methodology at The University of North Carolina at Greensboro.

Non-Technical Overviews of IRT

For those seeking a relatively short and accessible overview of IRT, two good sources to start with are DeMars (2010) and Hambleton, Swaminathan, and Rogers (1991). Both sources describe the basics of IRT concisely and with few assumptions of prior math or statistics knowledge. Embretson and Reise (2000) also provide a highly accessible overview of IRT that makes few assumptions of prior knowledge in math or statistics, but they offer a deeper discussion of IRT models, concepts, and applications than the substantially shorter books by DeMars (2010) and Hambleton et al. (1991). The Embretson and Reise (2000) book is noted for its extensive discussion of how IRT presents a new measurement framework, and of the advantages this framework holds over classical test theory. The title of Embretson and Reise (2000) reflects a focus on IRT applications in psychology, but it is often viewed as an appropriate account of IRT for individuals across the behavioral, social, and health sciences.

Nontechnical overviews of the Rasch family of models can be found in several sources. Bond and Fox (2007) provide a highly accessible description of the Rasch measurement framework, with very little reliance on mathematical equations. Andrich (1988) and Wright and Masters (1982) also provide discussions of the Rasch measurement framework and relevant Rasch models that do not require extensive knowledge of math or statistics.

Comprehensive Descriptions of IRT and Parameter Estimation

For those seeking a comprehensive account of IRT, de Ayala (2009) is a fine resource, covering a wide range of IRT models, parameter estimation, and applications. This book digs into the mathematics of IRT a little more than the resources discussed above under "Non-Technical Overviews of IRT", but it remains highly accessible to readers with a wide range of technical backgrounds. Lord (1980) and Hambleton and Swaminathan (1985) are classic resources, but they require more mathematical and statistical sophistication than de Ayala (2009), and they also place the majority of their focus on dichotomous models. For readers seeking a comprehensive account of IRT parameter estimation, Baker and Kim (2004) is the way to go; it provides detailed accounts of the derivations of maximum likelihood, marginal maximum likelihood, and fully Bayesian methods of estimation. Fox (2010) is a comprehensive reference on Bayesian methods for IRT. A comprehensive and technical treatment of Rasch models, their theoretical derivations, and relevant applications can be found in the edited volume by Fischer and Molenaar (1995).

For readers seeking a condensed description of IRT and Rasch models, there are several chapters in edited volumes that may be helpful. The chapter by Yen and Fitzpatrick (2006) provides a broad overview of IRT, including descriptions of many IRT models (dichotomous, polytomous, multidimensional, testlet, etc.) and applications. More general treatments of IRT are provided in the chapters by Thissen and Steinberg (2009), Edwards and Edelen (2009), and Bock and Moustaki (2007).

Descriptions of Advanced IRT Models (Polytomous, Multidimensional, etc.)

Polytomous Models: Readers seeking a comprehensive discussion of polytomous IRT models have several resources at their disposal. Accessible overviews of the major polytomous IRT models are provided by Penfield (2014) and Ostini and Nering (2006); both are concise and make few assumptions about the mathematical and statistical background of the reader. Nering and Ostini (2010) is an edited volume that includes chapters on many of the widely used polytomous IRT models, as well as on conceptual and statistical issues and properties associated with these models; it is, in general, a little more theoretical in nature than the Ostini and Nering (2006) book. The edited volume by van der Linden and Hambleton (1997) includes comprehensive coverage of many important polytomous models, including descriptions of the models, overviews of parameter estimation for the models, and discussion of model applications. Overviews of polytomous IRT models can also be found in Embretson and Reise (2000), de Ayala (2009), and Baker and Kim (2004; especially useful for parameter estimation of polytomous IRT models).

Multidimensional Models: For readers seeking descriptions of multidimensional IRT (MIRT) models, Reckase (2009) provides the most comprehensive treatment of the topic currently available. This book offers a broad overview of the most widely disseminated MIRT models, along with descriptions of relevant applications. Readers seeking a more concise introduction to MIRT models and their application are referred to Ackerman, Gierl, and Walker (2003).

Testlet Models: Testlet models are a relatively new addition to the field of IRT, and as a result there are few available resources describing these models. The seminal, and still most comprehensive, treatment of testlet models is provided by Wainer, Bradlow, and Wang (2007). This book provides an extensive overview of testlet response theory, associated models, and applications of testlet models to real testing scenarios, and is a valuable resource for individuals seeking a broader understanding of testlet models, their application, and the underlying theory of testlets. Brief overviews of testlet models are provided in both Yen and Fitzpatrick (2006) and de Ayala (2009).

Diagnostic Classification Models: Diagnostic classification modeling is a growing area of IRT that, until recently, had few accessible resources describing relevant models and applications. Rupp, Templin, and Henson (2010) provide the most thorough treatment of this modeling approach, offering a broad overview of the models themselves, parameter estimation, and application; their book is an organized treatment of DCMs that is accessible to individuals with a wide range of math and statistics backgrounds. Leighton and Gierl (2007) is an edited book containing chapters on a range of topics pertaining to diagnostic measurement, its models, and applications. Readers looking to learn more about diagnostic classification models and their application would benefit from both of these resources.

Descriptions of IRT Applications (Equating, CAT, etc.)

Linking and Equating: The application of IRT to linking and equating is described in several sources. Now in its second edition, Kolen and Brennan (2004) provides a comprehensive treatment of the topic. Dorans, Pommerich, and Holland (2007) is an edited volume describing a wide range of issues related to linking and equating, with discussions of IRT applications throughout. Readers seeking a concise and accessible overview of IRT methods for equating are referred to Cook and Eignor’s (1991) instructional module on the topic.

References for IRT Resources

Ackerman, T. A., Gierl, M. J., & Walker, C. M. (2003). Using multidimensional item response theory to evaluate educational and psychological tests. Educational Measurement: Issues and Practice, 22(3), 37-53.

Andrich, D. (1988). Rasch models for measurement. Newbury Park, CA: Sage.

Baker, F. B., & Kim, S.-H. (2004). Item response theory: Parameter estimation techniques (2nd ed.). New York: Marcel Dekker.

Bock, R. D., & Moustaki, I. (2007). Item response theory in a general framework. In S. Sinharay & C. R. Rao (Eds.), Handbook of Statistics, Volume 26: Psychometrics (pp. 469-513). New York: Elsevier.

Bond, T. G., & Fox, C. M. (2007). Applying the Rasch model: Fundamental measurement in the human sciences (2nd ed.). Mahwah, NJ: Lawrence Erlbaum.

Cook, L. L., & Eignor, D. R. (1991). IRT equating methods. Educational Measurement: Issues and Practice, 10(3), 37-45.

De Ayala, R. J. (2009). The theory and practice of item response theory. New York: Guilford Press.

DeMars, C. (2010). Item response theory. New York: Oxford University Press.

Dorans, N. J., Pommerich, M., & Holland, P. W. (Eds.). (2007). Linking and aligning scores and scales. New York: Springer.

Edwards, M. C., & Edelen, M. O. (2009). Special topics in item response theory. In R. E. Millsap & A. Maydeu-Olivares (Eds.), The Sage handbook of quantitative methods in psychology (pp. 178-198). Thousand Oaks, CA: Sage.

Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Erlbaum.

Fischer, G., & Molenaar, I. W. (Eds.). (1995). Rasch models: Foundations, recent developments, and applications. New York: Springer-Verlag.

Fox, J. -P. (2010). Bayesian item response modeling. New York: Springer.

Hambleton, R. K., & Swaminathan, H. (1985). Item response theory: Principles and applications. Boston: Kluwer.

Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage.

Kolen, M.J., & Brennan, R.L. (2004). Test equating, scaling, and linking: Methods and practices (2nd ed.). New York: Springer.

Leighton, J. P., & Gierl, M. J. (Eds.). (2007). Cognitive diagnostic assessment for education: Theory and applications. New York: Cambridge University Press.

Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Erlbaum.

Nering, M. L., & Ostini, R. (2010). Handbook of polytomous item response theory models. New York: Routledge.

Ostini, R., & Nering, M. L. (2006). Polytomous item response theory models. Thousand Oaks, CA: Sage.

Penfield, R. D. (2014). An NCME instructional module on polytomous item response theory models. Educational Measurement: Issues and Practice, 33(1), 36-48.

Reckase, M. D. (2009). Multidimensional item response theory. New York: Springer.

Rupp, A. A., Templin, J., & Henson, R. A. (2010). Diagnostic measurement: Theory, methods, and applications. New York: The Guilford Press.

Thissen, D., & Steinberg, L. (2009). Item response theory. In R. E. Millsap & A. Maydeu-Olivares (Eds.), The Sage handbook of quantitative methods in psychology (pp. 148-177). Thousand Oaks, CA: Sage.

van der Linden, W. J., & Hambleton, R. K. (Eds.). (1997). Handbook of modern item response theory. New York: Springer-Verlag.

Wainer, H., Bradlow, E. T., & Wang, X. (2007). Testlet response theory and its applications. New York: Cambridge University Press.

Wright, B. D., & Masters, G. N. (1982). Rating scale analysis.  Chicago: MESA Press.

Yen, W. M., & Fitzpatrick, A. R. (2006). Item response theory. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 111-153). Westport, CT: Praeger.