
Item and Test Bias

Howard Wainer, Stephen G. Sireci, in Encyclopedia of Social Measurement, 2005

Evaluating Test Bias

The evaluation of test bias focuses on test scores rather than on test items. In most cases, evaluation of bias operates within a predictive validity framework. Predictive validity is the degree to which test scores accurately predict scores on a criterion measure. A conspicuous example is the degree to which college admissions test scores predict college grade point average (GPA). Given this predictive context, it should not be surprising that regression models are used to evaluate predictive validity. The analysis of test bias typically investigates whether the relationship between test and criterion scores is consistent across examinees from different groups. Such studies of test bias are often referred to as studies of differential predictive validity.

There are two common methods for evaluating test bias using regression procedures. The first involves fitting separate regression lines to the predictor (test score) and criterion data for each group and then testing for differences in regression coefficients and intercepts. When sufficient data for separate equations are not available, a different method must be used. This method involves fitting only one regression equation. In one variation of this method, the coefficients in the equation are estimated using only the data from the majority group. The analysis of test bias then focuses on predicting the criterion data for examinees in the minority group and examining the errors of prediction (residuals). In another variation of this method, the data from all examinees (i.e., majority and minority examinees) are used to compute the regression equation. The residuals are then compared across groups.

The simplest form of a regression model used in test bias research is y = b₁X₁ + a + e, where y is the criterion value, b₁ is a coefficient describing the utility of the predictor X₁ for predicting the criterion, a is the intercept of the regression line (i.e., the predicted value of the criterion when the value of the predictor is zero), and e represents error (i.e., variation in the criterion not explained by the predictor). If the criterion variable were freshman-year GPA and the predictor variable were Scholastic Aptitude Test (SAT) score, b₁ would represent the utility of the SAT for predicting freshman GPA. To estimate the parameters in this equation, data on the predictor and criterion are needed. The residuals (e in the equation) represent the difference between the actual criterion value and the criterion value predicted by the equation. Every examinee in a predictive validity study has a test score and a criterion score. The residual is simply the criterion score minus the score predicted by the test.

Analysis of the residuals in a test bias study focuses on overprediction and underprediction. Overprediction errors occur, for example, when the predicted GPA for a student is higher than her/his actual GPA. Underprediction errors would occur when the predicted GPA for a student is lower than her/his actual GPA. By examining the patterns of over- and underprediction across the examinees from different groups, evidence of differential predictive accuracy is obtained. For example, if the errors of prediction tended to be primarily underprediction errors for females, it could be concluded that the test is biased against females.
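As an illustration of the residual analysis just described, the following sketch fits the regression equation to the majority group and then inspects the minority group's errors of prediction. The data, group sizes, and parameter values are invented for illustration; only NumPy is assumed.

```python
# Illustrative sketch of the single-equation method: hypothetical SAT/GPA data,
# regression fit to the majority group only, residuals examined for the minority group.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical predictor (SAT) and criterion (freshman GPA) data for two groups.
sat_majority = rng.normal(1100, 150, 500)
gpa_majority = 1.0 + 0.002 * sat_majority + rng.normal(0, 0.4, 500)
sat_minority = rng.normal(1050, 150, 200)
gpa_minority = 1.0 + 0.002 * sat_minority + rng.normal(0, 0.4, 200)

# Estimate y = b1*X1 + a from the majority group (polyfit returns slope, then intercept).
b1, a = np.polyfit(sat_majority, gpa_majority, deg=1)

# Residual = actual criterion score minus the score predicted by the test.
residuals_minority = gpa_minority - (b1 * sat_minority + a)

# A mean residual above zero indicates underprediction for the group;
# a mean residual below zero indicates overprediction.
print(f"b1 = {b1:.4f}, a = {a:.2f}")
print(f"mean residual for the minority group: {residuals_minority.mean():+.3f}")
```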

This method is flawed because it can indicate test bias where none exists whenever the group means differ. Such a result, illustrated in Fig. 2, is merely a regression effect, and using this methodology when group means differ on both the test and the criterion leads to incorrect inferences. The popular media's conflation of impact with bias has long muddied discussions of this delicate issue. The use of regression methods for detection is not an unmixed blessing.


Figure 2. A graphical counterexample to provide a warning for those who would use regression methods to detect test bias when the groups being investigated show substantial differences in level. The two ellipses represent the bivariate distributions of scores on the predictor and the criterion for two different groups (group A and group B). The principal axes of both ellipses lie on the 45° line, but the regression lines are tilted, yielding differential predictions based on group membership. This is not bias, but rather an example of a regression effect.


URL: https://www.sciencedirect.com/science/article/pii/B0123693985004461

Personnel Selection, Psychology of

H. Schuler, in International Encyclopedia of the Social & Behavioral Sciences, 2001

5.1 Validity

Among the criteria of psychometric quality, validity is the most important; the others are necessary, but not sufficient, conditions for validity. Among the variants of validity (or strategies of validation), predictive validity plays a crucial role, since the objective usually is the prediction of future occupational success. The following diagnostic procedures have demonstrated good or sufficient validity (Salgado 1999, Schmidt and Hunter 1998): tests of general intelligence, work samples, job-related structured interviews, biographical questionnaires, some specific personality tests (integrity, conscientiousness, achievement motivation), and multimodal procedures (assessment center, potential analysis); tests of job knowledge and assessments of job performance are in a similar range. It must be noted that there are some moderating conditions (e.g., the validity of biographical questionnaires for young persons is low) and that validation leads to different coefficients for different measures of success—e.g., supervisory assessment, position level, income, assessment of potential. The aspect of content validity is important especially during the steps of instrument construction, that of construct validity in the examination of the psychological meaning of test scores.


URL: https://www.sciencedirect.com/science/article/pii/B0080430767013942

Organizing

John J. Fay, David Patterson, in Contemporary Security Management (Fourth Edition), 2018

Test the Apparent Best Candidate

Alcohol and drug testing are common procedures for evaluating a candidate for a security job, or for that matter any job that involves protection of human life or sensitive assets such as nuclear material and secret information. The negative effects of alcohol and drugs on human performance in the workplace are widely known. They include lost productivity, absenteeism, high accident rates and medical costs, theft, and violence.

Human behavior can also be assessed through psychological testing. Test content may be directed to almost any facet of intellectual or emotional functioning. Among the aspects of greatest concern to a CSO are honesty, propensity for violence, personal traits, values, and attitude. Test results are obtained by comparing the individual's responses against standard responses, the standard responses having been developed previously by commonly accepted scientific methods. A test is said to have predictive validity when its scores accurately forecast an individual's behavior in a narrowly defined set of circumstances. For example, an honesty test has predictive validity if persons who score high are later shown by their behaviors to be honest. When accurate in predicting a job applicant's future behavior, psychological tests can be valuable hiring tools.

Proficiency tests can be used to select job candidates and to determine their suitability for particular jobs. Aptitude tests predict future performance in a job for which the individual is not currently trained. If a person’s score is similar to scores of others already working in a given job, likelihood of success in that job is predicted. Some aptitude tests cover a broad range of skills pertinent to many different occupations. The General Aptitude Test Battery is an example. It not only measures general reasoning ability but includes measures of perception, motor coordination, and finger and manual dexterity.

Intelligence tests measure the global capacity of an individual to cope with the environment. Test scores are generally known as intelligence quotients. Objective personality tests measure social and emotional adjustment. Responses that briefly describe feelings, attitudes, and behaviors provide a profile of the personality as a whole. The most popular of these tests are the Minnesota Multiphasic Personality Inventory and the California Psychological Inventory.

One cannot discuss preemployment testing without reference to its critics. The major criticisms stem from two interrelated issues. The first is technical shortcomings in test design. Because technical weakness is to some degree inescapably present in all forms of preemployment testing, the sharpest critics demand that such testing not be used at all. The mainstream view is that test results represent only one piece of information about an individual and, as such, should not be used as the sole criterion for selection or rejection.

The second criticism deals with interpretation and application of results. Opponents argue that testing can under- and overvalue job candidates, and employers who use testing often place an inappropriate reliance on tests. These arguments have been particularly loud in the case of intelligence testing. Psychologists generally agree that using intelligence tests to bar individuals from job opportunities, without careful consideration of other relevant factors, is unethical. Critics have taken their views to the courtroom. A chief argument is that certain tests emphasize skills associated with white, middle-class functioning, resulting in discrimination against disadvantaged and minority groups.


URL: https://www.sciencedirect.com/science/article/pii/B9780128092781000025

Organizational Psychology

Oleksandr S. Chernyshenko, Stephen Stark, in Encyclopedia of Social Measurement, 2005

Measurement of Personality

The five-factor structure of personality (extraversion, agreeableness, conscientiousness, neuroticism, and openness to experience) has emerged in the past decade as the predominant framework for describing the basic dimensions of normal personality. Based on this framework, selection researchers have conducted several meta-analyses to examine which “big-five” dimensions are useful for predicting work performance. These studies have shown consistently that conscientiousness is a valid predictor of job performance across all occupations included in the meta-analyses, whereas the other four dimensions predict success in specific occupations, or relate to specific performance criteria. More recently, results of a number of studies have suggested that measures of narrow traits may have higher predictive validity, compared to measures of broad factors. For example, several big-five factor measures, as well as narrow trait measures, were correlated with several performance criteria and it was found that the latter were more effective predictors. Similarly, other studies have shown that narrow facets of a broad factor can have marked but opposite relationships with a performance criterion. Consequently, if measures of the narrow facets are aggregated to assess a broad personality factor, the relationship with an external criterion may fall to zero. In sum, these results suggest that using lower order facets of the big five might increase the validity of personality measures in selection contexts.

There is currently no shortage of personality inventories measuring both broad and narrow big-five traits. Among the most widely used measures for selection are the NEO (neuroticism, extraversion, openness) Personality Inventory (NEO-PI), the Sixteen Personality Factor Questionnaire (16PF), and the California Psychological Inventory (CPI). All of these inventories were developed using classical test theory (CTT) methods. They are composed of scales of 10 to 15 self-report items that ask about a respondent's typical behavior. Most measures also contain a lie or social desirability scale to identify respondents who may engage in impression management. Though these scales might work well in research settings, where respondents are motivated to answer honestly, their efficacy in selection contexts is still a matter of concern. Recent research comparing responses to traditional personality items across testing situations has clearly shown that applicants can and will increase their scores, relative to nonapplicants, when there are incentives to fake “good.” In fact, it is not uncommon for applicants to score 1 SD higher than respondents in research settings. The main problem, however, is that not all individuals fake, and certainly not to the same extent, so faking affects the rank order of high-scoring individuals and, thus, the quality of hiring decisions. Moreover, because approaches to detect and correct for faking post hoc, using social desirability/impression management scores, are generally ineffective, interest may be shifting toward preventing faking by using veiled items, threats of sanctions, or alternative item formats. An example of the latter approach involves the construction of multidimensional forced-choice items that are designed to be fake resistant. By pairing statements on different dimensions that are similar in social desirability and asking a respondent to choose the statement in each pair that is “most like me,” items are made more difficult to fake. However, important concerns have been raised about the legitimacy of interindividual comparisons due to ipsativity. In an effort to address that concern, a mathematical model has been proposed for calibrating statements, constructing multidimensional forced-choice tests, and scoring respondents. In simulation studies conducted to date, this approach has been shown to recover known latent trait scores, representing different personality dimensions, with a reasonable degree of accuracy. But more research is needed to examine the construct and predictive validity of such personality measures in applied settings.

Aside from concerns about faking, there is a basic question that should be examined in the area of personality assessment. Namely, studies are needed to explore the way in which persons respond to personality items. Many measures have been developed and a great deal of research has been conducted, assuming that a dominance process underlies item responding; that is, a person will tend to endorse a positively worded item when his/her standing on the latent dimension is more positive than that of the item. However, some recent studies suggest that an ideal point response process may be better suited for personality items; i.e., respondents tend to agree with items that are located near them on the latent continuum and to disagree with those that are distant in either direction. In IRT terms, this means that some personality items may exhibit nonmonotonic, folded item response functions (IRFs). In the case of neutral items, the IRFs may be bell-shaped when computed based on an ideal point model.
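The contrast between a dominance response process and an ideal point response process can be made concrete with a toy numerical example. The logistic and squared-distance forms below are generic choices made for illustration; they are not the specific IRT models used in the studies cited.

```python
# Toy comparison of a monotonic (dominance) IRF and a folded (ideal point) IRF.
import numpy as np

theta = np.linspace(-3, 3, 7)   # latent trait levels of hypothetical respondents
a, b = 1.5, 0.0                 # item discrimination and location

# Dominance model: endorsement probability rises monotonically with the trait.
p_dominance = 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Ideal point model: endorsement probability peaks near the item's location and
# falls off in both directions, producing a bell-shaped (nonmonotonic) IRF.
p_ideal_point = np.exp(-a * (theta - b) ** 2)

for t, pd, pi in zip(theta, p_dominance, p_ideal_point):
    print(f"theta = {t:+.1f}   dominance = {pd:.2f}   ideal point = {pi:.2f}")
```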

The possibility of using nonmonotone items in personality questionnaires opens new avenues for theoretical and applied research. First, there is a need to determine what features of items can yield nonmonotone response functions and whether some personality dimensions tend to have more such items than others have. Second, the effects on construct and predictive validity due to eliminating or retaining items of neutral standing during personality test construction have not been investigated. If truly nonmonotone items are scored using dominance models (i.e., number right or summated ratings), the rank order of high/low-scoring individuals may be affected by the choice of scoring model; so that the accuracy of hiring decisions could suffer. More research is needed to determine whether these results indeed exert measurable influences on validity and utility in organizations.


URL: https://www.sciencedirect.com/science/article/pii/B0123693985005296

Assessment with Brief Behavior Rating Scales

ROBERT J. VOLPE, GEORGE J. DUPAUL, in Handbook of Psychoeducational Assessment, 2001

ADHD Symptom Checklist—IV

General Overview and Psychometric Characteristics

The ADHD-Symptom Checklist—4 (SC-4; Gadow & Sprafkin, 1997) is a 50-item parent- and teacher-completed checklist composed of four categories: (a) ADHD, (b) Oppositional Defiant Disorder (ODD), (c) the Peer Conflict Scale, and (d) the Stimulant Side Effects Checklist. Items of the ADHD and ODD categories are highly similar to individual diagnostic criteria for the corresponding disorders set forth in the DSM-IV. The SC-4 was developed for several uses. First, the SC-4 was designed as a screening instrument for the most common causes of referral (e.g., disruptive child behavior) to child psychiatric clinics and to monitor changes in these symptoms during treatment. Second, given the prescription rate of psychostimulants in children with externalizing behavior difficulties, the developers of the SC-4 provided a measure of stimulant side effects that includes three indices (Mood, Attention-Arousal, and Physical Complaints).

No internal consistency data are available for the SC-4. The test-retest reliability of the SC-4 appears adequate. Reliability coefficients for the symptom severity scores of the ADHD, ODD, and Peer Conflict categories (6-week latency) ranged from .67 to .89.

No factor analytic data are available for the SC-4. The discriminant and concurrent validities of the SC-4 have been investigated and appear adequate. With a few exceptions, scores on the SC-4 have been shown to discriminate between “normal” and clinically referred groups of children and adolescents (Sprafkin & Gadow, 1996; Gadow & Sprafkin, 1997). Supporting the concurrent validity of the SC-4, the manual reports moderate to high correlations between the SC-4 categories and commonly used checklists such as the CBCL, TRF, the Mother's Objective Measure for Subgrouping (MOMS; Loney, 1984), and the IOWA Conners’ (Loney & Milich, 1982).

The predictive validity of the SC-4 was assessed by investigating the degree to which cutoff scores on various SC-4 categories agreed with relevant clinical diagnoses. The statistics of sensitivity (the degree to which a measure minimizes false negatives) and specificity (the degree to which a measure minimizes false positives) are commonly used for assessing predictive validity. Generally, the predictive validity of the parent- and teacher-completed SC-4 was moderate to high (i.e., sensitivity between .58 and .89; specificity between .57 and .94).
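Sensitivity and specificity are simple functions of the counts of true and false positives and negatives. The sketch below uses invented counts purely to show the computation; they are not SC-4 data.

```python
# Sensitivity and specificity from cutoff-based classifications versus clinical diagnoses.
def sensitivity_specificity(tp, fp, tn, fn):
    sensitivity = tp / (tp + fn)   # proportion of true cases the cutoff identifies
    specificity = tn / (tn + fp)   # proportion of non-cases the cutoff correctly excludes
    return sensitivity, specificity

# Hypothetical example: 40 children meet diagnostic criteria (34 exceed the cutoff),
# 60 do not meet criteria (51 fall below the cutoff).
sens, spec = sensitivity_specificity(tp=34, fp=9, tn=51, fn=6)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")   # 0.85, 0.85
```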

The treatment sensitivity of the SC-4 has been investigated in several double-blind placebo-controlled studies of stimulant medication (Gadow, Nolan, Sverd, Sprafkin, & Paolicelli, 1990; Gadow, Sverd, Sprafkin, Nolan, & Ezor, 1995). Differences in scores between doses indicate that the SC-4 is a good measure of response to stimulant medication. Furthermore, the instrument appears sensitive to several stimulant side effects.

The normative data for the SC-4 were recently expanded (Gadow & Sprafkin, 1999). According to the authors, T-scores between old and new samples are very similar; however, some differences may be noted between the manual (Gadow & Sprafkin, 1997) and the revised Score Sheets. Normative data are available on 4559 children and adolescents between the ages of 3 and 18. It should be noted that normative data for the SC-4 categories of ADHD and ODD were, with few exceptions, generated with other checklists developed by the same authors (e.g., Early Childhood Inventories and Child Symptom Inventories). Items are identical except for eight ADHD items that were shortened for the SC-4. With the exception of the preschool samples, the normative samples are smaller for the Peer Conflict scale. In general, data were gathered across a number of geographic regions; however, minorities were somewhat underrepresented for some age groups.

Administration and Scoring

Checklists, Score Summary Records, and Score Sheets for the SC-4 may be obtained with the manual as a kit, and purchased separately thereafter. Identical checklists may be used for parents and teachers, and both parent and teacher scores can be recorded on the same Score Summary Record. There are also separate Score Sheets for parent- and teacher-completed checklists, which present male and female scoring information on either side of the form.

The SC-4 should take no more than 10–15 minutes for informants to complete; scoring is done by recording raw category scores in the cells on the form. In using the symptom severity method of scoring, individual items are scored as follows: “Never” = 0, “Sometimes” = 1, “Often” = 2, and “Very often” = 3. Item scores from each category are then summed to obtain raw category scores. The Inattentive and Hyperactive/Impulsive scores are summed to obtain an ADHD Combined Type raw score. Separate Score Sheets are available that include tabulated T-scores. The SC-4 may also be scored using the Symptom Criterion method. Here, items are scored as follows: “Never” and “Sometimes” = 0, “Often” and “Very often” = 1. The DSM-IV specifies the number of symptoms required to meet criteria for various diagnoses, and this serves as the basis for meeting criteria for the disorders represented in the SC-4. The authors of the SC-4 provide the DSM-IV symptom count criteria on the Score Summary Sheet.
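The two scoring methods translate directly into a short routine. The sketch below is a minimal illustration for a single category; the ratings and the three-symptom cutoff are invented, and the actual symptom-count criteria come from the DSM-IV and the SC-4 Score Summary Sheet.

```python
# Symptom severity scoring (0-3 per item) versus Symptom Criterion scoring (0/1 per item).
SEVERITY = {"Never": 0, "Sometimes": 1, "Often": 2, "Very often": 3}
CRITERION = {"Never": 0, "Sometimes": 0, "Often": 1, "Very often": 1}

def symptom_severity_score(ratings):
    """Sum the 0-3 item scores to obtain the raw category score."""
    return sum(SEVERITY[r] for r in ratings)

def symptom_criterion_count(ratings):
    """Count items rated Often or Very often toward the symptom criterion."""
    return sum(CRITERION[r] for r in ratings)

ratings = ["Often", "Sometimes", "Very often", "Never", "Often"]
print(symptom_severity_score(ratings))        # raw category score: 8
print(symptom_criterion_count(ratings) >= 3)  # meets a hypothetical 3-symptom cutoff: True
```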

Usability and Usefulness

The SC-4 appears to be a useful measure of childhood externalizing behavior difficulties (ADHD, ODD, and Interpersonal Aggression). Given the high degree of comorbidity among childhood disruptive behavior disorders, an instrument like the SC-4 that is sensitive to both ADHD and more severe behavior problems is highly desirable. The Peer Conflict Scale and the Stimulant Side Effects Checklist appear to be useful indices of interpersonal aggression and stimulant side effects, respectively. Furthermore, data supporting the sensitivity of this instrument to treatment conditions suggest that the SC-4 is a useful instrument for monitoring child behavior. Finally, the inclusion of the Stimulant Side Effects Checklist makes the SC-4 a valuable tool in the titration of stimulant medication.


URL: https://www.sciencedirect.com/science/article/pii/B9780120585700500148

Information, Economics of

S.S. Lippman, J.J. McCall, in International Encyclopedia of the Social & Behavioral Sciences, 2001

3 Contracts, Incentives, and Asymmetric Information

3.1 Introduction

The past two decades of the twentieth century have witnessed dramatic changes in economic theory. Several years ago most graduate texts in economic theory were founded on price theory. Today, in the major economic departments, economic theory revolves around noncooperative game theory. Price theory has been relegated to the back seat. Before the ascendance of game theory, the emphasis in economics was on the usefulness of economic models in explaining empirical phenomena. While this emphasis persists, it has been overshadowed by the quest to apply strategic thinking to resolve economic problems.

The key ingredients of this remarkable transformation are: the Nash equilibrium (NE) concept, Harsanyi's characterization of contracts with asymmetric information as noncooperative games, and Selten's explicit consideration of time and the elimination of equilibria associated with noncredible threats and promises.

Nash's equilibrium is a profile of strategies, one for each player, in which each strategy is a best response to the strategies of the other players in the n-person noncooperative game. This concept pervades economics. As a simple example, consider the traveler who drives a car first in the US and then in the UK. He drives on the right-hand side of the road in the US, as does everyone else. When he visits the UK, his best response to British drivers is to drive on the left side. These Nash equilibria are akin to rational expectations equilibria.

The NE is not immaculate. Jack Hirshleifer (private communication) sees

two major problems with NE: (1) Each player's decision is supposed to be a ‘best reply’ to the opponent's corresponding choice. But the Nash protocol requires simultaneity. Thus, neither side knows the opponents' strategy to which it is supposed to be replying optimally. Without that knowledge, how can someone make a ‘best reply’?…(2)…the ‘best reply’ has to be to the opponent's strategy, and not just to his observed actions or moves. In all but the very simplest cases, only a tiny fraction of a player's full strategy will ever be visible.…So in general the player can never know whether his current strategy is or is not a ‘best reply’ to what the opponent has in mind. …NE is fine for ‘toy worlds’ of our textbooks, but I remain skeptical of its general predictive validity.

van Damme and Weibull (1995) observe that ‘John Harsanyi showed that games with incomplete information could be remodeled as games with complete but imperfect information, thereby enabling analysis of this important class of games and providing a theoretical foundation for “the economics of information”.’ Signaling, moral hazard, and adverse selection are prominent members of this class. A game is said to be one of complete information if Nature makes the first move and this move is observed by all players. A game has perfect information if each information set—the set of nodes in the tree such that one is known to be the actual node—is a singleton. See Rasmusen (1989). Harsanyi adopts a Bayesian approach assuming that each player may be of several types—a type specifies the information a player possesses regarding the game. The resulting refinement of Nash's equilibrium is called a Bayes–Nash equilibrium.

Reinhard Selten was the first to refine the Nash equilibrium for analysis of dynamic strategic interactions. Such refinement is necessary since many equilibria entail noncredible threats and do not make economic sense. Selten's formalization of the requirement that only credible threats should be considered, the concept of subgame perfect equilibrium, is used widely in the industrial organization literature. It has generated significant insights there and in other fields of economics.

At first it was not clear how the problems of asymmetric information could be formulated as noncooperative games. Indeed, much significant research in signaling and insurance was performed in the 1970s with little reference to game theory. These models generated important economic implications, but were stymied by problems of equilibrium and credibility. This was changed suddenly when fundamental papers by Kreps and Wilson (1982) and others showed that this research could use the deep insights of Harsanyi and Selten to illuminate credibility and equilibrium. The ensuing unification of industrial organization was analogous to the epiphany that occurred in search theory when optimal stopping, dynamic programming, and matching were applied to search problems.

3.2 Principal-agent Models

This section is a brief discussion of three of the most important asymmetric information models: moral hazard, adverse selection, and signaling. All three belong to the class of principal-agent models. In these models, the principal P designs the contract. The agent A either accepts or rejects the contract depending on the expected utility of the contract vis-à-vis the utility from other alternatives. Finally, the agent performs the task specified by the principal. The two parties are opposed in that the payment to the agent is a cost for the principal, and the effort of the agent is costly to the agent but beneficial to the principal. In determining the optimal contract between principal and agent, this opposition must be resolved. Macho-Stadler and Perez-Castrillo (1997) see this as ‘one of the most important objectives of the economics of information.’

There are two parties to a P–A model. One is informed while the other does not know a crucial piece of information. For example, in the insurance P–A model, the insurance company is the principal and the insuree is the agent. It is usually assumed that the agent knows his health status, whereas the principal is uncertain. The P–A model is a bilateral monopoly. Hence, the nature of the bargaining process must be specified. For simplicity, it is assumed that either P or A receives all of the surplus. For example, P does this by stipulating a ‘take-it-or-leave-it’ contract, which the agent either accepts or rejects. Salanie (1997) notes that bargaining under asymmetric information is very complicated: ‘There is presently no consensus among theorists on what equilibrium concept should be used.’ Salanie observes that: ‘the P–A game is a Stackelberg game in which the leader (who proposes the contract) is called the principal and the follower (the party who just has to accept or reject the contract) is called the agent.’

The actual bargaining in a concrete setting is unlikely to have this 0–1 structure, where the game terminates if either an agreement or disagreement happens. Instead, the bargaining may continue or, more likely, the disgruntled agent will search until he finds an acceptable contract. Such a model would combine the P–A analysis with search theory.

3.3 Three Principal-agent Models with Asymmetric Information

Moral hazard occurs when the agent's actions are not observed by the principal. More specifically, moral hazard is present when the agent's actions (a) are influenced by the terms of the contract and (b) cannot be fully specified in the contract. These hidden actions are evident in automobile insurance contracts, where the driving behavior of the insuree is not known by the insurance company. The contract usually has a deductible to mitigate the risk to the insurance company.

Adverse selection occurs when the agent is privy to pertinent information before the contract is finalized. This information is hidden from the principal. Adverse selection is evident in health insurance contracts where certain characteristics (health status) of the agent are imperfectly known by the insurance company. An insurance contract is designed for each group with a particular health characteristic (sickly or healthy) so that each member of the group has an incentive to buy it.

Signaling occurs when one of the parties (the agent) to the contract has pertinent information regarding his type. Before entering the contract, the agent can signal the principal that he has this type, i.e., the behavior of the informed party conveys (signals) the type information to the uninformed party.

A key observation is that each of these asymmetric information models can be represented as a noncooperative game. The crucial equilibrium concept is a refinement of the Nash equilibrium. For example, the signaling game has a perfect Bayesian equilibrium. Excellent discussions are given in Gibbons (1992) and Macho-Stadler and Perez-Castrillo (1997).

One of the most important topics in the economics of information is the optimal design of contracts under symmetric and asymmetric information. In the symmetric case, the principal designs the contract so that the expected marginal payoff equals the marginal cost of effort. If the principal is risk neutral, then the optimal contract is one in which the principal accepts all the risk and the agent is, therefore, fully insured, receiving a payoff that is independent of the outcome. In the asymmetric case, the situation is much more complicated. The optimal contract balances the conflict between two opposing goals: efficiency in the allocation of risk between principal and agent and maintaining the incentives of the agent. For lucid presentations of the analysis, see Macho-Stadler and Perez-Castrillo (1997) and Salanie (1997).
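A toy numerical example may help fix ideas for the symmetric-information case with a risk-neutral principal and a risk-averse agent. The utility function, reservation utility, effort costs, and expected outputs below are all invented for illustration under the standard textbook assumptions named in the text.

```python
# Symmetric information, risk-neutral principal, risk-averse agent with u(w) = sqrt(w) - c(e):
# the optimal contract pays a fixed wage (full insurance) chosen so the agent's
# participation constraint binds, and the principal stipulates the profit-maximizing effort.
U_RESERVATION = 2.0                        # agent's outside (reservation) utility
EFFORT_COST = {"low": 0.5, "high": 1.5}    # c(e)
EXPECTED_OUTPUT = {"low": 10.0, "high": 18.0}

def fixed_wage(effort):
    # Participation constraint: sqrt(w) - c(e) = reservation utility  =>  w = (u0 + c(e))^2
    return (U_RESERVATION + EFFORT_COST[effort]) ** 2

for effort in EXPECTED_OUTPUT:
    profit = EXPECTED_OUTPUT[effort] - fixed_wage(effort)
    print(f"effort = {effort}: wage = {fixed_wage(effort):.2f}, expected profit = {profit:.2f}")

best = max(EXPECTED_OUTPUT, key=lambda e: EXPECTED_OUTPUT[e] - fixed_wage(e))
print("contract stipulates effort:", best)   # 'high' in this example
```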


URL: https://www.sciencedirect.com/science/article/pii/B0080430767022440

Evaluation

Beverly Park Woolf, in Building Intelligent Interactive Tutors, 2009

6.2.2 Stat Lady: A Statistics Tutor

One goal of the Stat Lady tutor was to teach introductory descriptive statistics (e.g., data organization and plotting) to adult learners (Shute, 1995; Shute and Gawlick-Grendell, 1993, 1994). The premise was that learning is a constructive process and should be facilitated by an environment anchored in real-world examples and problems. The fundamental idea was that learning should be based on prior, familiar knowledge. Stat Lady provided an interactive environment in which students were encouraged to become actively involved. It stored and continually updated a probability vector consisting of curriculum elements (CEs), the fundamental units of knowledge in descriptive statistics, e.g., mean or median (Shute, 1995). Each CE was linked to a measure of the probability that a student had mastered that particular element. CEs represented symbolic knowledge (e.g., construct the formula for the mean), procedural knowledge (e.g., collect data and compute the mean), and cognitive knowledge (e.g., distinguish the definition and characteristics of the mean from the median and mode). For each of these independent chunks of knowledge, the tutor maintained a score based not only on whether the student solved the problem correctly, but also on the amount of help provided.
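The published description does not spell out the exact update rule for the probability vector, so the following is only a minimal sketch, under assumptions of my own, of how a mastery probability p(CE) could be adjusted for correctness and for the amount of help provided.

```python
# Hypothetical mastery-probability update for a curriculum element (CE); the gain,
# penalty, and hint-discount parameters are invented, not Stat Lady's actual values.
def update_mastery(p_ce, correct, hints_used, gain=0.25, penalty=0.15, hint_cost=0.05):
    """Raise p(CE) after a correct response (discounted by the help provided),
    lower it after an incorrect response, and keep the result in [0, 1]."""
    if correct:
        p_ce += gain * (1 - p_ce) * max(0.0, 1 - hint_cost * hints_used)
    else:
        p_ce -= penalty * p_ce
    return min(max(p_ce, 0.0), 1.0)

p = 0.40                                             # prior probability of mastering "mean"
p = update_mastery(p, correct=True, hints_used=2)    # correct, but with two hints
print(round(p, 3))                                   # 0.535
```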

Goals of the Stat Lady evaluation included predicting learning outcomes, showing the predictive validity of the student model, and determining some factors that computers contribute to learning. Hypotheses included that (1) students using Stat Lady would acquire greater procedural skills, (2) high-aptitude students would learn more from the tutor, (3) low-aptitude students would learn more from traditional teachers, and (4) aptitude-treatment interactions (differential impact of treatments depending on a student's aptitude) exist.

The evaluation design included both benchmark (tutor versus lectures and workbook) and within-system or tutor1 versus tutor2 (nonintelligent and intelligent versions of the tutor) comparisons. There were two treatment groups and one control group. The nonintelligent version delivered the same curriculum for all learners and used fixed thresholds for progress through the curriculum. The intelligent version individualized the sequence of problems based on analysis of the learner's degree of mastery and a more complex, symbolic, procedural, and conceptual knowledge representation (CEs) to provide more focused remediation. The evaluation design was pretest+intervention+posttest.

The evaluation design was instantiated with dependent (degree of learning and time to achieve mastery) and independent measures (student aptitude, pretest score and demographics) for the two groups, tutor1 versus tutor2. Most studies involved hundreds of students recruited from an employment agency; those with substantial knowledge of statistics were filtered out. In all, 331 students were involved in the tutor versus workbook study, 168 students in the unintelligent tutor versus traditional lecture, and 100 students in the intelligent tutor study. Student aptitude was extensively tested (cognitive and personality tests) before students worked with the tutor. This evaluation included usability studies and asked students about their experiences with the tutor.

Stat Lady results included impressive predictive capability on posttests, showing that the student model was a strong indicator of posttest results and accurately identified student ability (Figure 6.5). This measure of diagnostic validity compared student model values for the probability of knowing a curriculum element, p(CE), with performance on the corresponding curriculum element (CE). Four cognitive ability measurements (e.g., grades on standard exams) accounted for 66% of factor variance. This study showed the benefit of including aptitudes and other individual measures as a component of tutor prediction.


FIGURE 6.5. Stat Lady prediction of students' posttest scores. The tutor predicted student scores based on student aptitude and the probability of knowledge of statistics concepts. (Adapted from Shute, 1995.)

All hypotheses were supported (Shute, 1995). A significant aptitude-treatment interaction was found: high-aptitude subjects learned more from Stat Lady than from the lecture environment. Also, students who used the tutor reported having more fun learning, and perceived the tutor's help as more beneficial and its instruction as clearer, than did workbook subjects (Shute, 1995). Results showed that both treatment groups learned significantly more than the control group, yet there was no difference between the two treatment groups in outcome performance after three hours of instruction. The unintelligent version improved students' scores by more than two standard deviations compared to their pretest scores. Stat Lady students demonstrated improved pre-post test score differences by about the same margin as the traditional lecture approach (i.e., about one standard deviation) and over the same time on task (about three hours).

In the discussion of the Stat Lady evaluation, the authors addressed the lack of difference between the two treatment groups and several limitations (Shute, 1995). Even though there were no differences between the two treatment groups, this was viewed as encouraging because, as a result of a sampling error, students assigned to the Stat Lady condition were at a disadvantage, scoring about 20 points less on the quantitative exam measure compared to the lecture group, and about 25 points less than the control group. The lecture constituted a more familiar learning environment for the control subjects, and the professor administering the lecture had more than 20 years of experience teaching this subject matter. In contrast, this was Stat Lady's first teaching assignment.

The evaluation studies were clean. As a result of extensive pretesting, the tutor was able to skip immediately to a topic that the student had a low probability of knowing. Two major aspects of the system were evaluated in sequence: the diagnostic component was tested first with remediation turned off, and then the remediation was turned on for the second study. Statistically significant improvement in learning was reported when remediation was used in addition to cognitive diagnosis. One issue brought out was the absence of a semantic hierarchical structure for the curriculum elements. Because each curriculum element was considered independent of all others, each CE encoded only a rather small declarative or procedural piece of the domain. Potentially important connections among curriculum elements were ignored. Thus, if one chunk of knowledge was difficult for many students, a separate piece of knowledge connected to that difficult piece should perhaps have been taught and tested as well.

The evaluation had several limitations. The evaluation team, consisting of the same people who developed the tutor, evaluated student responses in the final test. Statistics were used to handle the confounding factor of different pretest values. The learning efficiency of the tutor was not shown. Given the nature of the tutor, one might question whether this sort of evaluation actually tested changes in deeper understanding (e.g., ability to reason at a conceptual level and to apply learning to novel problems). The process seemed tautological: a cognitive task was decomposed into a set of problems and the tutor presented questions to the student along with remediation or continued practice as needed, based on diagnosis of the student's mastery of cognitive tasks. The posttest showed that eventually students were able to take actions that were logically consistent with these symbolic, procedural, and cognitive knowledge elements.


URL: https://www.sciencedirect.com/science/article/pii/B978012373594200006X

Analyzing qualitative data

Jonathan Lazar, ... Harry Hochheiser, in Research Methods in Human Computer Interaction (Second Edition), 2017

11.4.3.1 Validity

Validity is a very important concept in qualitative HCI research in that it measures the accuracy of the findings we derive from a study. There are three primary approaches to validity: face validity, criterion validity, and construct validity (Cronbach and Meehl, 1955; Wrench et al., 2013).

Face validity is also called content validity. It is a subjective validity criterion that usually requires a human researcher to examine the content of the data to assess whether on its “face” it appears to be related to what the researcher intends to measure. Due to its high subjectivity, face validity is more susceptible to bias and is a weaker criterion compared to construct validity and criterion validity. Although face validity should be viewed with a critical eye, it can serve as a helpful technique to detect suspicious data in the findings that need further investigation (Blandford et al., 2016).

Criterion validity tries to assess how accurately a new measure can predict a previously validated concept or criterion. For example, if we developed a new tool for measuring workload, we might ask participants to complete a set of tasks, using the new tool to measure the participants' workload. We would also ask the participants to complete the well-established NASA Task Load Index (NASA-TLX) to assess their perceived workload. We can then calculate the correlation between the two measures to find out how effectively the new tool predicts the NASA-TLX results. A higher correlation coefficient would suggest higher criterion validity. There are three subtypes of criterion validity, namely predictive validity, concurrent validity, and retrospective validity. For more details regarding each subtype, see Chapter 9, “Reliability and Validity,” in Wrench et al. (2013).
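In practice the check reduces to a correlation between the two sets of scores. A minimal sketch with made-up data follows; NumPy's corrcoef returns the Pearson correlation matrix.

```python
# Correlating a hypothetical new workload measure with NASA-TLX scores from the same participants.
import numpy as np

new_tool = np.array([42, 55, 61, 38, 70, 49, 66, 58])   # scores from the new tool
nasa_tlx = np.array([45, 58, 63, 41, 75, 47, 69, 60])   # established criterion measure

r = np.corrcoef(new_tool, nasa_tlx)[0, 1]
print(f"Pearson r = {r:.2f}")   # a higher r suggests higher criterion validity
```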

Construct or factorial validity is usually adopted when a researcher believes that no valid criterion is available for the research topic under investigation. Construct validity is a validity test of a theoretical construct and examines “What constructs account for variance in test performance?” (Cronbach and Meehl, 1955). In Section 11.4.1.1 we discussed the development of potential theoretical constructs using the grounded theory approach. The last stage of the grounded theory method is the formation of a theory. The theory construct derived from a study needs to be validated through construct validity. From the technical perspective, construct or factorial validity is based on the statistical technique of “factor analysis” that allows researchers to identify the groups of items or factors in a measurement instrument. In a recent study, Suh and her colleagues developed a model for user burden that consists of six constructs and, on top of the model, a User Burden Scale. They used both criterion validity and construct validity to measure the efficacy of the model and the scale (Suh et al., 2016).

In HCI research, establishing validity implies constructing a multifaceted argument in favor of your interpretation of the data. If you can show that your interpretation is firmly grounded in the data, you go a long way towards establishing validity. The first step in this process is often the construction of a database (Yin, 2014) that includes all the materials that you collect and create during the course of the study, including notes, documents, photos, and tables. Procedures and products of your analysis, including summaries, explanations, and tabular presentations of data can be included in the database as well.

If your raw data is well organized in your database, you can trace the analytic results back to the raw data, verifying that relevant details behind the cases and the circumstances of data collection are similar enough to warrant comparisons between observations. This linkage forms a chain of evidence, indicating how the data supports your conclusions (Yin, 2014). Analytic results and descriptions of this chain of evidence can be included in your database, providing a roadmap for further analysis.

A database can also provide increased reliability. If you decide to repeat your experiment, clear documentation of the procedures is crucial and careful repetition of both the original protocol and the analytic steps can be a convincing approach for documenting the consistency of the approaches.

Well-documented data and procedures are necessary, but not sufficient for establishing validity. A very real validity concern involves the question of the confidence that you might have in any given interpretive result. If you can only find one piece of evidence for a given conclusion, you might be somewhat wary. However, if you begin to see multiple, independent pieces of data that all point in a common direction, your confidence in the resulting conclusion might increase. The use of multiple data sources to support an interpretation is known as data source triangulation (Stake, 1995). The data sources may be different instances of the same type of data (for example, multiple participants in interview research) or completely different sources of data (for example, observation and time diaries).

Interpretations that account for all—or as much as possible—of the observed data are easier to defend as being valid. It may be very tempting to stress observations that support your pet theory, while downplaying those that may be more consistent with alternative explanations. Although some amount of subjectivity in your analysis is unavoidable, you should try to minimize your bias as much as possible by giving every data point the attention and scrutiny it deserves, and keeping an open mind for alternative explanations that may explain your observations as well as (or better than) your pet theories.

You might even develop some alternative explanations as you go along. These alternatives provide a useful reality check: if you are constantly re-evaluating both your theory and some possible alternatives to see which best match the data, you know when your theory starts to look less compelling (Yin, 2014). This may not be a bad thing—rival explanations that you might never find if you cherry-picked your data to fit your theory may actually be more interesting than your original theory. Whichever explanations best match your data, you can always present them alongside the less successful alternatives. A discussion that shows not only how a given model fits the data but how it is a better fit than plausible alternatives can be particularly compelling.

Well-documented analyses, triangulation, and consideration of alternative explanations are recommended practices for increasing analytic validity, but they have their limits. As qualitative studies are interpretations of complex datasets, they do not claim to have any single, “right” answer. Different observers (or participants) may have different interpretations of the same set of raw data, each of which may be equally valid. Returning to the study of palliative care depicted in Figure 11.2, we might imagine alternative interpretations of the raw data that might have been equally valid: comments about temporal onset of pain and events might have been described by a code “event sequences,” triage and assessment might have been combined into a single code, etc. Researchers working on qualitative data should take appropriate measures to ensure validity, all the while understanding that their interpretation is not definitive.


URL: https://www.sciencedirect.com/science/article/pii/B978012805390400011X

Content Analysis and Television

Erica Scharrer, in Encyclopedia of Social Measurement, 2005

Reliability and Validity in Studies of Television Content

Definitions

Studies that employ the method of content analysis to examine television content are guided by the ideals of reliability and validity, as are many research methods. Reliability has to do with whether the use of the same measures and research protocols (e.g., coding instructions, coding scheme) time and time again, as well as by more than one coder, will consistently result in the same findings. If so, those results can be deemed reliable because they are not unique to the subjectivity of one person's view of the television content studied or to the researcher's interpretations of the concepts examined.

Validity refers primarily to the closeness of fit between the ways in which concepts are measured in research and the ways in which those same concepts are understood in the larger, social world. A valid measure is one that appropriately taps into the collective meanings that society assigns to concepts. The closer the correspondence between operationalizations and complex real-world meanings, the more socially significant and useful the results of the study will be. In content analysis research of television programming, validity is achieved when samples approximate the overall population, when socially important research questions are posed, and when both researchers and laypersons would agree that the ways that the study defined major concepts correspond with the ways that those concepts are really perceived in the social world.

Validity: Categories and Indicators

Valid measures of general concepts are best achieved through the use of multiple indicators of the concept in content analysis research, as well as in other methods. A study of whether television commercials placed during children's programming have “healthy” messages about food and beverages poses an example. There are certainly many ways of thinking about what would make a food or beverage “healthy.” Some would suggest that whole categories of foods and beverages may be healthy or not (orange juice compared to soda, for instance). Others would look at the amount of sugar or perhaps fat in the foods and beverages to determine how healthy they were. Still others would determine healthiness by documenting whether the foods and beverages contain vitamins and minerals. The content analysis codes or categories used to measure the healthiness of the foods and beverages shown in commercials would ideally reflect all of these potential indicators of the concept. The use of multiple indicators bolsters the validity of the measures implemented in studies of content because they more closely approximate the varied meanings and dimensions of the concept as it is culturally understood.

There are two major types of validity. External validity has to do with the degree to which the study as a whole or the measures employed in the study can be generalized to the real world or to the entire population from which the sample was drawn. It is established through sampling as well as through attempts to reduce artificiality. An example of the latter is having coders make some judgments by watching television content only once, rather than stopping and starting a videotaped program multiple times, in order to approximate how the content would be experienced by actual viewing audiences. The other type of validity is internal validity, which refers to the closeness of fit between the meanings of the concepts that we hold in everyday life and the ways those concepts are operationalized in the research. The validity of concepts used in research is determined by their prima facie correspondence to the larger meanings we hold (face validity), the relationship of the measures to other concepts that we would expect them to correlate with (construct validity) or to some external criterion that the concept typically predicts (criterion or predictive validity), and the extent to which the measures capture multiple ways of thinking of the concept (content validity).

Reliability: Training, Coding, and Establishing Intercoder Agreement

A particular strength of content studies of television is that they provide a summary view of the patterns of messages that appear on the screens of millions of people. The goal of a content analysis is that these observations are universal rather than significantly swayed by the idiosyncratic interpretations or points of view of the coder. Researchers go to great lengths to ensure that such observations are systematic and methodical rather than haphazard, and that they strive toward objectivity. Of course, true objectivity is a myth rather than a reality. Yet, content analysis research attempts to minimize the influence of subjective, personal interpretations.

In order to achieve this aim, multiple coders are used in content analysis to perform a check on the potential for personal readings of content by the researcher, or for any one of the coders to unduly shape the observations made. Such coders must all be trained to use the coding scheme to make coding decisions in a reliable manner, so that the same television messages being coded are dealt with the same way by each coder each time they are encountered. Clear and detailed instructions must be given to each coder so that difficult coding decisions are anticipated and a procedure for dealing with them is in place and is consistently employed. Most likely, many pretests of the coding scheme and coding decisions will be needed and revisions will be made to eliminate ambiguities and sources of confusion before the process is working smoothly (i.e., validly and reliably). Researchers often limit the amount of coding to be done by one coder in one sitting because the task may get tiresome, and reliable, careful thought may dwindle over time.

In addition to training coders on how to perform the study, a more formal means of ensuring reliability—calculations of intercoder reliability—is used in content analysis research. The purpose of intercoder reliability is to establish mathematically the frequency with which multiple coders agree in their judgments of how to categorize and describe content. In order to compute intercoder reliability, the coders must code the same content to determine whether and to what extent their coding decisions align. Strategies for determining how much content to use for this purpose vary, but a general rule of thumb is to have multiple coders overlap in their coding of at least 10% of the sample. If they agree sufficiently in that 10%, the researcher is confident that each can code the rest of the sample independently because a systematic coding protocol has been achieved.

A number of formulas are used to calculate intercoder reliability. Holsti's coefficient is a fairly simple calculation, deriving a percent agreement from the number of items coded by each coder and the number of times they made the exact same coding decision. Other researchers use Pearson's correlation to determine the association between the coding decisions of one coder compared to another (or multiple others). Still other formulas, such as Scott's pi, take chance agreement into consideration. There is no set standard regarding what constitutes sufficiently high intercoder reliability, although most published accounts do not fall below 70–75% agreement.
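The sketch below works through two of these indices on a small set of hypothetical coding decisions: simple percent agreement (the core of Holsti's approach for two coders coding the same items) and Scott's pi, which corrects that agreement for chance using the pooled category proportions. The coding data are invented.

```python
# Percent agreement and Scott's pi for two coders who coded the same ten items.
from collections import Counter

coder_a = ["pos", "neg", "pos", "neutral", "pos", "neg", "pos", "neutral", "pos", "pos"]
coder_b = ["pos", "neg", "neutral", "neutral", "pos", "pos", "pos", "neutral", "pos", "pos"]

n = len(coder_a)
observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n   # observed (percent) agreement

pooled = Counter(coder_a) + Counter(coder_b)                   # both coders' decisions pooled
expected = sum((count / (2 * n)) ** 2 for count in pooled.values())  # chance agreement

scotts_pi = (observed - expected) / (1 - expected)
print(f"percent agreement = {observed:.2f}, Scott's pi = {scotts_pi:.2f}")   # 0.80, 0.64
```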

Balancing Validity and Reliability

In studies of television content, the goals of establishing validity and reliability must be balanced. Measures used in content analysis research could be reliable but not valid if they repeatedly uncover the same patterns of findings, but those findings do not adequately measure the concepts that they are intending to measure. Furthermore, often the attempts to approximate the complex understandings of concepts that occur in the social world in research designs strengthen validity at the same time that they threaten reliability, because they are more nuanced and less transparent.

Consider an example in a study of television news coverage of presidential elections. The researcher wants to determine what proportion of the newscast is devoted to coverage of the presidential candidates during election season, as well as whether those candidates receive positive or negative coverage. The former portion of the research question would be relatively straightforward to study and would presumably be easily and readily agreed on by multiple coders. All of the items in the newscast could be counted and the number of items devoted to the presidential candidates could be compared to the total number (similarly, stories could be timed). The latter part of the research question, however, is likely to be less overt and relies instead on a judgment to be made by coders, rather than a mere observation of the conspicuous characteristics of the newscast. Indeed, if the researcher were to operationalize the tone of the coverage on a scale of 1 (very negative) to 5 (very positive), the judgments called for become more finely distinct, and agreement, and therefore reliability, may be compromised. On the other hand, that type of detailed measure enhances validity because it acknowledges that news stories can present degrees of positivity or negativity that are meaningful and potentially important with respect to how audiences actually respond to the stories.

The straightforward, readily observed, overt types of content for which coders use denotative meanings to make coding decisions are called “manifest” content. The types of content that require what Holsti in 1969 referred to as “reading between the lines,” or making inferences or judgments based on connotative meanings, are referred to as “latent” content. The former maximizes reliability and the latter maximizes validity. Although scholars using the method have disagreed about the best way to proceed, many suggest that it is useful to investigate both types of content and to balance their presence in a coding scheme. Coders must be trained especially well for making decisions based on latent meaning, however, so that coding decisions remain consistent within and between coders.


Audiovisual Records, Encoding of

Marc H. Bornstein, Charissa S.L. Cheah, in Encyclopedia of Social Measurement, 2005

Illustrations

Infant-Mother Interaction

It is uncommon in research for the absolute frequencies of behaviors to be of interest (e.g., whether infants vocalize 7, 10, or 15 times). Although in certain instances population base rates convey meaningful information, researchers typically use relative frequencies to compare individuals or groups (e.g., Who vocalizes more: typically developing or Down's babies?) and to rank individuals or groups on particular variables with an eye to relating the ranks to other variables (e.g., Does infant vocalization predict child language development?). It is assumed, although not established, that continuous coding provides a more accurate reflection of reality than do sampling techniques. Do continuous recording and partial-interval sampling procedures allow investigators to reach similar conclusions regarding the relative frequency and standing of behaviors? Do partial-interval sampling procedures produce reliable estimates of actual frequencies obtained by continuous recording? To address these questions, U.S. and Japanese infant-mother dyads were compared in a 2002 methodological investigation by Bornstein that directly contrasted continuous recording, in which actual frequencies of behaviors of infants and mothers were obtained, with partial-interval sampling.

Methods and Procedures

Home observations were conducted identically in all visits. Briefly, mothers were asked to behave in their usual manner and to disregard the observer's presence insofar as possible; besides the trained observer-filmer (always a female native of the country), only the baby and mother were present; and observations took place at times of the day that were optimal in terms of babies' being in awake and alert states. After a period of acclimation, hour-long audiovisual records were made of infants and mothers in naturalistic interaction.

Four infant and four maternal activities were coded in modes (i.e., groups of mutually exclusive and exhaustive behaviors) using a computer-based coding system on four separate passes through the videotapes. Two target infant activities consisted of infant visual exploration of mother or of properties, objects, or events in the environment; blank staring and eyes closed/not looking were also coded. Two other infant activities consisted of nondistress or distress vocalization; bodily sounds and silence were also coded. Two maternal behaviors consisted of the mother's active mother-oriented or environment-oriented stimulation of the infant; not stimulating was also coded. The other two mother behaviors involved vocalization, speaking to the baby in child-directed or in adult-directed speech tones; silence was also coded. Thus, behaviors within a category were mutually exclusive, and any category of behavior could occur at any time. Coding reliabilities (κ) were all acceptable.
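For readers who want to see how such a reliability figure might be computed, the sketch below calculates Cohen's kappa for two coders' decisions within a single mode. The interval-by-interval codes are hypothetical, and the function is a generic implementation rather than the software used in the original study.

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa: agreement between two coders corrected for the
    agreement expected from each coder's own category base rates."""
    n = len(coder_a)
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(
        (freq_a[cat] / n) * (freq_b[cat] / n)
        for cat in set(freq_a) | set(freq_b)
    )
    return (observed - expected) / (1 - expected)

# Hypothetical codes for ten successive units within the infant vocalization mode
codes_a = ["nondistress", "silence", "distress", "silence", "silence",
           "nondistress", "bodily", "silence", "distress", "silence"]
codes_b = ["nondistress", "silence", "distress", "silence", "nondistress",
           "nondistress", "bodily", "silence", "distress", "silence"]

print(f"kappa = {cohens_kappa(codes_a, codes_b):.2f}")
```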

Data were first continuously coded. Then, partial-interval sampling data for three interval durations were derived by computer from the continuous data set; the observe intervals selected for partial-interval sampling were 15, 30, and 45 s, a common range of interval durations for partial-interval sampling in the developmental science literature.
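The derivation of partial-interval data from a continuous record can be sketched roughly as follows: each observe interval is scored 1 if the target behavior occurred at any point during the interval and 0 otherwise. The onset/offset times below are hypothetical, and this is a simplified approximation of the computer derivation described above, not the original procedure.

```python
def partial_interval_sample(events, session_length, interval):
    """Derive partial-interval sampling data from a continuously coded record.

    events: list of (onset, offset) times in seconds for one behavior.
    session_length: total observation time in seconds.
    interval: observe-interval duration in seconds (e.g., 15, 30, or 45).

    Returns a list of 0/1 scores, one per interval: 1 if the behavior
    occurred at any point during that interval, else 0.
    """
    scores = []
    start = 0.0
    while start < session_length:
        end = min(start + interval, session_length)
        occurred = any(onset < end and offset > start for onset, offset in events)
        scores.append(1 if occurred else 0)
        start = end
    return scores

# Hypothetical onsets/offsets (s) of infant nondistress vocalization in a 1-h record
vocalizations = [(12.0, 14.5), (80.2, 83.0), (90.0, 91.1), (1500.0, 1504.2)]

for width in (15, 30, 45):
    scored = partial_interval_sample(vocalizations, session_length=3600, interval=width)
    print(f"{width}-s intervals: behavior scored in {sum(scored)} of {len(scored)} intervals")
```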

Results and Discussion

First, zero-order relations between data coded by partial-interval sampling and data coded continuously were explored. To do this, the bivariate correlations between results obtained via each method were computed. The three forms of partial-interval sampling generally preserved the relative rank obtained by continuous coding. The relation between frequencies coded by partial-interval sampling and by continuous coding was approximately linear.

Second, the possibility of estimating frequencies coded by continuous coding from frequencies coded by partial-interval sampling was examined. To this end, the regression of frequencies derived by continuous coding on frequencies derived by partial-interval sampling within one data set at one time was computed. The parameters of this regression equation were then used to estimate frequencies derived by continuous coding from frequencies derived by partial-interval sampling obtained at a second time. Altogether, more than 90% of estimates of frequencies derived by continuous coding were statistically equivalent to the true frequencies; where differences occurred, the difference between the estimated and true scores was never larger than one-fifth of a standard deviation. These results indicate that frequencies derived by continuous coding at the group level can be estimated relatively accurately once the relation between sampling and frequencies derived by continuous coding has been specified. Furthermore, the results of cross-cultural comparisons of infant and maternal behaviors were largely unaffected by the use of frequencies coded by partial-interval sampling as opposed to frequencies coded by continuous coding.
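A minimal sketch of this estimation logic, using invented frequencies rather than the study's data, is shown below: a regression of continuously coded frequencies on partial-interval frequencies is fit in one data set, and its parameters are then applied to partial-interval frequencies from a second occasion.

```python
import numpy as np

# Hypothetical per-dyad frequencies of one behavior, coded both continuously
# and by 30-s partial-interval sampling, at Time 1.
continuous_t1 = np.array([42, 18, 27, 55, 33, 21, 48, 30])
sampled_t1    = np.array([35, 16, 24, 44, 29, 19, 40, 26])

# Fit the regression of continuous frequencies on sampled frequencies at Time 1.
slope, intercept = np.polyfit(sampled_t1, continuous_t1, deg=1)

# Apply the Time-1 parameters to sampled frequencies from a second occasion
# to estimate what continuous coding would have yielded.
sampled_t2 = np.array([31, 22, 45, 18, 38, 27])
estimated_continuous_t2 = intercept + slope * sampled_t2
print(np.round(estimated_continuous_t2, 1))
```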

Beyond the insights absolute frequency and duration data can yield about population base rates, social scientists are principally interested in the relative standing of individuals or groups for two reasons: (1) Relative standing allows comparison between individuals or groups and (2) relative standing allows for the examination of predictive validity of individual or group variation over time. In other words, researchers want to know whether or not an effect exists. Most statistical inference is based on comparisons of relative ranking rather than on actual quantity. In the final analysis, using any of the data derived by partial-interval sampling in the cross-cultural comparison produced identical results, in terms of significance, to those obtained using data derived by continuous coding.

Emotion Regulation, Parenting, and Displays of Social Reticence in Preschoolers

The substantive purpose of a 2001 investigation by Rubin, Cheah, and Fox was to determine the extent to which observed social reticence–inhibition in children could be predicted by their disregulated temperament, the observed parenting behaviors, and the interaction between temperament and parenting style. This study also illustrated the use of rating scales in encoding of audiovisual records.

Methods and Procedures

Mothers of 4-year-old children completed the Colorado Child Temperament Inventory, which comprises factors that assess maternal perceptions of dispositional characteristics (e.g., emotionality, activity level, shyness, and soothability). Factors assessing emotionality (five items; e.g., “child often fusses and cries”) and soothability (five items; e.g., “when upset by an unexpected situation, child quickly calms down”) were composited to form an index of emotion disregulation comprising high negative emotionality and low soothability.

Children were assigned to quartets of unfamiliar same-sex, same-age peers and observed in a small playroom filled with attractive toys. Behaviors in the peer play session were coded in 10-s intervals for social participation (unoccupied, onlooking, solitary play, parallel play, conversation, or group play) and the cognitive quality of play (functional, dramatic, and constructive play; exploration; or games-with-rules). For each coding interval, coders selected 1 of 20 possible combinations of cognitive play nested within the social participation categories. The proportion of observational intervals that included the display of anxious behaviors (e.g., digit sucking, hair pulling, or crying) was also coded. Time samples of unoccupied, onlooking, and anxious behaviors were combined to obtain an index of social reticence.

The peer quartet was followed 6–8 weeks later by a visit to the laboratory by each child and his or her mother. All children and mothers were observed in two distinct mother-child situations: During an unstructured free-play session, mother and child were told that the child was free to play with anything in the room (15 min); during a second session, the mother was asked to help guide and teach her child to create a Lego structure that matched a model on the table at which mother and child were seated (15 min). Mothers were asked not to build the model for the child and to refrain from touching the materials during this teaching task, which was thought to be challenging for a 4-year-old. A maternal behavioral rating scale measure was used to assess: (1) proximity and orientation: the parent's physical location with reference to the child and parental nonverbal attentiveness; (2) positive affect: the positive quality of maternal emotional expressiveness toward the child; (3) hostile affect: negative instances of verbal and nonverbal behavior arising from feeling hostile toward the child; (4) negative affect: the negative quality of maternal expressiveness that reflects maternal sadness, fearfulness, and/or anxiety in response to the child's behavior; (5) negative control: the amount of control a mother exerts over the child that is ill-timed, excessive, and inappropriately controlling relative to what the child is doing; and (6) positive control and guidance: the amount that the mother facilitates the child's behavior or provides supportive assistance that is well-timed.

The free-play and Lego-teaching-task sessions were coded in blocks of 1 min each. For each 1-min interval, observers rated each of the maternal behaviors on a three-point scale, with higher maternal behavioral ratings indicating greater maternal expressions of proximity and orientation, positive affect, hostile affect, negative affect, positive control, and negative control. For example, with regard to positive affect, observers gave a score of 1 if no instances of parental affection, positive feeling, or enjoyment were observed; 2 if moderate positive expression or enjoyment was observed; and 3 if the mother expressed outright physical or verbal affection or positive statements of praise for the child.

A free-play solicitous parenting aggregate was created by first standardizing and then adding the following variables: free-play proximity and orientation, free-play positive affect, free-play positive control, and free-play negative control. Lego-teaching-task maternal solicitousness was measured as in free play by standardizing and then adding the following variables: Lego proximity and orientation, Lego positive affect, Lego negative control, and Lego positive control.
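The standardize-then-sum aggregation can be illustrated with a brief sketch; the ratings below are invented, and the variable names are placeholders for the free-play ratings described above.

```python
import numpy as np

def zscore(x):
    """Standardize a variable to mean 0, SD 1 across mothers."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std(ddof=1)

# Hypothetical mean free-play ratings (1-3 scale) for five mothers
proximity        = [2.6, 1.9, 2.8, 2.1, 2.4]
positive_affect  = [2.2, 1.7, 2.9, 2.0, 2.5]
positive_control = [2.4, 1.8, 2.7, 2.2, 2.3]
negative_control = [1.3, 1.1, 2.2, 1.4, 1.6]

# Solicitous-parenting aggregate: standardize each variable, then sum
solicitousness = (zscore(proximity) + zscore(positive_affect)
                  + zscore(positive_control) + zscore(negative_control))
print(np.round(solicitousness, 2))
```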

Results and Discussion

Socially wary and reticent preschoolers were emotionally disregulated; that is, they had relatively low thresholds for the evocation of negative affect and difficulty in being calmed once emotionally upset. However, emotional disregulation was not a significant predictor of reticence, suggesting that preschool reticence is a product of factors other than dispositional characteristics. Children with mothers who are controlling and oversolicitous during parent-child free play, and emotionally disregulated children whose mothers did not engage in such behaviors during a more stressful teaching task, are thought to be more likely to display social reticence in a peer setting. Maternal oversolicitous (overprotective) behavior during the unstructured free-play situation was significantly and positively predictive of children's socially reticent, wary, and shy behaviors—behaviors known to be associated contemporaneously and predictively with anxiety, psychological overcontrol, interpersonal problem-solving deficits, and poor peer relationships. The findings also suggest that a parenting constellation of warmth, proximity, intrusive control, and joint activity was associated with shy, reticent (and inhibited) child behavior. Taken together, these findings indicate that mothers of reticent preschool-age children do not behave in ways that allow their children to develop self-initiated coping techniques.

Another significant maternal contribution to the prediction of preschoolers' socially reticent behavior was the moderating role played by mothers' solicitousness during the putatively stressful, goal-oriented Lego construction task. For children whose mothers offered limited warmth, guidance, and control during the Lego task, emotional disregulation was significantly and positively associated with the display of socially reticent behaviors among peers. The moderating role of parenting was demonstrated by the lack of a significant association between emotional disregulation and social reticence among preschoolers whose mothers provided appropriate control during the Lego paradigm. Thus, in structured situations in which parental goals are task oriented (such as the Lego teaching task), the display of maternal direction and guidance strategies may be normative and appropriate.


What has been established when there is a substantial correlation between test scores and job-performance scores?

If there is a substantial correlation between test scores and job-performance scores, criterion-related validity has been established.

Which method for assessing validity involves giving a measure to applicants then correlating it with some criterion at a later time?

A predictive (criterion-related) validation assesses the validity of a test by administering it to applicants and then correlating their test scores with a criterion, such as job performance, measured at a later time. By contrast, a concurrent validation administers the test to people already on the job and correlates test scores with existing measures of each person's performance.

How do you determine the validity of a test?

Estimating Validity of a Test: 5 Methods
1. Correlation Coefficient Method: the scores on the newly constructed test are correlated with criterion scores.
2. Cross-Validation Method
3. Expectancy Table Method
4. Item Analysis Method
5. Method of Inter-Correlation of Items and Factor Analysis

Which type of validation relates test scores of all applicants to their future performance?

Predictive validity is determined by seeing how well the test scores of applicants predict their future job performance. If an employer's selection testing program is truly job-related, it follows that the results of its selection tests should accurately predict job performance.