Thursday, May 10, 2007

Assessing Education Performance

Types and Limitations of Psychometric Tests


Psychometric refers to any measure of a mental ability, but it also refers to the mathematical, and in particular statistical, measurements used on psychological data. For this topic, intelligence tests are of most interest and relevance.


The assumption behind intelligence tests is that there is a general mental ability (g) which underlies performance on many different types of tests. However, it is also believed that alongside this general ability there are specific abilities (s) which can influence certain types of tasks. 'Thus in any intelligent act, "g" is involved, plus the "s" factor or factors appropriate to that particular act' (Fontana 1995, p. 103). IQ scores are now measured against an average of 100, and deviation from this norm is taken to reflect either less intelligence than is usual or more. For example, an individual who scores 70 on an IQ test will be regarded as borderline (at the point between 'normal' intelligence and that of experiencing learning difficulties/disabilities). A score of over 130 indicates intelligence very significantly above the norm.
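The idea of scoring against an average of 100 can be sketched in a few lines of Python. This is only an illustration, not how any published test is actually scored. The standard deviation of 15 is an assumption (it is the convention on Wechsler-style tests and is not given in the text above), and the function names `deviation_iq`, `describe` and `percentile` are made up for this example.

```python
from statistics import NormalDist

IQ_MEAN = 100  # the average stated in the text
IQ_SD = 15     # assumption: Wechsler-style standard deviation, not stated in the text


def deviation_iq(raw_score, norm_mean, norm_sd):
    """Rescale a raw test score so that the norm group averages 100."""
    z = (raw_score - norm_mean) / norm_sd  # distance from the norm in SD units
    return IQ_MEAN + IQ_SD * z


def describe(iq):
    """Apply the two cut-offs mentioned in the text (70 and 130)."""
    if iq <= 70:
        return "borderline or below"
    if iq > 130:
        return "well above the norm"
    return "within the usual range"


def percentile(iq):
    """Percentage of the norm group expected to score below this IQ,
    assuming scores follow a normal distribution."""
    return NormalDist(IQ_MEAN, IQ_SD).cdf(iq) * 100


if __name__ == "__main__":
    # Hypothetical raw score of 30 on a test whose norm group averages 40 (SD 5).
    iq = deviation_iq(30, norm_mean=40, norm_sd=5)
    print(iq, describe(iq), round(percentile(iq), 1))
```

Under these assumptions, 70 and 130 sit two standard deviations either side of the average, which is why they are used as cut-off points.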


A key point in regard to intelligence tests is that they aim to measure underlying ability in regard to intelligence and not the products of specific learning programmes. Attainment tests, such as National Curriculum tests, measure the outcomes, or knowledge demonstrated, after specific programmes of instruction. Intelligence tests attempt to measure abilities, which reflect experience but are not specifically taught as part of the school curriculum.


While there are a number of mental ability tests used in public schools, the two most frequently used contemporary individual intelligence test batteries are the Stanford-Binet Intelligence Scale (1986) and the Wechsler Intelligence Scale for Children (1992). Different school psychologists may use either of these tests.


The Stanford-Binet Intelligence Scale (1986) has undergone a number of transformations from the original English-language version developed by Terman (1916). The latest revision of the Stanford-Binet (1986) is the fourth edition. This edition attempts to address some of the criticisms that have been leveled at the Stanford-Binet test and intelligence tests in general.


In responding to these criticisms, the Stanford-Binet Intelligence Scale (1986) generally has avoided using the term intelligence quotient or IQ score. The IQ score has been replaced with the standard age score (SAS). This change in terminology came after the term IQ score was removed from a number of the standardized group intelligence tests. The group tests are now termed mental abilities or cognitive abilities tests. Now, the only major individual intelligence test to consistently use the term IQ score is the Wechsler Battery.


The editors of the Stanford-Binet also have responded to critics by expanding the areas of material covered by the test. The Stanford-Binet had long been criticized as too heavily weighted toward vocabulary and reasoning skills. The new version attempts to correct for such biases by increasing the variety of subtests included in the battery. There are now fifteen subtests in the latest Stanford-Binet scale, which are grouped into four ability scales:



  1. The Verbal Reasoning ability scale contains four subtests. These tests are designed to measure the ability to define words; to comprehend the use of items, objects, or events; to determine what is missing in a picture; and to identify differences and similarities in a series of words.

  2. The Abstract/Visual Reasoning ability scale contains four subtests. These tests attempt to measure the ability to complete different visual patterns. The subject is asked to use blocks to complete a pattern or design; to copy figures; to complete a matrix; and to identify what a folded paper object would resemble once it is unfolded.

  3. The Quantitative Reasoning ability scale is composed of three subtests. These tests involve pictorial and verbal arithmetic problems; different types of numerical series with the last two digits in the series absent; and equations that the subject must unscramble and solve.

  4. The Short Term Memory scale contains four subtests. These tests involve repeating word for word a sentence that is read aloud; a visual presentation of a stack of beads that must be correctly repeated in a certain sequence; repeating and reversing a series of digits; and a series of pictures of various objects that must be correctly recalled in the order in which they were presented.


The Wechsler Battery of Intelligence tests is divided into three separate versions. The Wechsler Preschool and Primary Scale of Intelligence-Revised (WPPSI-R) (1989) is for ages three to seven. The Wechsler Intelligence Scale for Children-III (WISC-III) (1992) is for ages seven to sixteen. The Wechsler Adult Intelligence Scale-Revised (WAIS-R) (1981) is for ages sixteen years and older. The WISC-III test is the one most commonly used in public schools. With a few exceptions, the WPPSI-R and WAIS-R follow the same general format.


The WISC-III is divided into two basic sections: verbal and performance. The verbal section examines reasoning and vocabulary skills. The performance section examines visual-spatial skills. The combined score from these two sections yields a full-scale IQ score. Thus, the examiner can obtain three IQ scores from this test: a verbal IQ, a performance IQ and a full-scale IQ.


The verbal section contains six subtests: the Information test, which involves general knowledge questions about the culture and the environment; the Similarities test, in which the subject is asked to compare two items and determine the ways in which the items are similar; the Arithmetic test, which involves presenting the subject with verbal arithmetic problems in sentence form; the Vocabulary test, in which the subject is asked to define specific words; the Comprehension test, in which the subject is asked what would be appropriate in a given situation (for example, why is it wrong to set off a fire alarm when there is no fire?); and the Digit Span test, in which the subject is asked to repeat a series of digits and, if this is completed correctly, is then asked to repeat another series of digits.


The performance section of the WISC-III contains the following subtests: the Picture Completion test, in which the subject is shown a series of pictures each of which has a part missing that the subject is asked to identify; the Picture Arrangement test, in which the subject is presented with a series of pictures in a mixed-up order that the subject must then arrange in a logical format that tells a coherent story; the Block Design test, in which the subject is presented with a cube with red and white designs (somewhat like a Rubik's cube) that the subject is asked to change into a number of different designs that are displayed on cards; the Object Assembly test, which is like a child's puzzle where pieces are provided that the subject must place together in the correct manner; the Coding test, in which the subject is provided with a series of nonverbal symbols that the subject must copy correctly in a space below the symbol; and the Mazes test (a supplementary test that is not generally used in the standard WISC-III test), which involves the subject correctly tracing a path through a series of mazes.


Limitations -


Validity: Simply, do these tests measure what they claim to measure? One argument is that since intelligence is such a diverse set of abilities (refer to Gardner's ideas on intelligence), no single test can hope to cover all aspects of what we may regard intelligence to be. Another issue is that the tests outlined above assume that intelligence is a fixed and global phenomenon, meaning they work with the idea that intelligence affects all aspects of your functioning, and in a predictable and static way. But what if this is not the case? For example, we know that people can score low in maths tests but do complex calculations when shopping and in other everyday settings (Cumming and Maxwell, 1999). Thus there may be a mismatch between test scores and ability.


Factors that affect the reliability of tests: comprehension of questions, presence of the tester, motivation and self-efficacy with regard to the test, and previous educational experience.


Issues regarding ethnic groups: Much controversy surrounds the issue of differences in IQ scores across different ethnic groups. There have been criticisms regarding whether IQ tests are culturally fair. Modern IQ tests have struggled to eliminate such biases. It is also argued that IQ tests are culturally bound, that is, they reflect a Western view of intelligence. However, even within Western societies there are differences in IQ scores between ethnic groups. Although individuals from all ethnic groups can be seen at all levels of IQ, the mean IQ of white Americans is higher than that of black Americans. The APA (1996) report that this result is not due to differences in socio-economic status or to obvious biases in test construction. Further, they state that there is no evidence to support a genetic interpretation for these findings, but the reason for such differences is not known. In a discussion of such differences it is crucial to recall what IQ tests actually measure. Neisser (1997, p. 1) states that IQ tests 'tap certain abilities that are relevant to success in school and do so with remarkable consistency. On the other hand, many significant cognitive traits - creativity, wisdom, practical sense, social sensitivity - are obviously beyond their reach'.




 


Types of Performance Assessments at Different Ages


Because the UK Government wishes to measure the effectiveness of teaching and learning in schools, and so that international comparisons can be maintained, there is a lot of pressure on schools to test their pupils at certain key stages in their educational progress. These tests are known as SATs (Standard Attainment Tests) and are undertaken at ages 4/5 (baseline measure), 7, 11 and 14, and then via GCSEs at age 16. These tests tie in with the end of each Key Stage. Following on from the Dearing Report (1994), level descriptions/descriptors (LDs) were introduced to provide teachers with guidance on the level of knowledge, understanding and skills that students would need to show for attainment of each level. Teachers then use these LDs to produce a '"best-fit" of their pupils' work to these level descriptions; in other words they will select the level description that most closely fits the work of each pupil' (Baumann et al., 1997, p. 141). Pupils will therefore be assessed as 'working towards', 'working at' or 'working above' the level that matches the Key Stage they are in.


In attempting to encourage learning, teachers will often use many types of assessment throughout a student's school experience. These will be either formative (giving an informal impression of progress during the course) or summative (summing up student performance at the end of a period of study). In addition, they can be either norm referenced (scores compared against the norm, either within the school or nationally) or criterion referenced (measured against a set of specific criteria that the student can either achieve or not achieve). So, for example, a GCSE mock exam can be either formative or summative depending on when the exam is taken. It can also be marked against standards set within the school or, more likely, using an exam mark scheme which can compare student scores against national exam standards.
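The difference between norm referenced and criterion referenced marking can be shown with a small Python sketch. Everything in it is hypothetical: the function names, the class marks and the criteria are invented purely to illustrate the two ideas and do not come from any real mark scheme.

```python
def norm_referenced(score, cohort_scores):
    """Norm referenced: place a score relative to everyone else.

    Returns the percentage of the cohort scoring strictly below this score.
    """
    below = sum(1 for s in cohort_scores if s < score)
    return 100 * below / len(cohort_scores)


def criterion_referenced(evidence, criteria):
    """Criterion referenced: check the work against fixed criteria.

    Each criterion is either achieved or not achieved, regardless of
    how other students performed. Returns the per-criterion results
    and whether every criterion was met.
    """
    results = {criterion: criterion in evidence for criterion in criteria}
    return results, all(results.values())


if __name__ == "__main__":
    # Hypothetical mock exam marks for a class of five.
    class_marks = [40, 50, 60, 70, 80]
    print(norm_referenced(60, class_marks))  # position within the class

    # Hypothetical criteria from a mark scheme, and what one student showed.
    criteria = ["defines key terms", "describes a study", "evaluates the study"]
    shown = {"defines key terms", "describes a study"}
    print(criterion_referenced(shown, criteria))
```

The same mark of 60 could look strong in one class and weak in another under norm referencing, whereas the criterion referenced result depends only on what the student's own work demonstrates.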


 


Implications of Assessment and Categorisation


Self-fulfilling Prophecy: This refers to the expectations that teachers have of their students, and to the idea that these expectations (prophecies) may come true because of how teachers treat their students. This area of research was initially stimulated by a study undertaken by Rosenthal and Jacobson (1968). Teachers were told on the basis of IQ test results that certain students were identified as late bloomers and that these students would really come on academically. Eight months later IQ tests were given to all students. It was found that the late bloomers had improved their IQ test scores by as much as 30 IQ points, whereas those not indicated as late bloomers showed no significant improvement. The key point in this study is that the students identified as late bloomers were not chosen on the basis of IQ test scores but were in fact chosen randomly from the class register. It would seem that the only difference between the students was the teacher's expectations.


More recent research has not always replicated these findings, although some studies have. For instance, Rosenthal reviewed 242 studies on labelling and found that in 84 of them labelling did affect performance. However, Fuller found that Black girls in a London comprehensive school fought against the labelling process and did better than expected. Thus we cannot conclude that assessment and the possible consequences of creating a self-fulfilling prophecy (SFP) will always have the same effect on all students. Certainly there seems to be no direct link between teacher expectations based on assessment and student performance; however, intermediary variables, such as self-esteem and self-efficacy, may be affected by such expectations, thus contributing to differences in performance.


Segregation or Inclusion: As referred to in the material on Disruptive Behaviours and also on Special Educational Needs, there is an issue as to whether schools and education services segregate or include all students within their systems. Whilst there are many problems with both segregation (isolation, labelling, negative expectations) and inclusion (disruption of learning for others, lack of appropriate resources and assistance) these processes are done on the basis of assessment. Thus we should be aware that assessment and categorisation can directly affect which schools and what type of educational provision is offered to students.


Cognitive Backwash: This refers to the processes that both students and teachers engage in when they know how key assessments will take place. For example, if the key assessment is an external examination (GCSEs, AS and A2 levels) then students will tend to learn and prepare just for how the exam will assess them. This may mean plenty of surface learning so that they get the facts correct. Teachers, in turn, will teach to the assessment criteria, which may mean lots of note taking for students so that all the correct material is covered, and plenty of formative assessment using past examination papers. The conclusion we reach here is that teachers may end up teaching, and students learning, in certain ways so that external examination assessment leads to success rather than failure. However, this may mean that teaching and learning are being dictated by assessment rather than the other way round.


Affective Backwash: This is the emotional reaction that both students and teachers can experience depending on the nature of the assessment processes that they face. Believe it or not, many teachers do get highly anxious when their students are going to be assessed by external examinations because they feel that the students' performance will be a reflection on their teaching. More of a problem, though, is the anxiety and stress experienced by many students when faced with certain types of assessment; this can lead to underperformance and even more anxiety and stress when faced by similar types of assessment in the future. Given that assessment of 7-year-olds is now a common occurrence, one can wonder at the experience they have at school!


High Stakes Testing and Cheating: Popham (1987) coined the term high-stakes testing to refer to school districts in the USA where major educational decisions are based on achievement test scores. Such decisions include school funding allocations, placement decisions, streaming and setting, merit pay for teachers, and evaluations of teachers and principals. Popham believed that standardized tests can serve as "instructional magnets." Such "magnets" focus and improve instruction by concentrating it on specific outcomes.


Other researchers disagree with this view, stating that high-stakes testing may improve test scores without a commensurate gain in learning (Cannell, 1988; Shepard, 1990; Shepard & Dougherty, 1991). Part of the reason for such a discrepancy between test performance and learning is the extensive time spent in preparation for taking the tests.


In their survey of high-stakes testing, Shepard and Dougherty (1991) found that 6 percent of teachers believed that changing incorrect answers to correct ones on answer documents occurred in their school. The study reported that 8 percent of teachers indicated that students who might have trouble on the test were encouraged to be absent in their school. Additional findings indicated that 23 percent of teachers believed that hints to correct answers were given and that 18 percent believed that questions were rephrased to help students in their school.


These types of teacher behaviours are considered unethical by the major professional educational and psychological associations. Such practices compromise the integrity of the tests and call into question the entire educational process.



# posted by art1 @ 10:05 AM


