
LearnHub




Testing Terms

Analytic Scoring: A scoring system in which the hypothesized components of the skill (often writing) have been analyzed, and it is these components that make up the categories used in scoring.

Benchmark papers: Classic examples of papers typical of the levels they represent in a holistic scoring system. (These typical examples are also referred to as anchor papers.)

Categorical data: Characterizations that divide people or things into groups (such as “limited English proficient students” and “fluent English proficient students”). This type of data is also sometimes referred to as nominal data, because the labels serve to name the classes or groups of people or things.

Conversational cloze test: An indirect test of speaking in which a cloze passage is created from the transcription of other test-takers.

Criterion Referenced: An approach to testing in which a given score is interpreted relative to a pre-set goal or objective (the criterion), rather than to the performance of other test-takers.

Discrete-point tests: Assessment instruments in which each item is intended to measure one and only one linguistic element.


Holistic Scoring: A scoring procedure (often used in writing assessment) in which the reader/listener reacts to the student’s composition/oral response as a whole: a single score is awarded to the writing/response.

Mean: The mathematical average of a group of scores. It is often represented by an “X” with a bar over it (X̄), which is sometimes called “X-bar.”
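
As a minimal illustration, assuming five invented scores (they are not taken from any real test), the mean is the sum of the scores divided by the number of scores:

```python
# Hypothetical raw scores for five test-takers (invented for illustration).
scores = [70, 80, 85, 90, 75]

# X-bar: the sum of the scores divided by the number of scores.
mean = sum(scores) / len(scores)
print(mean)  # 80.0
```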

Modality: The channel of language used (spoken or written).

Multiple-choice items: Items that consist of a stem (the beginning of the item) and either three, four, or five answer options (with four options probably being the most common format). One, and only one, of the options is correct, and this is called the key. The incorrect options are called distractors.
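
The parts of a multiple-choice item can be pictured as a small data structure. This is only a sketch: the class name, field names, and the sample item are assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class MultipleChoiceItem:
    """Hypothetical container for one multiple-choice item."""
    stem: str            # the beginning of the item
    options: list[str]   # three, four, or five answer options
    key_index: int       # position of the one correct option (the key)

    def distractors(self) -> list[str]:
        # Every option other than the key is a distractor.
        return [opt for i, opt in enumerate(self.options) if i != self.key_index]

    def is_correct(self, chosen_index: int) -> bool:
        # A response is correct only if it selects the key.
        return chosen_index == self.key_index


# Invented example item with four options.
item = MultipleChoiceItem(
    stem="She ___ to school every day.",
    options=["go", "goes", "going", "gone"],
    key_index=1,
)
print(item.distractors())  # ['go', 'going', 'gone']
```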

Multiple trait scoring: A system for scoring writing that assigns scores to various hypothesized components of writing (as does analytic scoring) but which is more context-specific and involves more rater training and more reader involvement in the instrument development process than is the case with other analytic scoring measurements.

Norming: A process by which graders (raters) are trained to score speech or writing samples using a set scale. The raters independently read or listen to answers submitted by test candidates and score them using the scale descriptors. Then the raters compare the scores they have awarded. Any discrepancies are discussed until the group reaches agreement. Then another set of samples is read or listened to, scored, and compared, and so on.

Norm Referenced Test: A test associated with the familiar bell-shaped curve, which is referred to in the phrase “grading on the curve.” In this approach, grades or scores are based on a comparison of the test-takers to a “norming group” carefully selected to be representative of those expected to take the test.
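
The difference between criterion-referenced and norm-referenced interpretation can be sketched with one score read two ways. All numbers below are invented, the cut score is an assumed criterion, and the percentile-rank formula (percentage of the norming group scoring lower) is just one common convention.

```python
# Hypothetical data, invented for illustration.
norming_group = [45, 52, 58, 60, 63, 67, 70, 74, 81, 90]
cut_score = 65          # assumed pre-set criterion
candidate_score = 70

# Criterion-referenced: compare the score with the pre-set goal only.
meets_criterion = candidate_score >= cut_score

# Norm-referenced: compare the score with the norming group's scores.
below = sum(1 for s in norming_group if s < candidate_score)
percentile_rank = 100 * below / len(norming_group)

print(meets_criterion)   # True
print(percentile_rank)   # 60.0
```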


Point-biserial correlation coefficient: A correlation coefficient that uses one set of interval data (e.g., test scores) and one set of dichotomous categorical data (that is, categorical data with only two categories, hence the “bi” in biserial).
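
A minimal sketch of the calculation, assuming invented data and one common textbook form of the formula (difference between the two group means, divided by the standard deviation of all scores, times the square root of p times q):

```python
from math import sqrt

# Hypothetical data: total test scores (interval) and whether each
# test-taker answered one particular item correctly (1) or not (0).
total_scores = [10, 12, 15, 18, 20, 22]
item_correct = [0, 0, 1, 0, 1, 1]

n = len(total_scores)
mean_all = sum(total_scores) / n
# Population standard deviation of all scores (divides by n).
sd_all = sqrt(sum((x - mean_all) ** 2 for x in total_scores) / n)

right = [x for x, c in zip(total_scores, item_correct) if c == 1]
wrong = [x for x, c in zip(total_scores, item_correct) if c == 0]
p = len(right) / n      # proportion answering correctly
q = 1 - p               # proportion answering incorrectly

mean_right = sum(right) / len(right)
mean_wrong = sum(wrong) / len(wrong)
r_pb = (mean_right - mean_wrong) / sd_all * sqrt(p * q)
print(round(r_pb, 3))  # 0.665
```

A positive value here means that test-takers who answered the item correctly tended to have higher total scores.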

Objective scoring: Scoring procedures that attempt to eliminate the subjectivity involved in grader/rater judgments and therefore reduce the possibility of unreliability of the sort introduced via the scoring process.

Prompt: The topic for an essay/speaking task.

Rater Reliability: The consistency with which raters (graders) use a scoring system. There are two main types of rater reliability:
1. Intra-rater reliability is determined by having the same person evaluate the same data (usually writing or speaking samples) on two different occasions and comparing the results to see how similar they are.
2. Inter-rater reliability refers to the consistency with which two (or more) raters evaluate the same data using the same scoring criteria.
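
One simple way to compare two sets of ratings is percent agreement. The sketch below uses invented scores on an assumed 1-6 holistic scale; agreement rates are only one of several possible indices, and the same comparison could be applied to one rater's scores from two occasions (intra-rater reliability).

```python
# Hypothetical holistic scores (1-6 scale) given by two raters
# to the same eight writing samples.
rater_a = [4, 3, 5, 2, 4, 6, 3, 5]
rater_b = [4, 3, 4, 2, 5, 6, 3, 5]

pairs = list(zip(rater_a, rater_b))

# Exact agreement: proportion of samples given identical scores.
exact = sum(a == b for a, b in pairs) / len(pairs)

# Adjacent agreement: scores that differ by at most one scale point.
adjacent = sum(abs(a - b) <= 1 for a, b in pairs) / len(pairs)

print(exact)     # 0.75
print(adjacent)  # 1.0
```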

Raw Data: Records or measurements that have not yet been processed or statistically manipulated in any way.

Reliability: The extent to which a test measures consistently.

Spearman’s rank order correlation coefficient: A correlation statistic that uses two sets of ordinal data (or one set of ordinal data and one set of interval data, the latter of which can easily be converted to ordinal data) to determine the relationship between two rankings; also called “Spearman’s r” or “Spearman’s rho.”
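
A minimal sketch of the calculation, assuming two invented rankings with no tied ranks and the common shortcut formula (one minus six times the sum of squared rank differences, divided by n times n squared minus one):

```python
# Hypothetical rankings of six students by two raters (no tied ranks).
ranks_rater_1 = [1, 2, 3, 4, 5, 6]
ranks_rater_2 = [2, 1, 4, 3, 6, 5]

n = len(ranks_rater_1)
# Sum of squared differences between paired ranks.
sum_d_squared = sum((a - b) ** 2 for a, b in zip(ranks_rater_1, ranks_rater_2))

# Shortcut formula, valid only when there are no ties.
rho = 1 - (6 * sum_d_squared) / (n * (n ** 2 - 1))
print(round(rho, 3))  # 0.829
```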

Standard deviation: A statistic that summarizes the average amount of difference from the mean in any given data set; also the square root of variance. Since variance is symbolized by “s²,” standard deviation (being the square root of s²) is symbolized by “s.”

Standardization: The process of testing a group of people to see the scores that are typically attained. With a standardized test, a participant can see where his or her score falls relative to the standardization group's performance. To standardize a test, the normative group must reflect the population for which the test was designed. The group's performance is the basis for the test's norms.

Subjective Scoring: Scoring procedures that involve raters making value judgments about texts produced by the test-takers.

Validity: The extent to which a test measures what it is supposed to measure.

Variance: The technical term that captures the collective amount of “differentness” in any given set of scores. Variance (usually symbolized by a lower-case “s²”) is defined as “a measure of dispersion around the mean” (Hemming, 1987:198). Or, as Jaeger put it (1990:384), “Variance is an indicator of the spread of scores in a distribution” (a distribution being a set of scores).
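
Variance and its square root, the standard deviation, can be sketched as follows. The scores are invented, and dividing by n - 1 (the sample formula) is an assumption; some texts divide by n instead.

```python
from math import sqrt

# Hypothetical raw scores (invented for illustration).
scores = [70, 80, 85, 90, 75]
mean = sum(scores) / len(scores)

# Sample variance: squared deviations from the mean,
# summed and divided by n - 1.
variance = sum((x - mean) ** 2 for x in scores) / (len(scores) - 1)

# Standard deviation: the square root of the variance.
std_dev = sqrt(variance)

print(variance)           # 62.5
print(round(std_dev, 2))  # 7.91
```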

Washback: The effect a test has on teaching and learning.

There are eight purposes for conducting language assessments:

1. To determine a learner’s potential talent or capacity for learning languages. These types of tests are called aptitude tests. Aptitude tests do not test one’s skill in a particular language; rather, they are intended to assess a person’s ability to learn any language. These types of tests are usually standardized.
2. To determine which language a student is able to use best. These types of tests are called language dominance tests. Usually such procedures involve assessing potentially bilingual students in both languages they have been exposed to, in order to see which is the stronger (dominant) language. The results of these types of tests are used to determine which language should be used for instruction.
3. To determine how well a student can perform using a language. These types of tests are known as proficiency tests. The problem here is how to define the construct of proficiency. Definitions of proficiency generally involve the concept of overall language use in a variety of circumstances, involving all four skills (reading, writing, listening, and speaking) and all levels of language. One key feature of proficiency tests is that it does not matter how the student became proficient. These types of tests are usually standardized and use criterion referencing (scores interpreted against pre-set standards).
4. To select the students who are most likely to succeed. These types of tests are called admission or screening tests; they are used to select the students most likely to perform well in a particular program. One such test is the TOEFL. A program using such a test sets a minimum score that it feels indicates the student has the knowledge needed to perform well within the program. These tests are usually standardized and use norm referencing (grading on a curve).
5. To determine placement in a class within a learning program. As the name suggests, placement tests are used to define a student’s language skills relative to the levels of the particular program he or she is entering. Placement tests are usually based on content related to the curricula of particular levels within the program that the student is entering, so that the student can be placed in an appropriate learning environment within the program. These tests are usually created by
6. To determine a student’s strengths and weaknesses. These types of tests are called diagnostic tests. Diagnostic tests are very closely related to syllabuses of specific classes so that the teachers can decide how to gear the instruction of the class and then to help the students decide what curriculum areas they need to work on. These tests are most often created by teachers or the creator of the text and are not standardized.
7. To determine how well students are doing with material that has been covered in class. These types of tests are called progress tests. These tests or quizzes are employed as part of the ongoing instructional aspect of the course and are very closely tied to the course content. These tests are created by the teacher or the creator of the text and are not standardized.
8. To determine how well students have mastered the skills taught in the course. These types of tests are called achievement tests. Achievement tests are ideally based on the goals and objectives of the course. These tests are usually in-house tests or tests provided by the creator of the text. They are usually administered by the teachers of the class and are not standardized.
