Test Development Process

Standard 1.0 ...[A]ppropriate validity evidence in support of each intended [test score] interpretation should be provided. Standards for Educational and Psychological Testing (AERA, APA & NCME, 2014)

Validity refers to the degree to which evidence and theory support the interpretations of test scores for proposed uses of tests. Validity is, therefore, the most fundamental consideration in developing tests and evaluating tests. The process of validation involves accumulating relevant evidence to provide a sound scientific basis for the proposed score interpretations. Standards for Educational and Psychological Testing (AERA, APA & NCME, 2014, p.11)

A passing score on the MTEL is a requirement for obtaining educator licensure in Massachusetts. The validity of the MTEL is based on the accumulation of evidence that supports the use of the MTEL for making pass/fail determinations within this licensure context. The process of accumulating validity evidence was interwoven throughout the development of the MTEL tests, including the establishment and participation of advisory committees, the definition of test content, the development of test items, and the establishment of passing standards. The test development process included steps designed to help ensure that

  • the test content is aligned with regulations, policy, and practice for Massachusetts public schools,
  • the test items accurately assess the defined content, are job-related, and are free from bias, and
  • the passing scores reflect the level appropriate for the use of the MTEL in making pass/fail determinations as a requirement for receiving educator licensure in Massachusetts.

The test development procedures, including the accumulation of validity evidence, are described in the following sections of this manual:

Establishing Advisory Committees

Standard 1.9 When a validation rests in part on the opinions or decisions of expert judges, observers, or raters, procedures for selecting such experts and for eliciting judgments or ratings should be fully described. The qualifications and experience of the judges should be presented. Standards for Educational and Psychological Testing (AERA, APA & NCME, 2014)

Involving Massachusetts educators in test development activities was an important component of establishing a validity basis for the MTEL program. Massachusetts educators served on MTEL advisory committees throughout the development of the tests. The involvement of Massachusetts public school educators and faculty preparing prospective educators contributes to validity by grounding the program in Massachusetts practice and requirements.

Development of the new and updated tests for the MTEL was a collaborative effort involving the Massachusetts Department of Elementary and Secondary Education (the Department), the Evaluation Systems group of Pearson (Evaluation Systems), and Massachusetts educators, including educators who served on Content Advisory Committees (CACs), the Bias Review Committee (BRC), and Qualifying Score Panels.

CACs were charged with reviewing and validating the content of the tests; one CAC was constituted for each test field. Bias prevention was the focus of the BRC, a group of Massachusetts educators who participated in reviews of test materials throughout the development process. Separate Qualifying Score Panels composed of Massachusetts educators (including some members of the CACs and the BRC) were involved in making judgments that were provided to the Massachusetts Commissioner of Elementary and Secondary Education (the Commissioner) for use in setting passing scores for the tests. See Supplemental Test Development Information for further information about the composition of specific advisory committees.

In addition to advisory committees of Massachusetts educators, an MTEL Technical Advisory Committee, composed of national psychometric and testing experts, provides technical review, guidance, and oversight for the Department in the continuing development and implementation of the MTEL. The committee, composed of approximately 3–5 individuals, helps to ensure that the psychometric characteristics of the testing program are appropriate and consistent with AERA/APA/NCME Standards for Educational and Psychological Testing, including by conducting periodic reviews of program data.

Content Advisory Committees

A Content Advisory Committee (CAC) composed of Massachusetts educators (typically about 8–12) associated with the test field was established for each test field. The low-incidence language fields are governed by a single set of Regulations for Educator Licensure and Preparation Program Approval and share a common test design and test objectives; for these fields, a single cross-language committee of educators was convened to review the test objectives and assessment specifications, and separate committees representing the component fields were convened to review the test items.

The CACs included public school educators and faculty engaged in the preparation of prospective educators. Nominations for membership on the committees were elicited from public school administrators, deans at higher education institutions, public school educators, educator organizations, academic and professional associations, and other sources specified by the Department. Evaluation Systems documented the nominations of eligible educators for the Department, which reviewed the applications and, based on qualifications (e.g., content training, years of experience, accomplishments), selected the educators to invite to serve on the CACs.

Committee members were selected to include

  • public school educators (typically a majority of the members), and
  • higher education faculty preparing prospective educators (arts and sciences, fine arts, and/or education faculty).

In addition, committee members were selected with consideration given to the following criteria:

  • Representation from different levels of teaching (i.e., early childhood, elementary, middle, and secondary levels)
  • Representation from professional associations and other organizations
  • Representation from diverse racial, ethnic, and cultural groups
  • Representation from females and males
  • Geographic representation
  • Representation from diverse school settings (e.g., urban areas, rural areas, large schools, small schools, charter schools)

The CACs met during the test development process for the following activities:

  • Review of test objectives and assessment specifications
  • Review of content validation survey results
  • Review of test items
  • Marker response establishment for open-response items

Bias Review Committee

A Bias Review Committee (BRC) comprising up to 20 Massachusetts educators was established to review new and updated test materials to help prevent potential bias. BRC members were a diverse group of educators representing individuals with disabilities and the racial, gender, ethnic, and regional diversity of Massachusetts.

The establishment of the BRC mirrored the process for establishing the CACs. Educators were nominated and encouraged to apply for membership by public school administrators, deans at higher education institutions, public school educators, educator organizations, academic and professional associations, and other sources specified by the Department. The Department reviewed educator applications to select the educators based on their qualifications to serve on the BRC.

In general, the committee members were selected to provide representation for the following groups:

  • Diverse racial, ethnic, and cultural groups
  • Persons with expertise in special needs (including representation from persons who are deaf or hard of hearing)
  • Females and males

In addition, committee members were selected with consideration given to representation from

  • public school educators and higher education faculty preparing prospective educators;
  • geographic location; and
  • diverse school settings (e.g., urban areas, rural areas, large schools, small schools, charter schools).

The BRC met during the test development process for the following activities:

  • Review of test objectives and assessment specifications
  • Review of test items
  • Marker response selection*
  • Setting qualifying scores*

*For these activities, members of the BRC were invited to participate for test fields in which they were licensed and practicing or were preparing candidates for licensure.

The BRC worked on a parallel track with the CACs. Typically, the BRC reviewed materials shortly before the materials were reviewed by the CACs. BRC members were provided with a copy of Fairness and Diversity in Tests (Evaluation Systems, 2009) before beginning their work. The BRC comments and suggestions were communicated to the CAC members, who made the revisions. If a bias-related revision by the CAC differed substantively from what was suggested by the BRC, follow-up was conducted with a member of the BRC to make sure the revision was mutually agreed-upon.

Qualifying Score Panels

A Qualifying Score Panel of Massachusetts educators (typically up to 20) was established for each test field to provide judgments to be used in setting the passing scores for the tests. These panels typically included some members from the CAC for the field and, in some cases, BRC members qualified in the field, as well as additional educators meeting the same eligibility guidelines as the CAC members.

The selection process for the Qualifying Score Panels mirrored the selection process for the CACs.

Panel members were approved by the Department to include

  • public school educators (typically a majority of the members), and
  • higher education faculty preparing prospective educators (arts and sciences, fine arts, and/or education faculty).

In addition, panel members were selected with consideration given to the following criteria:

  • Representation from different levels of teaching (i.e., early childhood, elementary, middle, and secondary levels)
  • Representation from professional associations and other organizations
  • Representation from diverse racial, ethnic, and cultural groups
  • Representation from females and males
  • Geographic representation
  • Representation from diverse school settings (e.g., urban areas, rural areas, large schools, small schools, charter schools)

Members of the panels made recommendations that were used by the Commissioner, in part, in establishing the passing score for each test.

Test Objective Development and Review

Standard 11.2 Evidence of validity based on test content requires a thorough and explicit definition of the content domain of interest. Standards for Educational and Psychological Testing (AERA, APA & NCME, 2014)

Test Objectives

As indicated previously, validity refers to the degree to which evidence and theory support the interpretations of test scores for proposed uses of tests. The validity evidence for the MTEL focuses on the use of the MTEL for making pass/fail determinations for the purpose of educator licensure. In order to make those pass/fail determinations appropriately, it is important that the test content be explicitly defined and that it align with Massachusetts regulations and other policy and practice regarding educator licensure. The MTEL Test Objectives provide explicit descriptions of the content eligible to be included on the tests.

The purposes of the test objectives include

  • establishing a link between test content and Massachusetts legal, policy, and regulatory sources;
  • communicating to policymakers, educators, and other stakeholders how standards and expectations for educators in Massachusetts are embodied in the MTEL;
  • presenting an organized summary of subject matter expectations for candidates preparing to take the test as well as higher education faculty responsible for preparing prospective educators; and
  • providing a structure for score reporting and score interpretation.

The MTEL test objectives (available on the MTEL program website) include a table indicating the weighting of the content subareas of the test. A sample is provided below.

MULTI-SUBJECT SUBTEST

Multi-Subject Subtest weighting by number of questions per subarea

Subarea                                        Range of Objectives   Approximate Test Weighting
Multiple-Choice
  Language Arts                                01–05                 30%
  History and Social Science                   06–09                 30%
  Science and Technology/Engineering           10–14                 30%
                                                                     90%
Open-Response
  Integration of Knowledge and Understanding   15                    10%

The test objectives provide the subareas, objectives, and descriptive statements that define the content of the test. A sample is provided below.

Subarea I—Language Arts

Objective 0002: Understand American literature and selected literature from classical and contemporary periods.

For example:

  • Recognize historically or culturally significant works, authors, and themes of U.S. literature.
  • Demonstrate knowledge of selected literature from classical and contemporary periods.
  • Recognize literature of other cultures.
  • Recognize elements of literary analysis (e.g., analyzing story elements, interpreting figurative language).
  • Demonstrate knowledge of varied focuses of literary criticism (e.g., the author, the context of the work, the response of the reader).

Preparation of Test Objectives

As an initial step in preparing the MTEL Test Objectives, Evaluation Systems, in partnership with the Massachusetts Department of Elementary and Secondary Education (the Department), systematically reviewed relevant documents that established the basis for the content of the tests and incorporated the content of the documents into the draft test objectives. Relevant documents included the following:

  • 603 CMR 7.00 Regulations for Educator Licensure and Preparation Program Approval. The Regulations for Educator Licensure and Preparation Program Approval specify the subject matter knowledge requirements for educators in each of the areas and grade levels for which educator licenses are granted in Massachusetts. The subject matter knowledge requirements to be covered on the test are outlined in the Regulations. Evaluation Systems prepared the draft objectives so as to align their content and structure with the Regulations. See Supplemental Test Development Information for the Regulations for specific tests.
  • Massachusetts Curriculum Frameworks. The Department and Massachusetts educators have prepared student learning standards from Pre-Kindergarten to twelfth grade in the form of a curriculum framework for each of the following eight areas:
    • Arts (Dance, Music, Theatre, Visual Arts)
    • English Language Arts
    • Foreign Languages
    • Comprehensive Health
    • History and Social Science
    • Mathematics
    • Science and Technology/Engineering
    • Vocational Technical Education
    Each curriculum framework delineates the learning standards, with specific examples, that Massachusetts students in Pre-Kindergarten through twelfth grade should achieve through instruction by their educators. In drafting the test objectives, Evaluation Systems incorporated the subject matter covered in the Massachusetts curriculum frameworks at a level appropriate for the knowledge and skills needed by educators to teach the subject matter in Massachusetts public schools.
  • Additional documents. In preparing the test objectives, Evaluation Systems reviewed additional documents, as appropriate for the field, to supplement the regulations and curriculum frameworks. These documents included national and state standards that exist for a field and, where appropriate, curriculum materials used by Massachusetts public schools and state colleges and universities. Evaluation Systems added subject matter to the draft test objectives based on the review of the additional documents.

Documentation of Correspondence between Test Objectives and Sources

Standard 11.3 When test content is a primary source of validity evidence in support of the interpretation for the use of a test for employment decisions or credentialing, a close link between test content and the job or professional/occupational requirements should be demonstrated. Standards for Educational and Psychological Testing (AERA, APA & NCME, 2014)

Evaluation Systems prepared, as a component of the documentation of the MTEL program for validation purposes, a correspondence chart linking the test objectives to Massachusetts sources from which they were derived. The correspondence charts focus on the links between the test objectives and the relevant Massachusetts regulations and curriculum frameworks. See Supplemental Test Development Information for correspondence charts for specific fields.

Assessment Specifications

Standard 4.2 In addition to describing intended uses of the test, the test specifications should define the content of the test, the proposed test length, the item formats, the desired psychometric properties of the test items and the test, and the ordering of items and sections. Test specifications should also specify the amount of time allowed for testing; directions for the test takers; procedures to be used for test administration, including permissible variations; any materials to be used; and scoring and reporting procedures. Standards for Educational and Psychological Testing (AERA, APA & NCME, 2014)

Assessment specifications documents that describe major aspects of test design and test administration were prepared at the outset of MTEL test development. The assessment specifications contain two sections: an introductory section designed to be consistent across tests, and a test-specific section. The introductory section, prepared by Evaluation Systems and the Department, contains information regarding

  • the background and purpose of the MTEL;
  • the purpose of the test objectives, and the structure of the test content into subareas, objectives, and descriptive statements;
  • test item formats;
  • bias prevention;
  • test composition and length;
  • test administration and testing time;
  • test scoring; and
  • test reporting.

The purpose of the introductory section of the assessment specifications was to provide the BRC and CACs with contextual information about the tests and test operations to better enable them to review assessment materials and conduct other test development tasks. Additionally, the information helped preserve consistency across MTEL tests during the development process.

The field-specific Measurement Notes section of the assessment specifications was drafted by Evaluation Systems for review and revision by the BRC and relevant CAC. This section includes information such as field-specific terminology to be used in the test items, resources to be consulted, specifications regarding item stimuli, and open-response item guidelines. The purpose of the Measurement Notes was to provide a mechanism for communicating agreed-upon item development specifications to test developers.

Bias Review of Test Objectives and Assessment Specifications

Standard 3.2 Test developers are responsible for developing tests that measure the intended construct and for minimizing the potential for tests being affected by construct-irrelevant characteristics, such as linguistic, communicative, cognitive, cultural, physical, or other characteristics. Standards for Educational and Psychological Testing (AERA, APA & NCME, 2014)

The BRC served a central role in helping to safeguard that the tests measure intended constructs and in minimizing the potential for irrelevant characteristics affecting candidates' scores. The BRC was convened at the beginning of the test development process to review the test objectives and assessment specifications to help determine if the materials contained characteristics irrelevant to the constructs being measured that could interfere with some test takers' ability to respond. The BRC used bias review criteria established for the MTEL program regarding content, language, offensiveness, and stereotypes. Committee members were asked to review the proposed test objectives (including subareas, objectives, and descriptive statements) and the Measurement Notes section of the assessment specifications according to the following review criteria:

Objectives

Content: Does the test objective contain nonessential content that disadvantages a person because of her or his gender, race, nationality, ethnicity, religion, age, sexual orientation, disability, or cultural, economic, or geographic background?

Language: Does the test objective contain nonessential language that disadvantages a person because of her or his gender, race, nationality, ethnicity, religion, age, sexual orientation, disability, or cultural, economic or geographic background?

Offensiveness: Is the test objective presented in such a way as to offend a person because of her or his gender, race, nationality, ethnicity, religion, age, sexual orientation, disability, or cultural, economic or geographic background?

Stereotypes: Does the test objective contain language or content that reflects a stereotypical view of a group based on gender, race, nationality, ethnicity, religion, age, sexual orientation, disability, or cultural, economic, or geographic background?

Assessment Specifications

Content: Does any element of the assessment specifications contain nonessential content that disadvantages a person because of her or his gender, race, nationality, ethnicity, religion, age, sexual orientation, disability, or cultural, economic, or geographic background?

Language: Does any element of the assessment specifications contain nonessential language that disadvantages a person because of her or his gender, race, nationality, ethnicity, religion, age, sexual orientation, disability, or cultural, economic, or geographic background?

Offensiveness: Is any element of the assessment specifications presented in such a way as to offend a person because of her or his gender, race, nationality, ethnicity, religion, age, sexual orientation, disability, or cultural, economic or geographic background?

Stereotypes: Does any element of the assessment specifications contain language or content that reflects a stereotypical view of a group based on gender, race, nationality, ethnicity, religion, age, sexual orientation, disability, or cultural, economic, or geographic background?

The BRC reviewed the draft test objectives and assessment specifications with guidance from Evaluation Systems facilitators, and members were asked to come to consensus regarding any recommended revisions. Recommendations for revisions were presented to the CAC convened for review of the same materials. The CAC was instructed to address all bias-related issues raised by the BRC. If a revision by the CAC differed substantively from what was suggested by the BRC, follow-up was conducted with a member of the BRC to make sure the revision was mutually agreed-upon.

Content Reviews of Test Objectives and Assessment Specifications

The CACs were convened to review the proposed test objectives, including descriptive statements and subareas, as well as the assessment specifications.

For the test objectives, the CAC used review criteria regarding the structure of the test objectives (including the weighting of the content subareas containing the test objectives) and the content of the objectives. Committee members applied the following review criteria established for the MTEL program related to program purpose, organization, completeness, significance, accuracy, freedom from bias, and job-relatedness:

Structure of Test Objectives

Program Purpose: Are the subareas and test objectives consistent with the purpose of the MTEL (i.e., to determine whether prospective educators have the subject matter knowledge required for entry-level teaching in Massachusetts)?

Organization: Are the subareas and test objectives organized appropriately and understandably for the field? Does the structure of the test objectives support an appropriately balanced test? Are the subareas appropriately emphasized in relation to each other?

Completeness: Is the content of the subareas and test objectives complete? Do the subareas and test objectives reflect the subject matter knowledge an educator should have in order to teach? Is there any subject matter that should be added?

Objectives

Significance: Do the test objectives describe subject matter knowledge that is important, and at a level of cognitive complexity that is appropriate, for an educator in Massachusetts?

Accuracy: Do the test objectives accurately reflect the subject matter as it is understood by educators in the field? Are the test objectives stated clearly and accurately, using appropriate terminology?

Freedom from bias: Are the test objectives free of nonessential elements that might potentially disadvantage an individual because of her or his gender, race, ethnicity, nationality, national origin, religion, age, sexual orientation, disability, or cultural, economic, or geographic background?

Job-relatedness: Do the test objectives clearly relate to the subject matter knowledge required for entry-level teaching in Massachusetts?

CAC members were also asked to review the proposed Measurement Notes section of the assessment specifications according to the following review criteria:

Assessment Specifications

Program Purpose: Are the assessment specifications consistent with the purpose of the MTEL (i.e., to determine whether prospective educators have the subject matter knowledge required for entry-level teaching in Massachusetts)?

Significance: Do the assessment specifications describe subject matter knowledge that is important for an educator in Massachusetts?

Accuracy: Do the assessment specifications accurately reflect the subject matter as it is understood by educators in the field? Are the assessment specifications stated clearly and accurately, using appropriate terminology?

Freedom from bias: Are the assessment specifications free of nonessential elements that might potentially disadvantage an individual because of her or his gender, race, ethnicity, nationality, national origin, religion, age, sexual orientation, disability, or cultural, economic, or geographic background?

Job-relatedness: Do the assessment specifications clearly relate to the subject matter knowledge required for entry-level teaching in Massachusetts?

The CAC reviewed and revised the draft test objectives and Measurement Notes section of the assessment specifications through a process of discussion and consensus, with the guidance of an Evaluation Systems facilitator. During the discussion, the committee incorporated revisions suggested by the BRC. Following the committee's consensus review and revision of the test objectives, committee members independently provided a validity rating to verify that the final objectives, as agreed upon in the consensus review, were significant, accurate, free from bias, and job-related. Committee members also had the opportunity to make additional comments regarding the test objectives.

Following the review meeting, Evaluation Systems revised the test objectives and assessment specifications according to the recommendations of the CAC. The Department approved the draft test objectives for use in the content validation survey and the assessment specifications for use in test item development.

Content Validation Surveys

Standard 11.3 When test content is a primary source of validity evidence in support of the interpretation for the use of a test for employment decisions or credentialing, a close link between test content and the job or professional/occupational requirements should be demonstrated. Standards for Educational and Psychological Testing (AERA, APA & NCME, 2014)

The MTEL content validation surveys are an important component of the validity evidence in support of the content of the MTEL. The surveys validate the test objectives that form the basis of test content by ascertaining that job incumbents (i.e., Massachusetts public school educators) and educator experts (i.e., educator preparation faculty) consider the content of each test objective important for entry-level teaching. The surveys provide additional evidence of linkage of the test content to job requirements, beyond the correspondence charts linking the test objectives to the relevant Massachusetts regulations and curriculum frameworks.

The purpose of the survey was to obtain judgments from Massachusetts public school educators and educator preparation faculty about

  • the importance of each objective for entry-level teaching in Massachusetts public schools;
  • how well each set of descriptive statements represents important aspects of the corresponding objective; and
  • how well the set of objectives, as a whole, represents the subject matter knowledge needed for entry-level teaching in Massachusetts public schools.

Survey of Massachusetts Public School Educators

To be eligible to respond to the survey of public school educators, an individual needed to be a licensed, practicing educator in a teaching field associated with the test field. A database of full-time equivalent (FTE) teachers assigned to each teaching field for each school district in Massachusetts was provided by the Massachusetts Department of Elementary and Secondary Education (the Department) for use in drawing a random sample of educators. Typically, 200 public school educators were randomly selected for a field (or the entire population, for fields with fewer than 200 educators statewide). For high-incidence fields such as Communication and Literacy Skills and General Curriculum, 400 public school educators were sampled.

In order to obtain appropriate representation of minority groups (e.g., American Indian, Asian, Black, Hispanic) in the survey results, oversampling of minority groups was conducted for those fields for which the entire population was not sampled. In these instances, the survey results were weighted appropriately to take the oversampling into account.
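As an illustration of the weighting described above, the sketch below shows one common way to derive post-stratification weights when a group has been oversampled: each respondent receives a weight equal to the group's share of the population divided by its share of the respondents. The group labels and counts are hypothetical, not MTEL data, and the program's operational weighting procedure may differ in detail.

```python
# Illustrative sketch only: post-stratification weights for an oversampled group.
# Group labels and counts are hypothetical, not MTEL program data.

population_counts = {"Group A": 1800, "Group B": 200}   # educators in the field, by group
respondent_counts = {"Group A": 120, "Group B": 40}     # survey respondents, by group (B oversampled)

pop_total = sum(population_counts.values())
resp_total = sum(respondent_counts.values())

weights = {
    group: (population_counts[group] / pop_total) / (respondent_counts[group] / resp_total)
    for group in population_counts
}

# Each respondent's ratings are multiplied by his or her group's weight when computing
# weighted survey results, so the oversampled group does not dominate the averages.
print(weights)  # approximately {'Group A': 1.2, 'Group B': 0.4}
```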

The population and proposed sample for each field was reviewed by the Department. See Supplemental Test Development Information for further information about the population, sample, and respondents for specific fields.

Advance notification materials were sent to school district superintendents and principals of schools with sampled educators. A subsequent mailing to principals included instruction letters and Web survey access codes for sampled educators. Principals were asked to distribute the materials to the sampled educators or, for sampled educators no longer employed at the school, to their replacements.

To determine eligibility to complete the survey, recipients responded to the following two questions at the beginning of the survey (with slight variations for certain fields):

  • Do you currently hold a valid Massachusetts teacher license in the field indicated?
  • Are you now teaching this year, or did you teach in the last year, the subject of [field name] in Massachusetts at [specified grade levels]?

Respondents provided background information about current license(s), level of education, gender, ethnicity, academic preparation, and years of professional teaching experience.

The public school educators were asked to respond to the following questions (with slight variations for certain fields):

Objective rating question: How important is the subject matter knowledge described by the objective below for entry-level teaching in this field in Massachusetts public schools?

1 = no importance
2 = little importance
3 = moderate importance
4 = great importance
5 = very great importance

Descriptive statement rating question: How well does the set of descriptive statements represent important aspects of the objective?

1 = poorly
2 = somewhat
3 = adequately
4 = well
5 = very well

Overall rating question: How well does the set of objectives, as a whole, cover the subject matter knowledge required for entry-level teaching in this field in Massachusetts public schools?

1 = poorly
2 = somewhat
3 = adequately
4 = well
5 = very well

For each survey question, participants were asked to provide a comment for any rating less than "3," including noting any additional important subject matter knowledge that should be included.

Evaluation Systems monitored the survey access codes and was able to track responses by school and educator. A number of follow-up activities were conducted with non-responding schools and educators, including telephone calls, emailing or re-mailing survey materials to schools, and extending the deadline for returns.

Survey of Faculty at Massachusetts Institutions of Higher Education

A separate content validation survey was conducted with faculty from institutions with state-approved educator preparation programs that offer courses for specified licensure fields. To be eligible to participate in the survey, an individual needed to be teaching education candidates in the subject matter areas associated with the test field; eligible respondents included arts and sciences, fine arts, and education faculty. Responses from up to 100 faculty were targeted for fields where eligible faculty existed. (For the high-incidence fields of Communication and Literacy Skills and General Curriculum, 200 faculty were targeted.) Typically, institutions were categorized as having high, medium, or low enrollment for a given field, and the number of surveys distributed to each preparation program was based on this categorization.

Advance notification emails were sent to educator preparation program contacts at higher education institutions included in the sample, as well as to the deans of education and arts and sciences. A subsequent mailing with instruction letters and Web survey access codes was sent to the designated contact person at each institution, along with instructions for identifying faculty members eligible to complete the survey.

To determine eligibility to complete the survey, recipients responded to the following question at the beginning of the survey (with slight variation for certain fields):

Are you now teaching, or have you taught, during this or the previous year, undergraduate or graduate arts and sciences, fine arts, or education courses in [field name] in which prospective teachers may have been enrolled?

Respondents provided background information about level of education, gender, ethnicity, academic preparation, years of teaching experience, and type of department or higher education institution appointment.

Faculty members were asked to respond to the same survey rating questions as public school educators.

For each survey question, participants were asked to provide a comment for any rating less than "3," including noting any additional important subject matter knowledge that should be included.

Evaluation Systems monitored the survey access codes and was able to track responses by institution and faculty member. A number of follow-up activities were conducted with non-responding institutions and faculty members, including reminder emails, telephone calls, re-providing survey access codes, and extending the deadline for returns.

Analysis of the Content Validation Surveys

Evaluation Systems analyzed the content validation data for each field separately for public school educators and higher education faculty. The following reports were produced. See Supplemental Test Development Information for Content Validation Survey Reports for specific fields.

Content Validation Survey Population/Sample/Respondents Demographics: Indicates the composition of the educator group for the population, sample, and survey respondents (for public school educator survey only).

Survey Return Rate by Field and Return Status: Indicates the number and percent of surveys distributed and returned.

Number Sent = Number of surveys sent to contact persons

Number Not Distributed = Number of surveys reported as not distributed by contact persons

Returned Surveys Eligible = Number of respondents who answered "yes" to all eligibility questions

Returned Surveys Ineligible = Number of respondents who answered "no" to one or more of the eligibility questions or left an eligibility question blank

Returned Surveys Incomplete = Number of respondents who answered "yes" to both eligibility questions but did not provide any objective ratings

Returned Surveys Total = Total number of eligible, ineligible, and incomplete surveys returned

Adjusted Return Rate = Number of eligible respondents divided by (the number of surveys sent minus the numbers not distributed, returned ineligible, and returned incomplete)
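The following sketch illustrates the Adjusted Return Rate calculation defined above; the counts are hypothetical, not program data.

```python
# Illustrative sketch of the Adjusted Return Rate, using hypothetical counts.
number_sent = 400
number_not_distributed = 25
returned_eligible = 180
returned_ineligible = 30
returned_incomplete = 10

adjusted_denominator = (number_sent - number_not_distributed
                        - returned_ineligible - returned_incomplete)
adjusted_return_rate = returned_eligible / adjusted_denominator
print(f"Adjusted return rate: {adjusted_return_rate:.1%}")  # 53.7%
```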

Demographic Summary Report: Indicates participant responses to the eligibility and background information questions.

Absolute Frequency = The number of respondents selecting each response category, including no response

Relative Percent = The percent of respondents selecting each response category, including no response (number selecting each response category divided by number of eligible returned surveys)

Adjusted Percent = The percent of respondents selecting each response category, excluding no response (number selecting each response category divided by number of eligible returned surveys minus no response)
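The sketch below illustrates the difference between Relative Percent (no response included in the base) and Adjusted Percent (no response excluded) for one background question; the counts are hypothetical, not program data.

```python
# Illustrative sketch of Relative Percent vs. Adjusted Percent, using hypothetical counts.
eligible_returns = 200
responses = {"Bachelor's": 60, "Master's": 110, "Doctorate": 20, "No response": 10}

answered = eligible_returns - responses["No response"]

for category, count in responses.items():
    relative = count / eligible_returns          # base includes "No response"
    line = f"{category}: relative {relative:.1%}"
    if category != "No response":
        adjusted = count / answered              # base excludes "No response"
        line += f", adjusted {adjusted:.1%}"
    print(line)
```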

Objective Rating Report: Indicates the average importance rating given to each objective and the average across all objectives. For fields in which oversampling of minority groups was conducted for the public school educator survey, weighted data are provided to appropriately take the oversampling into account.

N = Number of respondents

Importance Ratings: Mean = Mean rating by respondents on the 1–5 scale

Importance Ratings: S.D. = Standard deviation of ratings by respondents

Importance Ratings: S.E. = Standard error of ratings by respondents (for unweighted data only)

Response Distribution (in %) = Percent of respondents selecting each rating 1–5 and no response (NR)
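The sketch below shows, with hypothetical ratings rather than survey data, how the statistics reported for a single objective (N, mean, standard deviation, standard error, and response distribution) can be computed from respondents' 1–5 importance ratings.

```python
# Illustrative sketch of per-objective rating statistics, using hypothetical ratings.
import math
from collections import Counter

ratings = [5, 4, 4, 3, 5, 2, 4, 5, 3, 4]      # one respondent group's ratings of one objective
n = len(ratings)
mean = sum(ratings) / n
sd = math.sqrt(sum((r - mean) ** 2 for r in ratings) / (n - 1))   # sample standard deviation
se = sd / math.sqrt(n)                                            # standard error (unweighted data)

counts = Counter(ratings)
distribution = {rating: counts.get(rating, 0) / n for rating in range(1, 6)}

print(f"N={n}, mean={mean:.2f}, SD={sd:.2f}, SE={se:.2f}")
print("Response distribution:", {k: f"{v:.0%}" for k, v in distribution.items()})
```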

Descriptive Statement Rating Report: Indicates the average importance rating given to each set of descriptive statements and the average across all sets of descriptive statements. For fields in which oversampling of minority groups was conducted for the public school educator survey, weighted data are provided to appropriately take the oversampling into account.

N = Number of respondents

Importance Ratings: Mean = Mean rating by respondents on the 1–5 scale

Importance Ratings: S.D. = Standard deviation of ratings by respondents

Importance Ratings: S.E. = Standard error of ratings by respondents (for unweighted data only)

Response Distribution (in %) = Percent of respondents selecting each rating 1–5 and no response (NR)

Composite Rating Report: Indicates the average rating given to the set of objectives as a whole. For fields in which oversampling of minority groups was conducted for the public school educator survey, weighted data are provided to appropriately take the oversampling into account.

N = Number of respondents

Importance Ratings: Mean = Mean rating by respondents on the 1–5 scale

Importance Ratings: S.D. = Standard deviation of ratings by respondents

Importance Ratings: S.E. = Standard error of ratings by respondents (for unweighted data only)

Response Distribution (in %) = Percent of respondents selecting each rating 1–5 and no response (NR)

Objective Ratings Summary: Combines into one report the average objective importance ratings given by public school educators and faculty (prepared for review by MTEL CACs).

Number Teachers = Number of public school educator respondents

Number Faculty = Number of faculty respondents

Mean Objective Rating: Teachers = Mean rating by public school educator respondents on the 1–5 scale

Mean Objective Rating: Faculty = Mean rating by faculty respondents on the 1–5 scale

Description of Objective = Objective text

Respondent comments regarding the objectives, descriptive statements, and set of objectives as a whole were sorted and categorized to facilitate review (e.g., sorted by relevant objective).

The analyses of survey return rates, demographic summaries, survey ratings, and participant comments were provided to the Department for review. The Department determined if any changes were warranted to test objectives based on the survey results (e.g., additions or revisions to content or terminology in a descriptive statement). Objectives that both public school educators and educator preparation faculty indicated were important (objectives with mean importance ratings of 3.00 or higher for each respondent group) were considered eligible for inclusion on the test. Any objectives with a mean importance rating of less than 3.00 for either respondent group were identified for further review and discussion (e.g., with MTEL advisors or committee members).
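The eligibility rule described above amounts to the following check; the objective numbers and mean ratings in this sketch are hypothetical, not MTEL survey results.

```python
# Illustrative sketch of the 3.00 eligibility rule, using hypothetical mean ratings.
mean_ratings = {
    # objective: (public school educator mean, higher education faculty mean)
    "0001": (4.21, 4.05),
    "0002": (3.87, 2.94),
    "0003": (3.02, 3.11),
}

for objective, (teacher_mean, faculty_mean) in mean_ratings.items():
    if teacher_mean >= 3.00 and faculty_mean >= 3.00:
        status = "eligible for inclusion on the test"
    else:
        status = "identified for further review and discussion"
    print(f"Objective {objective}: {status}")
```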

Test Item Development and Review, Pilot Testing, and Marker Establishment

Standard 4.7 The procedures used to develop, review, and try out items and to select items from the item pool should be documented. Standards for Educational and Psychological Testing (AERA, APA & NCME, 2014)

For test fields being newly developed, the test items were newly written and underwent a set of rigorous reviews, as described in this section. For test fields being updated, test items were newly developed and/or continued from an existing bank. All the items, both new and existing, underwent the full review process.

Test Item Preparation

Test item preparation combined the expertise of content specialists (i.e., experts in the field-specific content areas) and experienced item development specialists. Evaluation Systems supervised the item drafting process, which involved program staff, content experts, psychometricians, and item development specialists. Item development teams were provided with program policy materials (e.g., Massachusetts regulations, curriculum frameworks), committee-approved test objectives and assessment specifications, Evaluation Systems item preparation and bias prevention materials, and additional materials as appropriate for the field (e.g., textbooks, online resources). For fields being updated, items that matched the new test objectives and had appropriate psychometric characteristics were eligible to be retained in the bank for review by MTEL advisory committees. Additionally, some items from the previous bank were revised and included in the new bank for review by advisory committees.

Preliminary versions of test items were reviewed by specialists with content expertise in the appropriate field (e.g., teachers, college faculty, other specialists) as a preliminary check of the items' accuracy, clarity, and freedom from bias. Test items were provided to the Massachusetts Department of Elementary and Secondary Education (the Department) for review, and any changes or suggestions made by the Department were incorporated or provided to the advisory committees for consideration at review meetings.

Bias Review of Test Items

Standard 3.2 Test developers are responsible for developing tests that measure the intended construct and for minimizing the potential for tests being affected by construct-irrelevant characteristics, such as linguistic, communicative, cognitive, cultural, physical, or other characteristics. Standards for Educational and Psychological Testing (AERA, APA & NCME, 2014)

The BRC was convened to review draft test items to help safeguard that the test items measured the intended constructs and to minimize characteristics irrelevant to the constructs being measured that could interfere with some test takers' ability to respond. The BRC reviewed items according to the following established bias review criteria for the MTEL regarding content, language, offensiveness, and stereotypes:

Content: Does the test item contain nonessential content that disadvantages a person because of her or his gender, race, nationality, ethnicity, religion, age, sexual orientation, disability, or cultural, economic, or geographic background?

Language: Does the test item contain nonessential language that disadvantages a person because of her or his gender, race, nationality, ethnicity, religion, age, sexual orientation, disability, or cultural, economic, or geographic background?

Offensiveness: Is the test item presented in such a way as to offend a person because of her or his gender, race, nationality, ethnicity, religion, age, sexual orientation, disability, or cultural, economic, or geographic background?

Stereotypes: Does the test item contain language or content that reflects a stereotypical view of a group based on gender, race, nationality, ethnicity, religion, age, sexual orientation, disability, or cultural, economic, or geographic background?

The BRC reviewed the draft test items with the guidance of Evaluation Systems facilitators, and members were asked to come to consensus regarding any recommended revisions. The BRC also had the opportunity to submit content-related questions for consideration by the CAC. Recommendations for revisions and content-related questions were presented to the CAC convened for review of the same materials. The CAC was instructed to address all bias-related issues raised by the BRC. If a revision by the CAC differed substantively from what was suggested by the BRC, follow-up was conducted with a member of the BRC to make sure the revision was mutually agreed-upon or the item was deleted.

Content Review of Test Items

Standard 4.8 The test review process should include empirical analyses and/or the use of expert judges to review items and scoring criteria. When expert judges are used, their qualifications, relevant experiences, and demographic characteristics should be documented, along with instructions and training in the item review process that the judges receive. Standards for Educational and Psychological Testing (AERA, APA & NCME, 2014)

The CACs, composed of Massachusetts public school educators and faculty preparing prospective educators associated with the test field, served as expert judges to review the test items. See Establishing Advisory Committees for further information about the qualifications of the committee members.

CAC members reviewed draft test items according to review criteria established for the MTEL program regarding objective match, accuracy, freedom from bias, and job-relatedness. Additionally, committee members provided final recommendations of test subarea proportions based on results of the content validation survey.

For their review of subarea proportions, committee members were provided with Objective Ratings Summaries that summarized the mean importance ratings for each objective and the overall mean objective rating for the field. See Supplemental Test Development Information for Objective Ratings Summaries for specific fields. Committee members were asked to make their final recommendations for weighting each of the multiple-choice section subareas based on their original recommendations and the survey results.

For their review of draft test items, committee members applied the following criteria:

Multiple-Choice Item Review Criteria

Objective Match:

  • Does the item measure an important aspect of the objective(s)?
  • Is the level of difficulty appropriate for the testing program?
  • Are the items, as a whole, consistent with the purpose of the MTEL program?

The item should measure a significant aspect of the stated objective; it need not cover the entire objective. The level of difficulty of the item should be appropriate for entry-level teaching in Massachusetts public schools. The reading level of the item should also be appropriate for entry-level educators.

Accuracy:

  • Is the content accurate?
  • Does the answer key identify the correct response?
  • Are the distractors plausible yet clearly incorrect?
  • Is the terminology appropriate?
  • Is the item grammatically correct?
  • Are the stem and response alternatives clear in meaning?
  • Is the wording of the item stem free of clues that point toward the correct answer?
  • Is the graphic (if any) accurate and relevant to the item?

The item's subject matter, terminology, and grammar must be accurate. An item must have a correct (or best) answer. All distractors for an item must be plausible responses, yet clearly incorrect or not the best answer. An item should not be tricky, purposely misleading, or ambiguous. The stem should not include clues to the correct response. Any graphic accompanying an item must be accurate and appropriate for the item.

Freedom from Bias:

  • Is the item free of nonessential language, content, or stereotypes that might potentially disadvantage or offend an individual because of her or his gender, race, ethnicity, nationality, national origin, religion, age, sexual orientation, disability, or cultural, economic, or geographic background?
  • Are the items, as a whole, fair to all individuals regardless of gender, race, ethnicity, nationality, national origin, religion, age, sexual orientation, disability, or cultural, economic, or geographic background?
  • As a whole, do the items include subject matter that reflects the people of Massachusetts?

An individual's ability to respond to the item should not be hindered by her or his gender, race, ethnicity, nationality, national origin, religion, age, sexual orientation, disability, or cultural, economic, or geographic background. (Note: In some cases, the inclusion of a specific gender, race, or ethnicity is a function of the requirement of the objective for which the item was written.)

Job-Relatedness:

  • Is the subject matter job-related?
  • Does the item measure subject matter knowledge that an educator needs on the job in Massachusetts public schools?
  • Does the item measure subject matter knowledge that an educator should be expected to know at the entry level (i.e., not learned on the job)?

The item should measure important subject matter knowledge that an educator needs to function on the job. This may be subject matter that is taught or in some other way used by an educator in carrying out the job. The subject matter should reflect knowledge that may be expected of an educator at the entry level, not subject matter that would be learned later, on the job.

 

Open-Response Item Review Criteria

Objective Match:

  • Does the item measure important aspects of the test area as defined by the set of objectives?
  • Is the level of difficulty appropriate for the testing program?
  • Are the items, as a whole, consistent with the purpose of the MTEL program?

Accuracy:

  • Is the content of the item accurate?
  • Is the terminology appropriate?
  • Is the item clearly stated?
  • Is the graphic (if any) accurate and relevant to the item?

Freedom from Bias:

  • Is the item free of nonessential language, content, or stereotypes that might potentially disadvantage or offend an individual because of her or his gender, race, ethnicity, nationality, national origin, religion, age, sexual orientation, disability, or cultural, economic, or geographic background?
  • Are the items, as a whole, fair to all individuals regardless of gender, race, ethnicity, nationality, national origin, religion, age, sexual orientation, disability, or cultural, economic, or geographic background?
  • As a whole, do the items include subject matter that reflects the people of Massachusetts?

Job-Relatedness:

  • Is the subject matter job-related?
  • Does the item measure subject matter knowledge that an educator needs on the job in Massachusetts schools?
  • Does the item measure subject matter knowledge that an educator should be expected to know at the entry level (i.e., not learned on the job)?

The CAC reviewed and revised the draft test items through a process of discussion and consensus, with the guidance of an Evaluation Systems facilitator. During the discussion, the committee incorporated revisions suggested by the BRC. Following the committee's review of each item and documentation of any changes made to the item, committee members independently provided a validity rating to verify that the final item, as agreed upon in the consensus review, was matched to the objective, accurate, free from bias, and job-related. Committee members also had the opportunity to make additional comments related to the review criteria.

Following the item review meetings, Evaluation Systems reviewed the item revisions and validity judgments and revised the test items according to the recommendations of the CACs. Evaluation Systems documented the BRC recommendations and resolutions of the recommendations; additionally, any post-conference editorial revisions (beyond typographical revisions) were documented (e.g., rewording a committee revision for clarity or consistency with other response alternatives). The documentation of revisions was submitted to the Department for final approval of the test items. The revised test items were then prepared for pilot testing throughout Massachusetts.

Pilot Testing

Standard 4.8 The test review process should include empirical analyses and/or the use of expert judges to review items and scoring criteria.

Standard 4.9 When item or test form tryouts are conducted, the procedures used to select the sample(s) of test takers as well as the resulting characteristics of the sample(s) should be documented. The sample(s) should be as representative as possible of the population(s) for which the test is intended. Standards for Educational and Psychological Testing (AERA, APA & NCME, 2014)

In addition to the review of test items by expert judges (members of the BRC and CACs), empirical data about the statistical and qualitative characteristics of the new items were collected through pilot testing when candidate populations were sufficient. When it was feasible to collect enough data to support meaningful analyses, this additional information was used to refine the item banks before the items were used on a scorable basis on operational test forms.

Where the test design and number of candidates allowed, new and revised test items for tests being updated were pilot tested on operational test forms in non-scorable slots. Candidates were unaware which items were scorable and which were non-scorable. The sample of pilot test takers therefore mirrored the operational test-taking population in both composition and motivation.

For some fields, pilot testing was conducted with volunteer participants before the items were introduced on operational test forms. Pilot tests were administered to candidates with characteristics generally mirroring those of candidates who would eventually take the MTEL operational test forms. Eligible participants included candidates in educator preparation programs planning to seek Massachusetts licensure in the test fields being pilot tested, students with postsecondary education in the selected fields, registered candidates for operational MTEL administrations in the selected fields, and persons recently licensed in the pilot test fields. Participants were offered an incentive, such as a gift card or a voucher to offset future testing fees.

Pilot testing was conducted at stand-alone events held at MTEL operational test administrations (including with candidates registered to take operational tests) and at Massachusetts higher education institutions; pilot test events were scheduled to accommodate as wide a range of student needs as possible. Pilot testing was typically supervised by test administrators trained by Evaluation Systems. The pilot tests were administered under testing conditions approximating operational administrations. See Pilot Test Events for further information about pilot testing for specific fields.

Pilot test forms were designed to allow participants to complete the test in a reasonable amount of time, typically one-and-a-half to two-and-a-half hours for a form or set of forms, in order to minimize any effects on the data from participant fatigue. Multiple pilot test forms were prepared for each field to allow for adequate piloting of test items. Pilot test forms with more than one open-response item were counterbalanced (i.e., the order of the items was reversed on every other form), and pilot test forms were typically spiraled for random distribution to participants.
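The counterbalancing and spiraling described above can be illustrated schematically as follows; the form labels, item identifiers, and participant list in this sketch are hypothetical.

```python
# Illustrative sketch of counterbalanced open-response item order and spiraled form
# distribution. All names are hypothetical.
from itertools import cycle

open_response_items = ["OR-1", "OR-2"]
form_a = list(open_response_items)                  # items in original order
form_b = list(reversed(open_response_items))        # counterbalanced (reversed) order

participants = [f"Participant {i}" for i in range(1, 7)]
spiral = cycle([("Form A", form_a), ("Form B", form_b)])

# Handing out forms in a repeating A, B, A, B, ... sequence approximates random assignment.
assignments = {person: next(spiral) for person in participants}
for person, (form_name, items) in assignments.items():
    print(person, form_name, items)
```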

Pilot test responses to the multiple-choice items were scored electronically based on the answer keys, and the following item statistics were generated (see the illustrative sketch following this list):

  • Individual item p-values (percent correct)
  • Item-to-test point-biserial correlation
  • Distribution of participant responses (percent of participants selecting each response option)
  • Mean score by response choice (average score on the multiple-choice set achieved by all participants selecting each response option)
  • Mantel-Haenszel DIF analysis (when the number of participants for the focal and comparison groups [gender and ethnicity] is greater than or equal to 30 for each group)
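
A minimal sketch of how these item statistics might be computed is shown below. It is provided for illustration only and is not drawn from MTEL documentation; the function name, data layout, and option labels are hypothetical, the point-biserial is computed against the total score on the remaining items (one common convention), and the Mantel-Haenszel DIF analysis is not shown.

```python
import numpy as np

def pilot_item_statistics(selections, keys, options=("A", "B", "C", "D")):
    """Illustrative item analysis for a pilot-tested multiple-choice set.

    selections : (participants x items) array of chosen options, e.g. "A"-"D"
    keys       : array of keyed (correct) options, one per item
    Returns per-item summaries: p-value, item-to-test point-biserial
    correlation, response distribution, and mean total score by response choice.
    """
    selections = np.asarray(selections)
    keys = np.asarray(keys)
    scored = (selections == keys).astype(float)   # 0/1 item scores
    totals = scored.sum(axis=1)                   # each participant's total score
    summaries = []
    for j in range(selections.shape[1]):
        item = scored[:, j]
        # p-value: proportion of participants answering the item correctly
        p_value = float(item.mean())
        # Point-biserial: correlation of the 0/1 item score with the total score
        # on the remaining items (the item is removed from the total to avoid
        # inflating the correlation).
        rest = totals - item
        point_biserial = (float(np.corrcoef(item, rest)[0, 1])
                          if item.std() > 0 and rest.std() > 0 else float("nan"))
        # Proportion (percent/100) of participants selecting each option
        distribution = {opt: float((selections[:, j] == opt).mean()) for opt in options}
        # Mean total score of the participants selecting each option
        mean_by_choice = {
            opt: (float(totals[selections[:, j] == opt].mean())
                  if (selections[:, j] == opt).any() else float("nan"))
            for opt in options
        }
        summaries.append({
            "keyed_response": str(keys[j]),
            "p_value": p_value,
            "point_biserial": point_biserial,
            "response_distribution": distribution,
            "mean_score_by_choice": mean_by_choice,
        })
    return summaries
```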

The statistical analyses identified multiple-choice items with one or more of the following characteristics:

  • The percent of the candidates who answered the item correctly is less than 30 (i.e., fewer than 30 percent of candidates selected the response keyed as the correct response) (N ≥ 5)
  • Nonmodal correct response (i.e., the response chosen by the greatest number of candidates is not the response keyed as the correct response) (N ≥ 5)
  • Item-to-test point-biserial correlation coefficient is less than 0.10 (if the percent of candidates who selected the correct response is less than 50) (N ≥ 25)
  • The Mantel-Haenszel analysis indicated that differential item functioning (DIF) was present
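
The screening criteria listed above can be expressed as a simple rule set. The sketch below is illustrative only; it assumes per-item summaries like those produced by the previous sketch and takes the Mantel-Haenszel DIF result as a precomputed flag.

```python
def flag_items(stats, n_participants, dif_flags=None):
    """Apply the screening criteria listed above to per-item pilot statistics.

    stats          : per-item summaries such as those from pilot_item_statistics()
    n_participants : number of pilot participants who took the item set
    dif_flags      : optional per-item booleans from a separate Mantel-Haenszel analysis
    Returns a dict mapping flagged item indices to the reasons for flagging.
    """
    flagged = {}
    for j, s in enumerate(stats):
        reasons = []
        dist = s["response_distribution"]
        modal_option = max(dist, key=dist.get)     # most frequently chosen option
        if n_participants >= 5 and s["p_value"] < 0.30:
            reasons.append("fewer than 30 percent answered correctly")
        if n_participants >= 5 and modal_option != s["keyed_response"]:
            reasons.append("nonmodal correct response")
        if (n_participants >= 25 and s["point_biserial"] < 0.10
                and s["p_value"] < 0.50):
            reasons.append("point-biserial below 0.10 with p-value below 0.50")
        if dif_flags is not None and dif_flags[j]:
            reasons.append("Mantel-Haenszel DIF flag")
        if reasons:
            flagged[j] = reasons
    return flagged
```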

Item data for identified items were reviewed, and when warranted, further reviews were conducted, including

  • confirmation that the wording of the item was the same as the wording approved by the CAC,
  • a check of content and correct answer with documentary sources, and/or
  • review by a content expert.

Pilot test open-response items were scored by Massachusetts educators meeting the eligibility criteria for MTEL operational scorers. Scoring procedures approximated those of operational administrations. For open-response items with 25 or more responses, Evaluation Systems produced statistical descriptions and analyses of item performance, including the following:

  • Mean score on the item
  • Standard error of the mean score
  • Standard deviation of the item scores
  • Percent distribution of scores
  • Analysis of variance (ANOVA) to detect item main-effects differences
  • Analysis of variance (ANOVA) for item-by-participant group interactions (provided that the number of responses for each group is greater than or equal to 25)
  • A test to identify items with mean scores that are statistically significantly different from the others (Tukey HSD analysis)
  • Rate of agreement among scorers
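
The descriptive portion of these analyses can be pictured with the hypothetical sketch below, which assumes each response was scored independently by two scorers on a 1–4 scale. It shows the mean, standard deviation, standard error, percent distribution of scores, and rate of exact agreement between scorers; the ANOVA and Tukey HSD analyses are not shown.

```python
import numpy as np

def open_response_summary(scorer1, scorer2, scale=(1, 2, 3, 4)):
    """Illustrative descriptive summary for one pilot open-response item.

    scorer1, scorer2 : arrays of scores assigned by two independent scorers,
                       in the same response order
    """
    scorer1 = np.asarray(scorer1, dtype=float)
    scorer2 = np.asarray(scorer2, dtype=float)
    scores = np.concatenate([scorer1, scorer2])   # all assigned scores
    n = scores.size
    mean = scores.mean()
    sd = scores.std(ddof=1)                       # sample standard deviation
    se = sd / np.sqrt(n)                          # standard error of the mean
    # Percent of assigned scores at each scale point
    distribution = {point: float((scores == point).mean() * 100) for point in scale}
    # Rate of exact agreement between the two scorers
    exact_agreement = float((scorer1 == scorer2).mean())
    return {
        "n_scores": int(n),
        "mean": float(mean),
        "sd": float(sd),
        "se": float(se),
        "percent_distribution": distribution,
        "exact_agreement_rate": exact_agreement,
    }
```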

In addition, the following qualitative analyses were conducted and reported:

  • Items that elicited a high number of blank, short, incomplete, or low-scoring responses
  • Items that scorers identified as difficult to score
  • Items with a high number of scorer discrepancies
  • Items that participants identified in participant questionnaires as difficult, unfair, or of poor quality

For fields with insufficient candidate populations to allow statistical analyses of the open-response items, an attempt was made to obtain five or more responses to the items for the purpose of conducting a qualitative review. The review of responses was done by Massachusetts educators meeting the eligibility criteria for MTEL operational scorers and focused on determining whether the items appeared to be clear and answerable to pilot test participants.

Multiple-choice and open-response items with the appropriate statistical characteristics, based on the pilot test analyses, were included in the final item bank and were available for inclusion on operational test forms. Items identified for further review may have been deleted or retained, based on the results of the review. See Pilot Test Outcomes for further information about pilot test outcomes for specific fields.

Establishment of Marker Responses for Open-Response Items

Standard 4.8 The test review process should include empirical analyses and/or the use of expert judges to review items and scoring criteria. Standards for Educational and Psychological Testing (AERA, APA & NCME, 2014)

As part of the establishment of scoring materials for the open-response items, members of the CACs were convened to establish marker responses, which are exemplars of the points on the scoring scale. The use of the marker responses in the training of scoring personnel, together with the standardized scoring scale, helps to promote continuity and consistency in scoring over time, and across test forms, test administrations, and scorers. The marker responses help to ensure that scores retain a consistent meaning over time, and that candidates' responses are judged similarly regardless of when they take a test or which test form they take.

A subset of the CAC (generally about 6–8 members) met to review responses to the open-response items, typically responses written by pilot test participants. The purpose of the meeting was to identify a set of responses to represent the points on the scoring scale for use in the training of scorers. Before beginning their task, committee members reviewed test directions, scoring procedures, and the scoring scales for the items. The committee members established the marker responses through a process of discussion and consensus, with the guidance of an Evaluation Systems facilitator. Committee members could select or modify responses from the pilot test, or create responses if needed.

Determination of Qualifying Scores

Standard 5.21 When proposed score interpretations involve one or more cut scores, the rationale and procedures used for establishing cut scores should be documented clearly.

Standard 5.22 When cut scores defining pass-fail or proficiency levels are based on direct judgments about the adequacy of item or test performances, the judgmental process should be designed so that the participants providing the judgments can bring their knowledge and experience to bear in a reasonable way. Standards for Educational and Psychological Testing (AERA, APA & NCME, 2014)

Qualifying Score Conference

MTEL test scores are used to make pass/fail classifications within a licensing context. The passing score for each test (also called a cut score or qualifying score) is set by the Massachusetts Commissioner of Elementary and Secondary Education (the Commissioner), based, in part, on judgments made by a Qualifying Score Panel of Massachusetts educators at a Qualifying Score Conference. The procedures used to establish the qualifying scores are based on a process commonly used for licensing and credentialing tests. Those procedures are described below.

A Qualifying Score Panel of Massachusetts educators was convened for each test field to provide judgments to be used in setting the qualifying score for each test. These panels of up to 20 educators typically included some members from the CAC for the field and, in some cases, BRC members qualified in the field, as well as additional educators meeting the eligibility guidelines. See Establishing Advisory Committees for further information about the Qualifying Score Panels.

An iterative procedure was used in which standard-setting ratings were gathered in two rounds, using procedures commonly referred to as the modified Angoff procedure and the extended Angoff procedure. In the first round, panel members provided item-by-item judgments of the performance of "just acceptably qualified" candidates on the items from the first operational test form. In the second round, panel members reviewed the results of the initial round of ratings, along with candidate performance on the items, and were then given an opportunity to revise their individual round-one item ratings.

Orientation. Panel members were given an orientation that explained the steps of the qualifying score process, the materials they would use, the concept of the "just acceptably qualified candidate," and the judgments about test items that they would be asked to make. Panelists also completed a training exercise, including rating items with a range of item difficulty, to prepare them for the actual rating activity. The role of the Commissioner in setting the passing score was also explained.

Simulated test-taking activity. To familiarize the panel members with the knowledge associated with the test items, each panelist simulated the test-taking experience. Panelists were provided with a copy of the first operational test form and were asked to read and answer the questions on the test without a key to the correct answers. After panelists completed this activity, they were provided with the answer key (i.e., the correct responses to the questions on the test) and were asked to score their own answers.

Round one—item-based judgments: multiple-choice items and short-answer items. The Evaluation Systems facilitator provided training in the next step of the qualifying score process, in which panel members made item-by-item judgments using a modified Angoff procedure. Panel members were asked to make a judgment regarding the performance of "just acceptably qualified candidates" on each test item.

Panelists were provided with the following description of the hypothetical group of "just acceptably qualified candidates" that they were asked to envision in making their qualifying score judgments (with slight variations for certain fields):

Hypothetical "Just Acceptably Qualified Candidates"

A certain amount of subject matter knowledge is required for entry-level teaching in Massachusetts. Many individuals seeking Massachusetts teaching licenses will exceed the subject matter knowledge of the "just acceptably qualified candidate," but the individuals you use as a hypothetical reference group for your judgments should be "just acceptably qualified candidates." These are candidates who are just at the level of subject matter knowledge required to receive a Massachusetts teaching license for entry-level teaching in Massachusetts.

Please recognize that the point you are defining as "just acceptably qualified" is not necessarily at the middle of a continuum of subject matter knowledge. As defined by the Massachusetts Department of Elementary and Secondary Education, a candidate entering teaching in Massachusetts will be:

  • eligible to teach all possible courses a school district in Massachusetts might offer at all grade levels covered by the license;
  • expected to know all subject matter as defined by the test objectives;
  • expected to be able to teach all subject matter as defined by the student curriculum frameworks, if available, for the courses and levels covered by the license;
  • expected to be able to teach students at a level in keeping with the high standards set for Massachusetts public school students embodied in curriculum requirements and the Massachusetts student assessment system (e.g., MCAS, PARCC);
  • highly qualified, per No Child Left Behind legislation; and
  • expected to teach academically advanced students at the highest grade levels covered by the license as well as the least academically proficient students likely to be in those grades.

Each panel member indicated on a rating form the percent of this hypothetical reference group who would provide a correct response for each item. Panelists provided an independent rating for each item by answering, in their professional judgment, the following question:

"Imagine a hypothetical group of individuals who are just at the level of subject matter knowledge required for entry-level teaching in this field in Massachusetts public schools. What percent of this group would answer the item correctly?"

0%–10% = 1
11%–20% = 2
21%–30% = 3
31%–40% = 4
41%–50% = 5
51%–60% = 6
61%–70% = 7
71%–80% = 8
81%–90% = 9
91%–100% = 10

For fields with short-answer items that are scored as correct/incorrect (e.g., some language fields), those items were included in this step of the qualifying score process.
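
The translation from a panelist's percent estimate to the 1–10 rating shown above is a direct band lookup, as the illustrative sketch below shows; the function name is hypothetical and the bands are those listed above.

```python
def angoff_rating(percent_estimate):
    """Convert a panelist's estimate of the percent of "just acceptably
    qualified" candidates who would answer an item correctly into the
    corresponding 1-10 rating band listed above."""
    bands = [(0, 10, 1), (11, 20, 2), (21, 30, 3), (31, 40, 4), (41, 50, 5),
             (51, 60, 6), (61, 70, 7), (71, 80, 8), (81, 90, 9), (91, 100, 10)]
    for low, high, rating in bands:
        if low <= percent_estimate <= high:
            return rating
    raise ValueError("percent estimate must be a whole percent between 0 and 100")

# Example: an estimate that 65 percent of the hypothetical group would answer
# correctly corresponds to a rating of 7.
assert angoff_rating(65) == 7
```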

Round one—item-based judgments: open-response items. Panelists made similar judgments regarding the open-response item(s) on the test form they reviewed, using a procedure known as the extended Angoff procedure. The scoring of open-response items was explained to panelists. The training included a review and discussion of the performance characteristics and scoring scale used by scorers, examples of marker responses used to train scorers, how item scores are combined, and the total number of points available for the open-response section of the test.

Panel members provided an independent rating by answering, in their professional judgment, the following question:

"Imagine a hypothetical individual who is just at the level of subject matter knowledge required for entry-level teaching in this field in Massachusetts public schools. What score represents the level of response that would be achieved by this individual?"

Panel members provided their judgments based on combined item scores (e.g., 8 points for a 4-point item scored by two scorers) and, for some fields, multiple items (e.g., 2–16 points based on 2 items × 2 scorers × 4 points). For the Communication and Literacy Skills test, panelists also made a judgment regarding the set of short-answer items (worth 2 points each) in this round.

Analysis of round one results. After the panelists completed their multiple-choice and open-response item ratings, their rating forms were analyzed. Item Rating Summary Reports were produced for each panelist, containing, for each multiple-choice item and the open-response section: a) the panelist's rating of the item or section, b) the median rating of all panelists who rated the item or section, and c) the frequency distribution of the item or section ratings. Panelists were given an explanation of how to read and interpret the report, including how the ratings would be translated into recommended performance levels for each test.
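
The content of an Item Rating Summary Report can be pictured with the hypothetical sketch below, which computes, for each item, a panelist's own rating, the median rating across all panelists, and the frequency distribution of ratings. It is illustrative only and does not reproduce the actual report format.

```python
import numpy as np

def item_rating_summary(ratings, panelist):
    """Illustrative Item Rating Summary for one panelist.

    ratings  : dict mapping panelist identifier -> list of ratings (one per item)
    panelist : the panelist this report is prepared for
    """
    all_ratings = np.array(list(ratings.values()))   # (n_panelists, n_items)
    report = []
    for j in range(all_ratings.shape[1]):
        item_ratings = all_ratings[:, j]
        values, counts = np.unique(item_ratings, return_counts=True)
        report.append({
            "item": j + 1,
            "own_rating": ratings[panelist][j],
            "median_rating": float(np.median(item_ratings)),
            "frequency_distribution": dict(zip(values.tolist(), counts.tolist())),
        })
    return report
```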

Round two—revisions to item-based judgments. In the second round of judgments, panel members had the opportunity to revise any of their item ratings from round one. In addition to the Item Rating Summary Reports, item difficulty information for multiple-choice items was provided to panel members in the form of candidate performance statistics from the first operational administration period of the new tests. Panelists reviewed the results from the initial round of ratings and candidate performance on the items and had the opportunity to provide a second rating to replace the first rating for any multiple-choice item and the open-response item(s). Changes to ratings were made independently, without discussion with other panelists.

Evaluation form. Following the rating activities, panel members completed an evaluation form that asked them to provide judgments about the Qualifying Score Conference. On a 5-point scale (1 = not at all satisfied/confident, to 5 = very satisfied/confident), panelists were asked to rate the conference training, their confidence in the judgments made, time to complete the work, coordination and logistics of the conference, and their satisfaction with the qualifying score process. They were also provided space to make comments. Results from the evaluation forms were compiled and provided to the Department. See Supplemental Test Development Information for information about judgments for specific test fields.

Analysis of Qualifying Score Judgments

Following the Qualifying Score Conference, Evaluation Systems calculated recommended performance levels for the multiple-choice and open-response sections of each test based on the ratings provided by the Qualifying Score Panel members. These calculations were based on each panelist's final rating on each item (i.e., either the unchanged first-round rating or the second-round rating if it differed from the first-round rating). See Calculation of Recommended Performance Levels for further information regarding the calculation of qualifying score judgments.
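
The sketch below illustrates one common way Angoff-style ratings are aggregated into a recommended multiple-choice performance level: each panelist's final item ratings are converted to expected item scores and summed, and the panelist sums are averaged. It is offered for illustration only; the band-to-proportion conversion shown is an approximate assumption, and the actual MTEL computation is described in Calculation of Recommended Performance Levels.

```python
import numpy as np

# Approximate proportion-correct values for each 1-10 rating band (an
# illustrative assumption, not taken from MTEL documentation).
BAND_MIDPOINTS = {1: 0.05, 2: 0.15, 3: 0.25, 4: 0.35, 5: 0.45,
                  6: 0.55, 7: 0.65, 8: 0.75, 9: 0.85, 10: 0.95}

def recommended_mc_performance_level(final_ratings):
    """Aggregate panelists' final multiple-choice item ratings into a single
    recommended raw-score performance level (illustrative Angoff-style sum).

    final_ratings : list of rating lists, one per panelist, each containing a
                    1-10 rating for every multiple-choice item
    """
    proportions = np.array([[BAND_MIDPOINTS[r] for r in row] for row in final_ratings])
    # Each panelist's implied raw passing score is the sum of expected item
    # scores; the recommended level is the mean of those sums across panelists.
    panelist_levels = proportions.sum(axis=1)
    return float(panelist_levels.mean())
```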

Finalizing Passing Scores

Evaluation Systems provided the Massachusetts Department of Elementary and Secondary Education (the Department) with a Preliminary Pass Rate Analysis for the Commissioner's use in establishing the qualifying score for the tests. The analysis included the following information:

  • The number of candidates per field who tested in the first operational test administration period
  • The multiple-choice passing score at the panel-recommended performance level and at that level plus and minus one and two Standard Errors of Measurement (SEMs)
  • The open-response passing score at the panel-recommended performance level and at that level plus and minus one and two SEMs
  • The percent of candidates at or above the combined passing score for the test when the multiple-choice and open-response item sections are combined
  • The percent of candidates at or above the combined passing score for the test, broken down by gender and ethnicity (Data were provided only for categories with at least 10 candidates.)
  • Interpretive notes for reading the Preliminary Pass Rate Analysis, including definitions of terms and interpretive cautions
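
The pass-rate portion of such an analysis can be pictured with the hypothetical sketch below, which computes the passing score at the recommended performance level and at plus and minus one and two SEMs, along with the percent of candidates at or above each cut. The SEM is taken as an input here, and the function and data layout are illustrative assumptions rather than the MTEL reporting format.

```python
import numpy as np

def preliminary_pass_rates(candidate_scores, recommended_level, sem):
    """Illustrative pass-rate table at the recommended performance level and at
    plus and minus one and two standard errors of measurement (SEMs).

    candidate_scores  : scores from the first operational test administration period
    recommended_level : panel-recommended performance level, on the same score scale
    sem               : standard error of measurement for that score
    """
    scores = np.asarray(candidate_scores, dtype=float)
    table = {}
    for k in (-2, -1, 0, 1, 2):
        cut = recommended_level + k * sem
        table[f"recommended level {k:+d} SEM"] = {
            "passing_score": float(cut),
            "percent_at_or_above": float((scores >= cut).mean() * 100),
        }
    return table
```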

Along with the Preliminary Pass Rate Analysis, a Qualifying Score Conference Report was prepared for the Commissioner describing the participants, process, and results of the qualifying score activities, considerations related to measurement error, and use of the passing scores in scoring and reporting the MTEL. The Commissioner set the passing score for each test based upon the panel-based recommendations and other input. The passing score was applied to the first and subsequent operational administration periods for each test.

