The first standardized tests
that we know of
were administered in China
over 2,000 years ago
during the Han dynasty.
Chinese officials used them to determine
aptitude for various government posts.
The subject matter included philosophy,
farming,
and even military tactics.
Standardized tests continued to be used
around the world for the next two millennia,
and today, they're used for everything
from evaluating stair climbs
for firefighters in France
to language examinations
for diplomats in Canada
to assessing students in schools.
Some standardized tests measure scores
only in relation to the results
of other test takers.
Others measure performance by how well
test takers meet predetermined criteria.
So the stair climb for the firefighter
could be measured by comparing
the time of the climb
to that of all other firefighters.
This might be expressed in what
many call a bell curve.
Or it could be evaluated with reference
to set criteria,
such as carrying a certain amount
of weight a certain distance
up a certain number of stairs.
Similarly, the diplomat might be measured
against other test-taking diplomats,
or against a set of fixed criteria,
which demonstrate different levels
of language proficiency.
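The two approaches just described can be sketched in a few lines of Python. This is only an illustration: the firefighter names, the climb times, and the 70-second pass mark are all invented for the example and do not come from any real test.

```python
# Hypothetical stair-climb times in seconds (lower is better).
# Names and values are invented purely for illustration.
climb_times = {"Ana": 62.0, "Ben": 75.5, "Chloe": 58.3, "Dev": 81.2, "Elle": 69.9}

# Assumed pass mark for the criteria-based version; not a real standard.
MAX_SECONDS = 70.0


def rank_against_others(times):
    """Compare each firefighter with the others (1 = fastest climb)."""
    ordered = sorted(times, key=times.get)
    return {name: position for position, name in enumerate(ordered, start=1)}


def check_against_criterion(times, max_seconds):
    """Mark each firefighter pass/fail against a fixed time limit."""
    return {name: seconds <= max_seconds for name, seconds in times.items()}


ranks = rank_against_others(climb_times)
passed = check_against_criterion(climb_times, MAX_SECONDS)

for name in climb_times:
    print(name, "rank:", ranks[name], "meets criterion:", passed[name])
```

In the first function a result only has meaning relative to the group, so a different group would produce a different rank for the same climb. In the second, the same climb always gets the same verdict, whoever else takes the test.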
And all of these results can be expressed
using something called a percentile.
If a diplomat is in the 70th percentile,
70% of test takers scored below her.
If she is in the 30th percentile,
70% of test takers scored above her.
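As a rough sketch, a percentile can be computed by counting how many test takers scored lower. The scores below are made up, and the function uses one simple definition (the share of scores strictly below a given score); real testing programs may handle ties differently.

```python
def percentile_rank(score, all_scores):
    """Percentage of test takers who scored strictly below `score`."""
    below = sum(1 for s in all_scores if s < score)
    return 100.0 * below / len(all_scores)


# Hypothetical language-exam scores for ten diplomats.
scores = [52, 58, 61, 64, 67, 70, 73, 77, 85, 91]

high = percentile_rank(77, scores)  # 7 of the 10 scores are below 77
low = percentile_rank(64, scores)   # 3 of the 10 scores are below 64

print(high)  # 70.0
print(low)   # 30.0
```

With these numbers, the diplomat who scored 77 is in the 70th percentile, and the one who scored 64 is in the 30th.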
Although standardized tests
are sometimes controversial,
they're simply a tool.
As a thought experiment,
think of a standardized test as a ruler.
A ruler's usefulness
depends on two things.
First, the job we ask it to do.
Our ruler can't measure
the temperature outside
or how loud someone is singing.
Second, the ruler's usefulness depends
on its design.
Say you need to measure the circumference
of an orange.
Our ruler measures length,
which is the right quantity,
but it hasn't been designed with the
flexibility required for the task at hand.
So, if standardized tests are given
the wrong job,
or aren't designed properly,
they may end up measuring
the wrong things.
In the case of schools,
students with test anxiety may have
trouble performing their best
on a standardized test,
not because they don't know the answers,
but because they're feeling too nervous
to share what they've learned.
Students with reading challenges
may struggle with the wording
of a math problem,
so their test results may reflect
their literacy
more than their numeracy skills.
And students who are confused by examples
on tests that contain
unfamiliar cultural references
may do poorly,
telling us more about the test takers'
cultural familiarity
than their academic learning.
In these cases, the tests may need
to be designed differently.
Standardized tests can also
have a hard time
measuring abstract
characteristics or skills,
such as creativity, critical thinking,
and collaboration.
If we design a test poorly,
or ask it to do the wrong job,
or a job it's not very good at,
the results may not be reliable or valid.
Reliability and validity
are two critical ideas
for understanding standardized tests.
To understand the difference between them,
we can use the metaphor
of two broken thermometers.
An unreliable thermometer
gives you a different reading
each time you take your temperature,
while a reliable but invalid thermometer
is consistently ten degrees too hot.
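The thermometer metaphor can be put into numbers with a small Python sketch. Every reading here is invented, the true temperature of 37.0 is an assumption, and the unreliable thermometer is deliberately set up so that its readings happen to average out to the true value. Spread stands in for (un)reliability and average error stands in for (in)validity.

```python
from statistics import mean, pstdev

# Assumed true body temperature in degrees Celsius (for illustration only).
TRUE_TEMP = 37.0

# Hypothetical readings from the two broken thermometers.
unreliable = [35.1, 39.4, 36.2, 38.8, 35.5]            # jumps around
reliable_but_invalid = [47.0, 47.1, 46.9, 47.0, 47.0]  # steady, but about ten degrees too hot


def spread(readings):
    """How much the readings disagree with each other (low = reliable)."""
    return pstdev(readings)


def bias(readings):
    """How far the average reading is from the truth (near zero = valid on average)."""
    return mean(readings) - TRUE_TEMP


for label, readings in [("unreliable", unreliable),
                        ("reliable but invalid", reliable_but_invalid)]:
    print(label, "spread:", round(spread(readings), 2), "bias:", round(bias(readings), 2))
```

The first thermometer shows a large spread, so no single reading can be trusted. The second shows almost no spread but a bias of roughly ten degrees, so it is consistently wrong.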
Validity also depends on accurate
interpretations of results.
If people say results of a test
mean something they don't,
that test may have a validity problem.
Just as we wouldn't expect a ruler
to tell us how much an elephant weighs,
or what it had for breakfast,
we can't expect standardized tests alone
to reliably tell us how smart someone is,
how diplomats will handle
a tough situation,
or how brave a firefighter
might turn out to be.
So standardized tests may help us learn
a little about a lot of people
in a short time,
but they usually can't tell us a lot
about a single person.
Many social scientists worry about
test scores resulting in sweeping
and often negative changes
for test takers,
sometimes with long-term
life consequences.
We can't blame the tests, though.
It's up to us to use the right tests
for the right jobs,
and to interpret results appropriately.