“AI passes US Medical Licensing Exam.” “ChatGPT passes law school exams despite ‘mediocre’ performance.” “Would ChatGPT get a Wharton MBA?”
Headlines like these have recently touted (and often exaggerated) the successes of ChatGPT, an AI tool capable of writing complex text responses to human prompts. These successes follow a long tradition of comparing AI’s abilities to those of human experts, such as Deep Blue’s chess victory over Garry Kasparov in 1997, IBM Watson’s “Jeopardy!” victory over Ken Jennings and Brad Rutter in 2011, and AlphaGo’s victory in Go over Lee Sedol in 2016.
The implicit subtext of these latest headlines is more disturbing: AI is coming for your job. It’s as smart as your doctor, your lawyer, and the consultant you’ve hired. It portends an imminent and pervasive disruption of our lives.
But excitement aside, do these human-versus-AI comparisons tell us anything practically useful? How should we actually use an AI that passes the US medical licensing exam? Can it reliably and safely take a patient’s medical history? What about providing a second opinion on a diagnosis? Questions like these cannot be answered by human-like performance on a licensing exam.
The problem is that most people have little AI literacy: an understanding of when and how to use AI tools effectively. What we need is a clear, straightforward, general-purpose framework for assessing the strengths and weaknesses of these tools, one that everyone can use. Only then can the public make informed decisions about incorporating them into our daily lives.
To meet this need, my research group turned to an old idea from education: Bloom’s Taxonomy. First published in 1956 and later revised in 2001, Bloom’s Taxonomy is a hierarchy describing levels of thinking in which higher levels represent more complex thought. Its six levels are: 1) Remember – recall key facts, 2) Understand – explain concepts, 3) Apply – use information in new situations, 4) Analyze – draw connections between ideas, 5) Evaluate – critique or justify a decision or opinion, and 6) Create – produce an original work.
These six levels are intuitive, even for a non-expert, yet specific enough to support meaningful assessments. Moreover, Bloom’s Taxonomy is not tied to any particular technology; it applies to cognition broadly. We can use it to evaluate the strengths and limitations of ChatGPT or of other AI tools that handle images, generate audio, or control drones.
My research group began evaluating ChatGPT through the lens of Bloom’s Taxonomy by prompting it to respond to variations of a question, each targeting a different level of cognition.
For example, we asked the AI: “Suppose demand for COVID vaccines this winter is expected to be 1 million plus or minus 300,000 doses. How many doses should we stockpile to meet 95% of demand?”, an apply-level task. Then we modified the question, asking it to “discuss the pros and cons of ordering 1.8 million vaccines,” an evaluate-level task. We compared the quality of the two responses and repeated this exercise for all six levels of the taxonomy.
Our preliminary results are instructive. ChatGPT generally performs well on remember, understand, and apply tasks but struggles with more complex analyze and evaluate tasks. For the first prompt, ChatGPT responded well, applying and explaining a formula to suggest a reasonable stockpile (though it made a small arithmetic error along the way).
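The article does not reproduce the formula ChatGPT gave, but here is a minimal sketch of the kind of calculation the prompt calls for, assuming (my reading, not stated in the prompt) that demand is normally distributed with a mean of 1,000,000 doses and that the “plus or minus 300,000” denotes one standard deviation. Meeting demand 95% of the time then means stocking the 95th percentile of the demand distribution:

```python
from statistics import NormalDist

# Assumed interpretation (not stated in the article): demand is normal
# with mean 1,000,000 doses and standard deviation 300,000 doses.
MEAN_DEMAND = 1_000_000
STD_DEMAND = 300_000
SERVICE_LEVEL = 0.95  # cover demand in 95% of scenarios

# The stockpile that covers demand with 95% probability is the
# 95th percentile of the demand distribution: mu + z_0.95 * sigma.
z = NormalDist().inv_cdf(SERVICE_LEVEL)  # ~1.645
stockpile = MEAN_DEMAND + z * STD_DEMAND

print(f"z_0.95 = {z:.3f}")
print(f"Suggested stockpile: {stockpile:,.0f} doses")  # ~1,493,456
```

Note that the answer hinges on how the “plus or minus 300,000” is read: interpreting it instead as a 95% range (so sigma is roughly 153,000) would cut the suggested stockpile to roughly 1.25 million doses. That ambiguity is precisely the sort of judgment call the apply-level prompt leaves open.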
For the second prompt, however, ChatGPT offered only an unconvincing, generic discussion of the risks of ordering too much or too little vaccine. It made no quantitative assessment of those risks, did not account for the logistical challenges of cold storage for such a massive quantity, and did not flag the possible emergence of a vaccine-resistant variant.
We have seen similar behavior across many different prompts at these taxonomy levels. Bloom’s Taxonomy thus lets us make more precise assessments of AI technology than raw human-versus-AI comparisons do.
As for our doctor, lawyer, and consultant, Bloom’s Taxonomy also offers a more nuanced view of how AI may someday reshape, rather than replace, these professions. Although AI may excel at remember and understand tasks, few people consult their doctor to tally all possible symptoms of an illness, ask their lawyer to recite case law verbatim, or hire a consultant to explain Porter’s Five Forces framework.
Rather, we turn to experts for higher-order cognitive tasks. We value our physician’s clinical judgment in weighing the benefits and risks of a treatment plan, our lawyer’s ability to marshal precedent and advocate on our behalf, and our consultant’s ability to identify an out-of-the-box solution no one else has thought of. These skills correspond to analyze, evaluate, and create tasks, levels of cognition where AI technology currently falls short.
Viewed through Bloom’s Taxonomy, effective collaboration between humans and AI will largely mean delegating lower-level cognitive tasks so that we can focus our energy on more complex ones. Rather than dwelling on whether AI can compete with a human expert, we should ask how well AI’s capabilities can be used to help advance human critical thinking, judgment, and creativity.
Of course, Bloom’s Taxonomy has its own limitations. Many complex tasks involve multiple levels of the taxonomy, which frustrates attempts at neat categorization. And Bloom’s Taxonomy does not directly address issues of bias or racism, a major concern in large-scale applications of artificial intelligence. While imperfect, though, it is still useful. It is simple enough for everyone to understand, general enough to apply to a wide range of AI tools, and structured enough to ensure that a consistent, comprehensive set of questions is asked about those tools.
Much as the rise of social media and fake news demands that we develop better media literacy, tools like ChatGPT demand that we develop our AI literacy. Bloom’s Taxonomy offers a way to think about what AI can and cannot do as this kind of technology becomes embedded in more parts of our lives.
Vishal Gupta is an associate professor of data sciences and operations at the USC Marshall School of Business and holds a courtesy appointment in the Department of Industrial and Systems Engineering.