AI vs Human Perception: A Deep Dive into GPT-4 and Human Occupational Evaluations

The study compares GPT-4's evaluations of occupational prestige and social value in the UK with those of human respondents, finding strong alignment in rankings but notable overestimation by the AI. While GPT-4 is efficient for capturing broad trends, it struggles with nuanced perceptions, especially those of minority groups.


CoE-EDP, VisionRI | Updated: 09-09-2024 14:10 IST | Created: 09-09-2024 14:10 IST

Research by Paweł Gmyrek, Christoph Lutz, and Gemma Newlands, published by the International Labour Organization (ILO) in 2024, presents a comprehensive study of how occupational evaluations by a large language model (LLM), GPT-4, differ from those of human respondents in the UK. The researchers, from the ILO, BI Norwegian Business School, and the Oxford Internet Institute, examine how the model assesses the prestige and social value of 580 occupations relative to human judgments. The study sheds light on the strengths and limitations of using AI to capture societal opinions on labor markets and highlights the implications of integrating LLMs into social science research and occupational evaluations.

Close Alignment in Occupational Rankings

The study found a strong correlation between GPT-4’s evaluations and those of human respondents, especially in the relative ranking of occupations. GPT-4 approximated human judgments closely for high-prestige professions such as cardiologists, judges, and general practitioners, which were ranked similarly by the AI and human participants. The model also performed well in assessing the social value of professions like ambulance paramedics and nurses, which human respondents rated highly because of their critical roles in society. However, GPT-4 tended to overestimate both the occupational prestige and the social value of many professions in absolute terms: where human respondents gave more moderate evaluations, GPT-4 consistently assigned higher scores across a wide range of occupations. This tendency was particularly evident for newer digital professions such as data miners and chatbot trainers, to which the AI assigned markedly higher values than humans did.
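The pattern described above, close agreement on rankings alongside inflated absolute scores, can be illustrated with a short sketch. The code below is not the study's analysis pipeline: the occupation list and scores are hypothetical placeholders, and it simply shows how rank agreement (Spearman) can be high while a positive mean gap signals the kind of absolute overestimation the researchers observed.

```python
# Minimal sketch: comparing hypothetical GPT-4 and human prestige scores.
# The occupations and numbers below are illustrative placeholders,
# not values from the ILO study.
from scipy.stats import spearmanr, pearsonr

occupations = ["Cardiologist", "Judge", "General practitioner",
               "Ambulance paramedic", "Data miner", "Chatbot trainer"]
human_scores = [92, 90, 85, 80, 55, 50]   # hypothetical mean human ratings (0-100)
gpt4_scores  = [95, 93, 90, 88, 75, 70]   # hypothetical GPT-4 ratings (0-100)

rho, _ = spearmanr(human_scores, gpt4_scores)   # agreement on relative ranking
r, _ = pearsonr(human_scores, gpt4_scores)      # agreement on score levels
bias = sum(g - h for g, h in zip(gpt4_scores, human_scores)) / len(occupations)

print(f"Spearman rho (ranking): {rho:.2f}")
print(f"Pearson r (levels):     {r:.2f}")
print(f"Mean GPT-4 minus human: {bias:+.1f} points")  # positive = AI scores higher
```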

Struggles with Stigmatized and Illicit Jobs

One of the paper's key findings is that GPT-4, while efficient and broadly accurate in capturing general trends, struggled with more nuanced evaluations. In particular, the model had difficulty assessing occupations that are stigmatized or not traditionally well regarded, including illicit jobs. For example, GPT-4 assigned a flat score of zero to occupations such as online scammers and human traffickers, whereas human respondents, despite the negative nature of these roles, still showed a small degree of variation in their assessments. This simplistic treatment demonstrates the model's limits in capturing the complexity of societal perceptions.

Discrepancies in Demographic Groups

The paper also delves into the demographic factors that influence occupational evaluations. GPT-4’s scores were found to align most closely with the evaluations of white, middle-aged respondents, particularly men and women over the age of 25. However, there were notable discrepancies when it came to minority groups, particularly non-white respondents and younger age groups. These groups showed different occupational evaluations, reflecting their unique life experiences and societal positions, which GPT-4 struggled to capture accurately. The study attempted to adjust GPT-4’s prompts to better reflect these underrepresented groups, but these modifications did not result in significant improvements in the AI’s performance. This suggests that GPT-4, like other LLMs, may not yet be able to fully account for the diversity of perspectives present in a multicultural society like the UK.
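A hedged sketch of what such prompt adjustments might look like is shown below. It assumes the OpenAI Python client, and the persona wording, rating scale, and prompt text are illustrative assumptions rather than the prompts actually used in the study.

```python
# Minimal sketch of persona-conditioned prompting, assuming the OpenAI
# Python client. The persona descriptions and prompt wording are
# illustrative assumptions, not the prompts used in the ILO study.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment


def rate_prestige(occupation: str, persona: str) -> str:
    """Ask the model to rate occupational prestige while adopting a persona."""
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[
            {"role": "system",
             "content": (f"You are answering a UK social survey as {persona}. "
                         "Rate occupational prestige on a 0-100 scale and reply "
                         "with the number only.")},
            {"role": "user", "content": f"Occupation: {occupation}"},
        ],
    )
    return response.choices[0].message.content


# Compare a baseline answer with a persona-adjusted one.
print(rate_prestige("Nurse", "a UK resident"))
print(rate_prestige("Nurse", "a UK resident aged 18-24 from an ethnic minority background"))
```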

Advantages of AI for Large-Scale Evaluations

Despite these limitations, the paper highlights the potential advantages of using AI models like GPT-4 for large-scale occupational evaluations. GPT-4’s ability to generate consistent and efficient data makes it a valuable tool for researchers looking to gather broad insights into societal trends. Its speed and cost-effectiveness are particularly beneficial when compared to traditional human surveys, which can be time-consuming and expensive to conduct. Furthermore, GPT-4’s capacity to generate detailed written explanations for its evaluations offers an additional layer of insight that human respondents are unlikely to provide consistently across a large number of occupations.
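As an illustration of this kind of large-scale, explanation-generating run, the sketch below loops over a handful of occupations and asks for a score plus a one-sentence rationale. It assumes the OpenAI Python client and a JSON reply format; the prompt wording, schema, occupation list, and output file are assumptions, not the study's actual setup.

```python
# Minimal sketch of a batch evaluation run, assuming the OpenAI Python
# client. Prompt wording, JSON schema, and occupations are illustrative
# assumptions; the actual study covered 580 occupations.
import csv
import json
from openai import OpenAI

client = OpenAI()
occupations = ["Ambulance paramedic", "Data miner", "Chatbot trainer"]

with open("gpt4_evaluations.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["occupation", "social_value", "explanation"])
    for occupation in occupations:
        reply = client.chat.completions.create(
            model="gpt-4",
            temperature=0,
            messages=[{
                "role": "user",
                "content": (f"Rate the social value of the occupation '{occupation}' "
                            "on a 0-100 scale and briefly justify the score. Reply as "
                            'JSON: {"score": <int>, "explanation": "<one sentence>"}'),
            }],
        )
        data = json.loads(reply.choices[0].message.content)
        writer.writerow([occupation, data["score"], data["explanation"]])
```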

Caution in Using AI for Policy Decisions

However, the study emphasizes the need for caution in relying too heavily on AI models for occupational evaluations, particularly when it comes to policy decisions. While GPT-4 can provide useful insights into general trends, it may not be able to capture the full range of human experiences and perceptions, especially for minority groups. The paper warns that over-reliance on AI-generated data could lead to biased outcomes, particularly if the limitations of these models are not fully understood and addressed. For policymakers and researchers, the integration of AI tools like GPT-4 into social science research should be approached carefully, with an understanding of the model’s strengths and weaknesses.

The study underscores both the potential and the risks of using GPT-4 and other LLMs for occupational evaluations and social research. While the AI model demonstrates a high degree of accuracy in capturing general trends, its overestimation of occupational prestige and social value, as well as its difficulty in reflecting the perspectives of minority groups, present challenges. As LLMs continue to develop, their role in social science research is likely to grow, but for now, they remain a complementary tool rather than a replacement for traditional human surveys. The findings of this study provide valuable insights for researchers and policymakers looking to integrate AI into their work on labor markets and occupational evaluations.

FIRST PUBLISHED IN: Devdiscourse