Research Strand 9: Generative AI for Questionnaire Design

Strand Leader: Professor Patrick Sturgis, London School of Economics and Political Science

It is widely agreed that Generative Artificial Intelligence (GenAI) will transform conventional practice across the service industries in the near future, and survey research is unlikely to be an exception. Understanding how to capitalise on this potential is therefore a key priority. The proposed research will assess the utility of GenAI – specifically, Large Language Models (LLMs) – for improving the quality and cost-efficiency of questionnaire design, evaluation and testing (QDET).

The research questions are:

  • RQ1 How effective are LLMs at a) applying question design and evaluation frameworks, and b) generating and analysing cognitive interview data, to identify problems with draft survey questions?
  • RQ2 What are the optimal ways of fine-tuning and prompting LLMs to match or exceed human performance on QDET tasks?
  • RQ3 Which procedures should be followed to ensure the use of LLMs in QDET tasks optimises survey quality, while complying with legal and ethical frameworks?

The project comprises two linked components: (1) evaluating draft questions using rule-based QDET frameworks, and (2) generating and analysing cognitive interview data.

In (1), we will select frameworks, determine the desired outputs, and procure suitable training and validation data. For this, we will leverage existing corpora of question evaluation and cognitive interview data held by our project partner Verian, and in publicly available archives. Zero-shot interactions with LLMs, using the selected frameworks and prompt engineering, will be assessed before fine-tuning with question evaluation data is implemented, and we will compare the performance improvement achieved by each strategy.

In (2), we will develop ‘persona profiles’ for silicon sampling (matching the characteristics of human participants), an interview protocol for generating synthetic cognitive testing data, and an analytic framework for extracting insights from responses. Commercial LLMs will be evaluated and compared on how well they match, or improve on, human performance of these tasks, following best-practice procedures for generating and assessing synthetic responses to open-ended questions. We will use standard metrics of coder reliability to quantify the LLMs’ performance in (1), and in (2) we will assess the LLM-generated data relative to human-generated data in terms of accuracy, and the speed and cost of the analysis.
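
As an illustration of how the zero-shot stage of component (1) and the coder-reliability assessment could fit together, the minimal sketch below (Python) prompts an LLM to code a draft question against a simplified, hypothetical problem checklist and then quantifies agreement with a human expert coder using Cohen's kappa. The checklist items, prompt wording, and call_llm() stub are illustrative assumptions, not the project's selected frameworks or tooling.

    # Illustrative sketch only: the checklist, prompt wording, and call_llm()
    # stub are hypothetical placeholders, not the project's actual pipeline.
    from sklearn.metrics import cohen_kappa_score

    # A simplified, hypothetical checklist of question problems in the spirit
    # of rule-based QDET appraisal frameworks.
    CHECKLIST = [
        "double_barrelled",   # asks about two things at once
        "vague_quantifier",   # 'often', 'regularly' without definition
        "technical_jargon",   # terms respondents may not understand
    ]

    def build_prompt(question: str) -> str:
        """Zero-shot prompt asking the LLM to code a draft question against the checklist."""
        codes = ", ".join(CHECKLIST)
        return (
            "You are a survey methodologist. Assess the draft question below "
            f"against these problem codes: {codes}. "
            "Return the single most applicable code, or 'none'.\n\n"
            f"Draft question: {question}"
        )

    def call_llm(prompt: str) -> str:
        """Placeholder for a provider-specific call to a commercial LLM API."""
        raise NotImplementedError

    def agreement_with_human(questions: list[str], human_codes: list[str]) -> float:
        """Collect LLM codes for each draft question and quantify agreement with a
        human expert coder using Cohen's kappa, a standard coder-reliability metric."""
        llm_codes = [call_llm(build_prompt(q)).strip().lower() for q in questions]
        return cohen_kappa_score(human_codes, llm_codes)
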
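For component (2), the sketch below shows one way a persona profile for silicon sampling might be turned into a system prompt and paired with a standard think-aloud probe to generate a synthetic cognitive-interview response for later analysis. The persona fields, probe wording, and call_llm() stub are again hypothetical placeholders rather than the project's interview protocol.

    # Illustrative sketch only: persona fields and probe wording are assumptions.
    from dataclasses import dataclass

    @dataclass
    class Persona:
        """A 'persona profile' matching characteristics of a human participant."""
        age: int
        gender: str
        education: str
        region: str

    def persona_system_prompt(p: Persona) -> str:
        """Instruct the LLM to answer as the persona throughout the synthetic interview."""
        return (
            f"Answer as a {p.age}-year-old {p.gender} living in {p.region} "
            f"whose highest qualification is {p.education}. Respond in the first "
            "person, in everyday language, and do not mention that you are an AI model."
        )

    def think_aloud_probe(draft_question: str) -> str:
        """A generic cognitive-interviewing probe applied to the draft survey question."""
        return (
            f"Please read this survey question: '{draft_question}'. "
            "In your own words, what do you think it is asking? "
            "How would you go about choosing your answer?"
        )

    def call_llm(system_prompt: str, user_prompt: str) -> str:
        """Placeholder for a provider-specific call to a commercial LLM API."""
        raise NotImplementedError

    def synthetic_interview(p: Persona, draft_question: str) -> str:
        """Generate one synthetic cognitive-interview response for a persona."""
        return call_llm(persona_system_prompt(p), think_aloud_probe(draft_question))
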