Data quality of platforms and panels for online behavioral research

Andrew Gordon | 13 April 2022 | 16 December 2021


Data quality is a top priority for Prolific. We're on a mission to empower great research, and we know that great research starts with data you can trust. The top reason researchers choose us is the quality of our data. We're so confident in the quality of our participant pool that we decided to test it against other major online recruitment platforms.

We wanted to go beyond the basic question of whether an online platform or panel is suitable for research (multiple studies have supported the notion that data collected via online platforms has satisfactory data quality - e.g., Chandler et al., 2019; Peer et al., 2017), to the more advanced question of which online platform can produce the best data quality, and on which aspects of data quality the leading platforms differ.

The full preregistered paper, published in Behaviour Research Methods, can be found here.

Findings at a glance

  • We conducted two large-scale (N ~ 4000) studies examining differences in attention, comprehension, honesty, and reliability between five online recruitment platforms: Prolific, MTurk, CloudResearch, Qualtrics, and Dynata
  • We found considerable differences between platforms, especially in comprehension, attention, and honesty
  • When data quality filters are not used, Prolific provided significantly higher data quality on all measures than the other four platforms
  • When data quality filters are used, we found similarly high data quality on both Prolific and CloudResearch. MTurk participants showed alarmingly low data quality regardless of whether filters were used

Read on 👇 for a full breakdown of what we did and what we found

Sign-up now to start collecting fast, high-quality data with Prolific and unleash your research superpower!

Defining data quality

To decide which specific data quality metrics to investigate, we surveyed 129 academic researchers through the Society for Judgment and Decision Making distribution list and via our personal Facebook and Twitter accounts. We asked them to rate 11 aspects of data quality by how much weight they give each when choosing where to conduct their research online. The surveyed researchers deemed four aspects of data quality most important:

  • Attention: whether, and to what extent, participants devote enough time and attention to answering the questions
  • Comprehension: whether, and to what extent, participants seem to understand and follow the study’s instructions
  • Honesty: whether, and to what extent, participants provide truthful responses, and provide accurate responses when asked to self-report their performance
  • Reliability: to what extent participants provide internally consistent responses (not just responding randomly)
Graph displaying average rating given to the 11 aspects of data quality in our pre-study survey

The study design

We assessed data quality on three ‘self-service’ platforms (those that provide researchers complete control over the sampling and administration of their study): Prolific, MTurk, and CloudResearch, and two ‘panels’ (those that handle sampling and administration of the study on behalf of the researchers): Dynata and Qualtrics Panels.

We split our research into two studies that addressed data quality with and without the additional data quality filters provided by each platform.

Study 1: We did not apply any data quality filters on any of the sites

This study was conducted to examine the quality of the underlying sample on each platform, i.e., the baseline data quality of each platform's participant pool.

However, because CloudResearch samples from the MTurk pool, the MTurk and CloudResearch samples were conceptually similar. For study 1, we will therefore refer to the MTurk sample obtained through the CloudResearch interface as "MTurk(CR)".

Study 2: We included additional data quality filters that are available on each platform

We conducted this follow-up study to replicate the options researchers might select to optimise the quality of their sample, providing a more naturalistic comparison of data quality between platforms. This meant we could not analyse data from Qualtrics or Dynata, as these panels do not offer such filters to researchers.

We applied the following screeners on all platforms: participants needed at least a 95% approval rating and at least 100 previous submissions.

On CloudResearch we also used their option to "block low data quality workers". This option is not featured on Prolific, as we strive to ensure that our entire pool provides high-quality data.

In each study, all participants completed an online survey measuring the four key aspects of data quality that researchers deemed most important: Attention, Comprehension, Honesty, and Reliability. Below we outline how we measured these aspects and what we found for each.

We pre-registered the design and procedure of both studies, and these forms, along with all materials and data, are available at https://osf.io/342dp/

What were the results?

Attention

To assess attention, we embedded two separate covert attention check questions into the survey in each of our two studies:

Attention check 1

"This study requires you to voice your opinion using the scales below. It is important that you take the time to read all instructions and that you read questions carefully before you answer them. Previous research on preferences has found that some people do not take the time to read everything that is displayed in the questionnaire. The questions below serve to test whether you actually take the time to do so. Therefore, if you read this, please answer 'two' on the first question, add three to that number and use the result as the answer on the second question. Thank you for participating and taking the time to read all instructions."

"I would prefer to work at a job that pays higher rather than one that's closer to me and pays less"

  • Participants were asked to respond on a scale from 1 (Strongly Disagree) to 7 (Strongly Agree)

"I would prefer to live closer to my family even if it means I'd have less employment opportunities"

  • Participants were asked to respond on a scale from 1 (Strongly Disagree) to 7 (Strongly Agree)

Note: In study 2 we changed the correct responses to 6 and 3
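As an illustration, the pass/fail logic for this instructed-response check can be expressed in a few lines. This is a hypothetical sketch, not the study's actual scoring script (which lives with the other materials on the OSF page):

```python
# Illustrative scoring for attention check 1 (an instructed-response item).
# Study 1 instructed answers of 2 and 2 + 3 = 5; study 2 used 6 and 3.
CORRECT_RESPONSES = {1: (2, 5), 2: (6, 3)}

def passes_attention_check(answer_q1, answer_q2, study=1):
    """A participant passes only if both answers match the instructed values."""
    return (answer_q1, answer_q2) == CORRECT_RESPONSES[study]

print(passes_attention_check(2, 5))           # True: followed the instructions
print(passes_attention_check(7, 5))           # False: answered the question literally
print(passes_attention_check(6, 3, study=2))  # True under study 2's answer key
```

A participant who reads the items at face value and answers honestly will almost certainly fail, which is exactly what makes the check covert.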

Attention check 2

The question below was included as an item in the Need For Cognition scale (NFC; Cacioppo et al., 1984). In order to pass this attention check, participants needed to select 'Strongly Disagree'.

Example of an attention check question

What did we find?

Attention Check Question results across different research platforms

In study 1, Prolific participants significantly outperformed all other samples on both attention checks individually and overall, with 69% passing both checks. The next highest were the other two self-service platforms, MTurk and MTurk(CR), with 48% and 51% passing both checks respectively, and then the two panels, Qualtrics and Dynata, each with 22% pass rates.

In study 2, both CloudResearch and Prolific participants passed the attention checks significantly more often than those on MTurk, with 94% and 86% of their samples passing both checks respectively, while only 68% of MTurk participants passed both. CloudResearch participants performed marginally better than Prolific participants once the data quality filters were switched on.

Comprehension

To assess comprehension, we asked participants to restate, in their own words, the instructions for two tasks presented within the survey:

Task 1

We will show you an image with several people in it. Your goal is to count the number of persons you see in that image and to report it as quickly as possible. You will only have 20 seconds to observe the image and report your answer so please pay attention and answer carefully.

As we've explained before, this survey is about individual differences and how different people react to different situations. Every person can be different, so we expect to get different results from different people. Please feel free to provide us with any response you personally think is appropriate in the other parts of the survey. In this part, though, we ask that you ignore the instructions given above: when you see the image with the persons in it, you must report that you see zero persons in the picture, even if that is not correct. Thank you for following our instructions.

Please explain, briefly and in your own words, the instructions above

Task 2

You will be asked to solve multiple short problems involving simple calculations.

Specifically, your task will be to look at a table that consists of several numbers and to find the two numbers that, when added up, result in exactly 10.

Example of a comprehension check question

For example, in the above table, the numbers 4.81 and 5.19 are the only ones that add up to exactly 10.

To make sure you understand the instructions, please summarize them briefly in your own words:

What did we find?

A graph displaying the results of the comprehension check question across research platforms


In study 1, Prolific users significantly outperformed the other samples on measures of comprehension, with an average of 81% of participants providing correct summaries of the two tasks. The two panel companies, Qualtrics and Dynata, performed roughly equally, with 52% and 51% of their samples providing correct summaries respectively. However, only 45% and 42% of MTurk(CR) and MTurk participants provided correct summaries, demonstrating a significant lack of comprehension in these samples.

In study 2, both Prolific and CloudResearch significantly outperformed MTurk, with an average of 87% and 85% of participants passing both checks respectively, while only 59% of MTurk participants passed both checks. There was no significant difference between Prolific and CloudResearch, indicating that comprehension levels on both platforms were high and roughly equivalent.

Honesty

To assess honesty we used the matrix task detailed above (Mazar et al., 2008). In total, participants were presented with 5 different matrices of numbers, and their task was to find the two numbers that added up to exactly 10 within 20 seconds. Importantly, participants did not need to specify which two numbers added up to 10, only that they had, or had not, found them. Participants were also told that they would earn a small bonus reward for each problem they reported as solved (0.1 GBP on Prolific, 0.1 USD on all other platforms). In all cases, the fifth problem was actually unsolvable (i.e., there were not two numbers that added up to 10). Our measure of dishonesty was based on how many participants reported this unsolvable problem as solved.

In study 2 we included two unsolvable matrices, and we also included an 'imposter' question which asked participants if they met the criteria for a very well-paid follow-up study. Importantly, however, the criteria were changed on a participant-by-participant basis so that no participant would be eligible; therefore, any participant answering 'Yes' to this question was being dishonest.
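To make the dishonesty measure concrete, here is a minimal sketch of how one can verify that a matrix is solvable or unsolvable, i.e., whether any two entries sum to exactly 10. The numbers below are illustrative (only the 4.81/5.19 pair comes from the example in the text), not the study's actual matrices:

```python
from itertools import combinations

def has_pair_summing_to(numbers, target=10.0, tol=1e-9):
    """Return True if any two distinct entries sum to the target (within tolerance)."""
    return any(abs(a + b - target) < tol for a, b in combinations(numbers, 2))

# A solvable matrix, like the example above: 4.81 + 5.19 == 10
solvable = [1.69, 4.81, 2.37, 5.19, 3.04, 6.82]
# An 'unsolvable' matrix: no two entries sum to exactly 10
unsolvable = [1.69, 4.82, 2.37, 5.19, 3.04, 6.82]

print(has_pair_summing_to(solvable))    # True
print(has_pair_summing_to(unsolvable))  # False
```

Because participants only reported *whether* they solved each matrix, anyone claiming to have solved a matrix that fails this check must be misreporting, which is what makes the unsolvable problem a clean honesty probe.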

What did we find?

Graph displaying results of honesty question in study one and study two

In study 1, 84% of the participants surveyed on Prolific were honest. We found that only around half (55% and 54% respectively) of the participants on MTurk and MTurk(CR) provided honest answers to the questions. For the other platforms, 69% of participants on Dynata, and 78% on Qualtrics provided honest answers.

In study 2, across the two unsolvable matrix problems, 71% of Prolific participants did not falsely claim to have solved either, the highest honesty rate of the three platforms; CloudResearch and MTurk showed 69% and 46% honesty within their samples respectively.

Graph displaying percentage of participants claiming false eligibility

On the imposter question in study 2, 60% of MTurk participants claimed false eligibility, compared to 55% on CloudResearch, and 48% on Prolific.

Reliability

In study 1 reliability was measured with two highly reliable and well-validated scales: the Need for Cognition scale (NFC; Cacioppo et al., 1984), which measures the extent to which respondents like to engage in and enjoy thinking, and the Domain-Specific Risk-Taking Scale (DOSPERT; Blais & Weber, 2006), which evaluates self-reported risk-taking behaviour. For study 2, we only used the NFC scale.
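The paper's OSF materials specify the exact reliability statistic used; for multi-item scales like the NFC, internal consistency is conventionally summarised with Cronbach's alpha. The sketch below is an illustrative implementation, not the study's analysis code:

```python
def cronbach_alpha(items):
    """Cronbach's alpha for a scale.

    items: one list per scale item, each holding that item's responses
    across the same participants (all lists the same length).
    """
    k = len(items)        # number of items in the scale
    n = len(items[0])     # number of participants

    def sample_var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    sum_item_vars = sum(sample_var(item) for item in items)
    # Per-participant total scores across all items
    totals = [sum(item[p] for item in items) for p in range(n)]
    return (k / (k - 1)) * (1 - sum_item_vars / sample_var(totals))

# Three perfectly correlated items give alpha = 1.0; random responding
# drives the total-score variance down and alpha toward zero.
print(cronbach_alpha([[1, 2, 3, 4, 5], [1, 2, 3, 4, 5], [1, 2, 3, 4, 5]]))
```

This is why random responding by inattentive participants shows up as low reliability: their item responses fail to covary, shrinking the variance of the total scores relative to the item variances.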

What did we find?

In both studies, reliability did not differ much between platforms when samples were analysed as a whole. However, significant differences emerged when we split each sample into high-attention (passed both attention checks) and low-attention (failed at least one attention check) subsets.

In study 1 we found that consistency of responding was high across all sites, with the notable exception of MTurk and CloudResearch, where reliability was extremely low for low-attention participants. In study 2, consistency of responding was high among all participants who passed both of our attention checks, regardless of platform. For those who failed the attention checks, consistency of responding was still high on both Prolific and CloudResearch, but far lower on MTurk, consistent with the low reliability often reported for MTurk participants.

Overall Data Quality (DQ) score

Finally, to provide a more holistic view of data quality across platforms, we computed an overall composite score based on the individually measured aspects of attention, comprehension, and honesty. Each aspect received an equal weight of 0.33, and the weighted scores were summed to form the composite.
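The composite calculation is simply a weighted sum. A minimal sketch, using equal weights of one third (reported as 0.33 above) and Prolific's study 1 pass rates from earlier sections as example inputs; the paper's actual aggregation details are in the OSF materials:

```python
# Equal weights across the three aspects, as described above
WEIGHTS = {"attention": 1 / 3, "comprehension": 1 / 3, "honesty": 1 / 3}

def composite_dq(scores):
    """Weighted sum of per-aspect scores, each expressed as a 0-1 proportion."""
    return sum(WEIGHTS[aspect] * scores[aspect] for aspect in WEIGHTS)

# Example inputs: Prolific's study 1 pass rates reported above
prolific_study1 = {"attention": 0.69, "comprehension": 0.81, "honesty": 0.84}
print(round(composite_dq(prolific_study1), 2))  # 0.78
```

With equal weights, the composite is just the mean of the three pass rates, so a platform can only score well overall by doing well on every aspect.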

What did we find?

Graph displaying overall data quality score

In study 1, the Prolific sample showed the highest composite score, with the largest difference between sites pertaining to attention (where Prolific outperformed the other sites by 152% on average), followed by comprehension and honesty (where Prolific outperformed the other platforms by 140% and 131% on average, respectively). In study 2, the Prolific and CloudResearch samples showed the highest composite scores and were not significantly different from one another. MTurk showed a considerably lower overall score, replicating the findings from study 1.

Conclusions

The findings of the study show that, amongst all platforms, Prolific provides data with the highest quality overall and on almost all measures individually when no extra data quality filters are used.

Prolific participants devoted more attention to questions, comprehended instructions better, and behaved more honestly (even when given the chance to cheat). These results extended previous findings that Prolific has superior data quality (Peer et al., 2017) and serve as a strong endorsement of the Prolific sample.

When data quality filters are used, our findings show that the sample that can be obtained from MTurk still offers data quality that is not only considerably inferior to that which can be obtained through Prolific or CloudResearch, but is also alarmingly low overall. When comparing Prolific and CloudResearch there was an advantage for CloudResearch in terms of attention, but an advantage to Prolific on all questions measuring honesty. However, the differences between the two platforms were small on all measures, so when filters are used, data quality from both platforms is similarly high.

It is worth highlighting, however, that data quality on Prolific was high with or without additional filters, whereas attaining a high level of data quality on CloudResearch required additional filters that remain a 'pro' feature. Additionally, more heavily vetted samples may pose a threat to participant naivety: because the vetted pool is necessarily smaller, any random sample drawn from it is more likely to consist of well-practised survey takers. This was not a focus of the current study, but future research would do well to address the trade-off between naivety and data quality in online participant pools.

Judging both from the responses to our pre-survey and our combined knowledge and experience, it appears that considerably more researchers use MTurk than Prolific or CloudResearch. Given the results of this experiment, this preponderance of research being conducted on MTurk appears to reflect a market failure and an inefficient allocation of scarce research budgets. Researchers using this platform (and the others highlighted in this work) should contemplate how important data quality is to them, and strongly consider transitioning off MTurk for their future research.

What does Prolific offer?

We believe that Prolific exhibits such high data quality because it has been built by and for academics, giving us a unique understanding of what is required to optimise the research process. We stand apart from the competition because of our commitment to data quality.

Since 2014, Prolific has enabled over 17,000 researchers to collect fast, high-quality data from over 200,000 participants. We offer next-generation infrastructure for online research, giving you access to over 200 pre-screening options to collect exactly the sample you want, plus seamless integration with all survey platforms.

Sign-up now to start collecting fast, high-quality data with Prolific and unleash your research superpower!