Design and Validation of a Framework for the Creation of User Experience Questionnaires

and transfer the subjective impressions of those users into a numerical scale value that describes how the corresponding UX quality of the product is perceived inside the target group.
This ability to measure the user experience of a product quantitatively is quite important for several typical questions in product evaluations [9]. First, it allows a direct comparison of different products or different design variants of a single product concerning their UX. Second, it can be used to continuously monitor the UX quality of a product over time. Third, it allows setting objective goals concerning UX by defining a threshold for the mean values of the scales of the questionnaire, which should be reached over time. Fourth, the comparison of the evaluation results of a product with a benchmark allows deciding if the UX quality of the product fulfils general user expectations [10].
User experience is a complex product characteristic [11] that results from the perception of many distinct quality aspects. These include classical task-related UX qualities, for example efficiency of use, ease of learning, controllability, error tolerance [12], intuitive use [13], visual complexity [14] and usefulness [15], as well as non-task-related UX aspects such as fun of use [16], identity [17], aesthetics of the visual design [18], novelty of the product concept [19] and content quality [20].
However, not all of these UX aspects are of relevance for every single product [21] [22]. The importance of such UX aspects can vary widely between products supporting different tasks and use cases.
For a simple self-service application, e.g. creating a leave request or an application to change personal data of an employee, it is crucial that it can be used intuitively, i.e. without asking another person for help or reading a lengthy manual. Such applications are used quite infrequently, and we cannot expect that the user will remember how to use the application between two uses. Because of the rare usage frequency, efficiency does not play a role here. An unnecessary click does not hurt much if an application is used only once a month or even less frequently.
For a business application that is used repeatedly during a typical work day, for example an application to create sales orders or service requests, things are completely different. Intuitive use is nice to have, but not crucial. Typically, a learning period is required for such applications to understand the use case and the mapping of real-world processes and tasks to the elements and flows of the application. Therefore, some learning is acceptable and expected by users. In addition, due to the heavy usage during a typical work day, efficiency is key for these types of applications, i.e. an unnecessary click really hurts when it needs to be repeated 50 times a day.
The huge number of existing UX aspects and the different levels of importance for different types of products explain the high number of different UX questionnaires that are available, for example SUS [23], SUMI [24], UEQ [19], VISAWI [25], meCUE [26] or ISOMETRICS [27], just to name a few. Each of these questionnaires measures a different set of UX aspects through its scales. For example, SUS only measures overall usability, and the items in this questionnaire mainly address Learnability and Efficiency of use. VISAWI measures the visual appeal of a product by 4 subscales (Simplicity, Diversity, Colorfulness, Craftsmanship). The UEQ measures 6 distinct UX aspects (Attractiveness, Efficiency, Perspicuity, Dependability, Stimulation, Novelty). ISOMETRICS contains the quality aspects described in ISO 9241-210 as scales. Thus, what is actually measured differs heavily between different UX questionnaires.
Of course, none of these questionnaires contains all UX aspects discussed in research literature, since this would increase the length of the questionnaire above any reasonable limit.
For a UX researcher evaluating a concrete product, this can cause some problems. If he or she has narrowed down which UX aspects are important for the users of the product and should thus be measured in the evaluation, it can easily happen that no single UX questionnaire contains exactly those UX aspects as scales. Sometimes it is possible to combine several UX questionnaires to cover all relevant aspects, but this is usually not optimal either, since different questionnaires often have different item and answer formats. This makes it difficult for participants to fill out the questionnaires and makes it harder to compare the scale means obtained from different questionnaires.
In this paper, we try to address this dilemma by introducing a modular framework that allows the researcher to select the relevant UX aspects out of a larger catalogue of UX scales. All UX scales have a common item and answer format and can thus easily be combined to create a UX questionnaire fitting to the research question behind a product evaluation.

II. Previous Work in the Field
The UEQ+ framework is based on earlier work, which we describe briefly here to make the connection transparent.
In [21] [22] it is investigated how important different UX aspects (for example, Efficiency, Stimulation, Trust, Aesthetics) are for certain types of products (for example, social networks, word processing, programming tools, web sites, messengers). The study uses 16 UX aspects extracted from research papers and from an extensive study of the scales used in existing UX questionnaires. Participants of the studies rated the importance of these UX aspects for 15 product categories. Both studies found some clear dependencies between the different product types and the importance ratings for the UX aspects.
Based on the results, it is possible to provide a recommendation on which UX aspects are important for a product category and should therefore be measured in UX evaluations of products of this type [7]. The UX aspects investigated in these studies are good candidates for a framework that should be able to help synthesize UX questionnaires.
Follow-up research [28] shows that quite similar importance ratings are obtained in the context of another culture (Indonesia). The importance of a UX aspect for a type of product thus seems to be mainly a result of the characteristics of the product and not so much of cultural aspects.
The User Experience Questionnaire (UEQ) is an established and widely used UX questionnaire. It already contains the 6 UX scales Attractiveness, Efficiency, Perspicuity, Dependability, Stimulation and Novelty [19]. Each scale is represented by 4 items (except Attractiveness, which contains 6 items), and each item consists of two terms with opposite meanings. Thus, the UEQ is a semantic differential with a 7-point Likert scale for the answers. This simple item format seems well suited to defining additional scales, and some authors have already used it to define extension scales for special product types. In [29], a scale to measure Trust was defined. This UX aspect is, for example, highly relevant for online banking applications or web shops.
For household appliances there are also quite specialized UX requirements that strongly influence the overall impression of a product. In [30], two scales for the sounds caused by the operation of a household appliance and for the haptic feeling were developed.
Due to the item format and the fact that a number of scales in a common format already exist, it was decided to base the framework on the UEQ. To make this connection evident, the name UEQ+ was chosen for the framework.

III. Changes in the Item Format
Due to the requirement that it should be possible to combine scales depending on the examined product type, some changes concerning the item format are necessary. We assume that the researcher can freely decide which combination of the available scales he or she wants to use. In addition, the order in which the scales appear in the final questionnaire is up to the researcher.
In the UEQ, the order of the items is randomized. This is also true for the polarity of the items, i.e. whether the positive term appears on the left or on the right. Some studies (currently unpublished) showed that the polarity of the items does not influence the UEQ scale means (see also [30]), so we decided to use a common scheme with the negative term on the left and the positive term on the right for the UEQ+ scales.
Since it should be possible to combine scales in an arbitrary order, and some of the terms are quite similar or even identical in the different scales, it was necessary to group all items of a scale together and set some context for the correct interpretation of the terms. This is done by introducing a short sentence that is shown above the items of a scale and that sets the context for a common interpretation of the items.
Thus, a scale in the UEQ+ has the following format (as an example we present the scale Efficiency):
To achieve my goals, I consider the product as: [4 items, each a pair of opposite terms with the positive term, e.g. "organized", on the right]
I consider the product property described by these terms as: [importance rating]
A scale therefore consists of the statement that connects the items of the scale, then the 4 items with the negative term on the left and the positive term on the right, and a final rating concerning the importance of the scale for the overall UX impression of the product. We describe the role of this importance rating in detail at a later point.
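The scale format described above maps naturally onto a small data structure. The following Python sketch is illustrative only and is not part of the UEQ+ material: the names `Scale`, `CATALOGUE`, `build_questionnaire` and `clicks_required` are hypothetical, the text rendering of the 7 answer points is an assumption, and the two example scales are copied from Appendix I.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scale:
    """One scale: a context sentence plus (negative, positive) term pairs."""
    name: str
    stem: str
    items: tuple


# Two scales copied from Appendix I; a full catalogue would hold all scales.
CATALOGUE = {
    "Acoustics": Scale(
        "Acoustics",
        "The noise during use of the product is:",
        (("loud", "quiet"), ("dissonant", "melodic"),
         ("booming", "dampened"), ("piercing", "soft")),
    ),
    "Haptics": Scale(
        "Haptics",
        "In my opinion, the surface of the product is:",
        (("unstable", "stable"),
         ("unpleasant to the touch", "pleasant to the touch"),
         ("rough", "smooth"), ("slippery", "slip-resistant")),
    ),
}

# Wording of the importance rating as given in the scale format.
IMPORTANCE_PROMPT = "I consider the product property described by these terms as"


def build_questionnaire(scale_names):
    """Return the questionnaire as text lines, scales in the requested order."""
    lines = []
    for name in scale_names:
        scale = CATALOGUE[name]
        lines.append(scale.stem)  # context sentence shown above the items
        for negative, positive in scale.items:
            # Negative term left, positive term right, 7 answer points between.
            lines.append(f"  {negative} o o o o o o o {positive}")
        lines.append(IMPORTANCE_PROMPT)  # one importance rating per scale
    return lines


def clicks_required(scale_names):
    """Item ratings plus one importance rating per selected scale."""
    return sum(len(CATALOGUE[name].items) + 1 for name in scale_names)


if __name__ == "__main__":
    selection = ["Haptics", "Acoustics"]  # order is up to the researcher
    print("\n".join(build_questionnaire(selection)))
    print("clicks:", clicks_required(selection))
```

Because every scale shares the same item and answer format, combining scales reduces to concatenating their blocks in the order chosen by the researcher.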

IV. Creation of Additional Scales
The UEQ already contains 6 suitable scales that were simply adopted into the UEQ+ (for the scale Attractiveness, two of the 6 items were removed to have 4 items for all scales). The same is true for the already available extensions for Trust, Haptics and Acoustics.
The list of UX aspects from [21] [22] was reviewed and the following UX aspects were selected for scale creation: Aesthetics, Adaptability, Usefulness, Intuitive Use, Value, and Content Quality.
Two experts then constructed for every UX aspect a set of items in the UEQ format which describe the aspect semantically. Item suggestions were jointly discussed and consolidated.
In an empirical study, 192 subjects (students that participated on a voluntary basis) rated several products with the corresponding lists of candidate items on a 7-point Likert scale. The average age of the participants (119 male, 73 female) was 30.42 years.
The resulting data were then analysed by principal component analysis. The analysis was done by the function principal of the R package psych [32]. It was first checked if a one-dimensional solution fits well to the data (which should be the case due to the fact that all items in a candidate set describe the same UX aspect).
We show as an example the candidate set and analysis for the UX aspect Beauty (listed as Aesthetics among the selected UX aspects). A description of the data analysis for all scales can be found in [33].
The set of candidate items was given as: ugly / beautiful, lacking style / stylish, unappealing / appealing, ugly in colour / beautiful in colour, inharmoniously / harmoniously, unpleasant / pleasant, not artistically / artistically, thoughtlessly / thought out.
The scree plot of the principal component analysis (see Fig. 1) clearly shows that a one-dimensional solution fits the data well. Proportion of variance explained is 0.64. The fit based upon off-diagonal values is 0.99 (values > 0.95 indicate a good fit). The corresponding loadings of the items on the factor are shown in Table I. Thus, the 4 items with the highest loadings (highlighted in bold in Table I) were selected to form the new scale Beauty.
If the one-dimensional solution fits the data well, we choose as in this example the 4 items with the highest loading on the factor as representatives for the new UEQ+ scale. This was the case for all UX aspects with the exception of Content Quality (see [33]).
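The selection step can be illustrated with a small, self-contained Python sketch. It is not the authors' analysis, which used the function principal of the R package psych; instead it approximates the first principal component of the item correlation matrix by power iteration, reports the proportion of variance explained by that component, and keeps the items with the highest loadings. All function names are hypothetical, and the toy ratings and item labels are made up purely for illustration.

```python
import math


def correlation_matrix(rows):
    """Pearson correlations between the columns (items) of a participants x items table."""
    n, k = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(k)]
    centred = [[r[j] - means[j] for j in range(k)] for r in rows]
    norms = [math.sqrt(sum(c[j] ** 2 for c in centred)) for j in range(k)]
    return [[sum(c[a] * c[b] for c in centred) / (norms[a] * norms[b])
             for b in range(k)] for a in range(k)]


def first_component(corr, iterations=200):
    """Dominant eigenvalue and eigenvector of the correlation matrix (power iteration)."""
    k = len(corr)
    vec = [1.0 / math.sqrt(k)] * k
    for _ in range(iterations):
        nxt = [sum(corr[a][b] * vec[b] for b in range(k)) for a in range(k)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    # Rayleigh quotient of the normalised vector gives the eigenvalue.
    eigenvalue = sum(vec[a] * sum(corr[a][b] * vec[b] for b in range(k))
                     for a in range(k))
    return eigenvalue, vec


def select_items(rows, labels, n_keep=4):
    """Proportion of variance of the first component and the n_keep top-loading items."""
    eigenvalue, vec = first_component(correlation_matrix(rows))
    proportion = eigenvalue / len(labels)
    loadings = [v * math.sqrt(eigenvalue) for v in vec]
    ranked = sorted(zip(labels, loadings), key=lambda p: abs(p[1]), reverse=True)
    return proportion, ranked[:n_keep]


# Made-up ratings (6 participants x 5 candidate items), for illustration only.
TOY_RATINGS = [
    [1, 2, 1, 2, 4],
    [2, 2, 3, 2, 5],
    [4, 3, 4, 4, 3],
    [5, 5, 4, 5, 4],
    [6, 6, 7, 6, 5],
    [7, 7, 6, 7, 4],
]
LABELS = ["item_a", "item_b", "item_c", "item_d", "item_e"]

if __name__ == "__main__":
    proportion, kept = select_items(TOY_RATINGS, LABELS)
    print(f"variance explained by first component: {proportion:.2f}")
    for label, loading in kept:
        print(f"{label}: loading {loading:.2f}")
```

In the toy data the fifth item is almost uncorrelated with the other four, so it receives the lowest loading and is dropped. A real analysis should also inspect the scree plot and fit indices, as described above.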
For Content Quality, a two-dimensional solution fits the data better (see Fig. 2), i.e. the exploratory principal component analysis detects two different dimensions. The items loading highly on the first factor express trust in the correctness of the provided information. The items loading on the second factor semantically cover how up to date the information is and its quality. Since both factors could be interpreted, we decided to split this UX aspect into two scales, which we named Trustworthiness of Content and Quality of Content.

V. Scales Included in the UEQ+ Framework
The UEQ+ framework currently offers the following UX scales; we show here only the scale names and a short description of the semantic meaning of each scale. The items per scale are listed in Appendix I.
• Attractiveness: Overall impression of the product. Do users like or dislike the product?
• Efficiency: Users have the impression that they can complete their tasks without unnecessary effort.
• Perspicuity: Subjective impression that it is easy to get familiar with the product. It is easy to learn how to use the product.
• Dependability: The user has the impression that he or she controls the interaction.
• Stimulation: Feeling that it is exciting and motivating to use the product.
• Novelty: Feeling that the product is innovative and creative. The product catches the interest of the user.
• Trust: Subjective impression that the data entered into the product are in safe hands and are not used to the detriment of the user.
• Aesthetics: Impression that the product looks nice and appealing.
• Adaptability: Subjective impression that the product can be easily adapted to personal preferences or personal working styles.
• Usefulness: Subjective impression that using the product brings advantages, saves time or improves personal productivity.
• Intuitive Use: Subjective impression that the product can be used immediately without any training, instructions or help from other persons.
• Value: Subjective impression that the product is of high quality and professionally designed.
• Trustworthiness of Content: Subjective impression that the information provided by the product is reliable and accurate.
• Quality of Content: Subjective impression that the information provided by the product is up to date, well prepared and interesting.
• Haptics: Subjective feelings resulting from touching the product.
• Acoustics: Subjective impression concerning the sound or operating noise of the product.

VI. Importance Rating and KPI
In some use cases it is beneficial to measure not only the means for the different scales, but to provide also a single number (a key performance indicator, or KPI) that summarizes the single scales and can be interpreted as a measure for the overall impression concerning UX.
An extension to calculate such a KPI for the UEQ is described in [34]. The same principle is used for the calculation of a KPI for the UEQ+. The basic idea is to calculate per participant the weighted sum of the scale means with the relative importance ratings. The KPI is then the average of these values of all participants. For the exact formula of the calculation please refer to [34].
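The calculation described above can be sketched in a few lines of Python. The exact formula is given in [34]; the sketch below assumes that the "relative importance" of a scale for a participant is that participant's importance rating for the scale divided by the sum of his or her importance ratings over all scales. The function names and the example values are hypothetical.

```python
from statistics import mean


def participant_score(scale_means, importances):
    """Weighted sum of one participant's scale means.

    Assumption: the weight of a scale is its importance rating divided by
    the sum of this participant's importance ratings.
    """
    total = sum(importances.values())
    return sum(scale_means[s] * importances[s] / total for s in scale_means)


def kpi(participants):
    """Average of the per-participant weighted scores."""
    return mean(participant_score(m, w) for m, w in participants)


if __name__ == "__main__":
    # Made-up values: (scale means, importance ratings) per participant.
    participants = [
        ({"Efficiency": 2.0, "Trust": 1.0}, {"Efficiency": 3, "Trust": 1}),
        ({"Efficiency": 0.5, "Trust": 1.5}, {"Efficiency": 2, "Trust": 2}),
    ]
    print(kpi(participants))  # (1.75 + 1.0) / 2 = 1.375
```

Because the weights of each participant sum to 1, the KPI stays in the same value range as the scale means, whatever set of scales is selected.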

VII. First Validation Studies
To evaluate the scale quality, the three product categories Web Shops, Video Platforms and Programming Environments were selected. Two products popular in Germany were chosen per product category (Web Shops: Otto.de, Zalando.de; Video Platforms: Netflix, Amazon Prime; Programming Environments: Eclipse, Visual Studio).
For each product category, a specialized UX questionnaire containing the scales that seemed to be most important for products of this category (see [22] for details) was constructed. Participants were recruited via e-mail campaigns and links posted on web sites. Each participant could choose to rate one product that he or she used regularly from one of the product categories, so we have different numbers of ratings for the different products (see Table III). Please note that 4 items and the importance of the scale must be rated for every scale. Thus, for 8 scales this requires 40 clicks. In addition, the overall satisfaction must be rated, and two clicks are required to state age and gender.
Filling out the corresponding questionnaires does not seem to require much effort from the participants. They spent around 4 minutes (= 240,000 milliseconds) answering the questions, and selected answers do not seem to have been changed often afterwards. This indicates that the terms used are not problematic or difficult to understand.
Tables IV, V and VI show, for each product category and evaluated product, the scale means, standard deviations and Cronbach's alpha coefficients. The scale means (see Table IV) are, with the exception of Trust, lower for Otto.de than for Zalando.de. That there is no difference for Trust is plausible, since both are established shops with a long history. The other scales thus allow differentiating between the two products. Except for Trustworthiness of Content (see Table V), the ratings are higher for Netflix than for Amazon Prime. This exception is not unexpected, since the content sources of both platforms are quite similar concerning trustworthiness. Again, the other scales differentiate between the two products. The Visual Studio ratings (see Table VI) are much higher than the ratings for Eclipse on all scales. It must, however, be noted that we had only a small number of participants for programming environments, so these results need to be interpreted with care.
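For readers who want to reproduce the three statistics reported per scale, the following Python sketch shows one common way to compute them. It rests on several assumptions that the text does not state: answers are coded from -3 to +3 as is usual for the UEQ, the scale score of a participant is the mean of his or her 4 item ratings, the standard deviation is taken over these scale scores, and sample variances are used for Cronbach's alpha. The function names and ratings are hypothetical.

```python
from statistics import mean, stdev, variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a participants x items table of ratings."""
    k = len(rows[0])
    item_variances = [variance([r[j] for r in rows]) for j in range(k)]
    total_variance = variance([sum(r) for r in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


def scale_summary(rows):
    """Scale mean, standard deviation of the scale scores, and Cronbach's alpha."""
    scores = [mean(r) for r in rows]  # one scale score per participant
    return mean(scores), stdev(scores), cronbach_alpha(rows)


if __name__ == "__main__":
    # Made-up ratings of 5 participants on the 4 items of one scale (-3..+3).
    ratings = [
        [2, 3, 2, 2],
        [1, 1, 0, 1],
        [3, 2, 3, 3],
        [-1, 0, -1, 0],
        [2, 2, 1, 2],
    ]
    scale_mean, scale_sd, alpha = scale_summary(ratings)
    print(f"mean={scale_mean:.2f} sd={scale_sd:.2f} alpha={alpha:.2f}")
```

Alpha approaches 1 when the items of a scale are answered consistently, which is the pattern expected if all 4 items describe the same UX aspect.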
The observed ratings for the importance of the scales confirm that the participants considered the selected scales important for the evaluated products. Detailed values of the importance ratings and some additional information concerning the scale means can be found in [33].
As described above, it is possible to calculate a KPI using the scale means and the importance ratings of the scales. This KPI is interpreted as an indicator for the overall satisfaction concerning the UX of the product. To verify this assumption, each online questionnaire contains, as its last point, an overall satisfaction item that begins "Overall, concerning the user friendliness of <Product> I am". Table VII shows the correlation of the ratings of this item with the calculated KPI. Correlations between the satisfaction ratings and the calculated KPI are quite high. Thus, our interpretation of the KPI seems to be valid. In addition, since the correlation seems to be quite stable over different products and combinations of scales (each product category was evaluated with different sets of UEQ+ scales), it may be possible to develop a benchmark for the KPI that can be used independently of the scales selected for an evaluation.
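The validation check described here amounts to correlating two values per participant: the weighted score that enters the KPI and the overall satisfaction rating. The text does not state which correlation coefficient was used; the Python sketch below assumes Pearson's r. The function name and the numbers are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long value lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


if __name__ == "__main__":
    # Made-up values: weighted score and satisfaction rating per participant.
    weighted_scores = [1.75, 1.0, 2.4, -0.3, 0.8]
    satisfaction = [6, 5, 7, 3, 4]
    print(f"r = {pearson(weighted_scores, satisfaction):.2f}")
```

A high positive value supports reading the KPI as an indicator of overall satisfaction; with small samples, such as those for the programming environments, the estimate is of course unstable.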

VIII. Advantages and Disadvantages of a Modular Construction of UX Questionnaires
The big advantage of the UEQ+ is that it allows researchers to create UX questionnaires perfectly adapted to the research question, i.e. such a questionnaire contains exactly the scales that need to be measured. In addition, all scales follow a uniform item format, which makes it easy for the participants to answer the items.
But such a modular approach is not without its shortcomings. There are some disadvantages compared to using a standard questionnaire like the UEQ out of the box.
Obviously, the effort to set up the questionnaire is higher. An application of the UEQ+ requires that the researcher have a clear picture of the UX aspects that are relevant for the product and should therefore be measured. There are some recommendations available that show how important different UX aspects are for different types of products (see [22] and [7]). In addition, the UEQ+ handbook (can be downloaded from www.ueqplus.ueq-research.org) contains some detailed suggestions concerning the most relevant UEQ+ scales for several typical product categories. But of course, not all products will fall into one of the product categories described in these papers, and it must be checked whether exceptions apply to a specific product.
Most standard questionnaires offer some tools for data analysis. Thus, it is sufficient to collect the data and enter it into the tool, and many (though not all) important analyses are done automatically. We also offer a data analysis tool (can be downloaded free of charge from www.ueqplus.ueq-research.org), but since the scale structure of the resulting questionnaire is not fixed, this tool only provides limited support.
In addition, interpretation of the results is a bit harder in the UEQ+ than in standard questionnaires. What does a scale value of 1.3 for a scale mean? Is this a good, medium or bad value compared to other products? Standard questionnaires, for example the UEQ itself [19] or the SUS [23] or VISAWI [25], offer large benchmark data sets that are based on evaluation results for larger sets of different products. Thus, a simple comparison of the result obtained in an evaluation to the results in the benchmark data set offers some insights concerning the question of how good or bad the impression of users towards the product is compared to other available products.
For the scales from the UEQ, such a benchmark is available; for the newly added scales this is not the case at the moment. For some frequently used scales this situation may change, but some of the scales are obviously only relevant for special types of products, so it may take a long time until a benchmark of the same quality as the UEQ benchmark is available for all scales of the UEQ+.

IX. When to use UEQ+?
Given the remarks concerning the advantages and disadvantages of a modular questionnaire, it is possible to give some recommendations.
If you are setting out to evaluate a single product and your main research question is to get an idea about the UX quality of this product, you should use the UEQ. Even if some of the scales do not perfectly match your product or if some scales that you think are important are missing, the availability of the UEQ benchmark and the ease of use of the available material, like the data analysis tool, would clearly speak for using the UEQ.
If you are planning to evaluate the same product multiple times, for example to get an insight if the product improves over time, and if the UEQ scales do not capture most of the UX aspects you consider relevant, then opting for your own special questionnaire built with the UEQ+ is the better choice. In this scenario, the lack of a benchmark is not a big issue, since you are mainly interested in comparing multiple measurements of the same product over time. Thus, capturing the UX quality in an optimized form is more important here.
If you want to set up a UX measurement as part of your quality process for a larger suite of similar products (in the sense that the same UX aspects apply to all of them) and if the scales of the original UEQ do not fit your needs well, then it is also recommended to set up your own questionnaire using the UEQ+. In this case the additional effort required is minor, since you do this only once and reuse it in a large number of concrete evaluations. In addition, the lack of a benchmark is not so important, since over time you will generate your own data set of evaluations that will then help to interpret the results obtained for a single product, i.e. in such a scenario you will quickly generate enough data yourself.

X. Conclusions and Further Work
We described the development of a modular framework for the creation of UX questionnaires. This framework allows the researcher to select the UX aspects that are relevant for a certain product from a list of existing UX scales. Thus, a customized questionnaire containing exactly those UX scales that are important for the users of the product can be created.
Currently, the UEQ+ framework contains 16 scales. Of course, they do not cover the entire concept of UX. Other scales may be required for some products, and new use cases and product types entering the market in the future will create the need for different, not-yet-considered UX scales. Thus, a framework like the UEQ+ is always a work in progress and at no point in time will it be truly finished. We will try to provide some additional scales in the near future and hope that other researchers will (as they did already by constructing some of the extension scales for the UEQ) help to provide new scales, which we can then integrate into the UEQ+ framework.
Another important area of future work is the improvement of the existing benchmarks. This simply requires time to collect sufficiently large sets of data.
Six of the UEQ+ scales are identical to the original UEQ scales with respect to their items. However, the item format is slightly changed. Items of a scale are grouped in the UEQ+ and the positive term is always on the right (in the original UEQ, items appear in random order and polarity). In addition, a statement has been added that describes the common meaning of all items in a scale. It has not yet been fully investigated whether these changes have an impact on the results, i.e. whether the scale means obtained from the UEQ+ scales are fully comparable to the scale means of the corresponding UEQ scales. We expect only minor deviations, but this must of course be evaluated in further studies.
Currently, the items for the extension scales of the UEQ are available only in German and English. The six scales taken over directly from the UEQ are available in more than 20 languages (see, for example, [35] [36] for the description of the Spanish and Portuguese language versions). Of course, we hope to provide some more translations in the future.

Appendix I
In the following we present the complete list of scales and items available in the UEQ+ framework.

Attractiveness
In my opinion, the product is generally:

Trustworthiness of Content
In my opinion, the information and data provided by the product are:
• useless / useful
• implausible / plausible
• untrustworthy / trustworthy
• inaccurate / accurate

Quality of Content
In my opinion, the information and data provided by the product are:
• obsolete / up-to-date
• not interesting / interesting
• poorly prepared / well prepared
• incomprehensible / comprehensible

Acoustics
The noise during use of the product is:
• loud / quiet
• dissonant / melodic
• booming / dampened
• piercing / soft

Haptics
In my opinion, the surface of the product is:
• unstable / stable
• unpleasant to the touch / pleasant to the touch
• rough / smooth
• slippery / slip-resistant