Quality Assurance in Big Data Engineering: A Metareview

With a continuously increasing amount and complexity of data being produced and captured, traditional ways of dealing with their storing, processing, analysis and presentation are no longer sufficient, which has led to the emergence of the concept of big data. However, not only the implementation of the corresponding applications but also their proper quality assurance is a challenging task. To facilitate the latter, this publication presents a comprehensive structured literature metareview on the topic of big data quality assurance. The results provide interested researchers and practitioners with a solid foundation for their own quality assurance related endeavors and therefore help in advancing the cause of quality assurance in big data as well as the domain of big data in general. Furthermore, based on the findings of the review, worthwhile directions for future research were identified, providing prospective authors with some guidance in this complex environment.


Introduction
With the ongoing surge in data production [1], [2], the possibilities for gaining new insights and creating new knowledge based on those data are also tremendously increasing. Yet, with those theoretical opportunities also comes the challenge of actually realizing those potential gains. This is, however, a highly complex task, because traditional ways of storing, processing, analyzing or presenting data are oftentimes no longer sufficient to keep up with the new demands that accompany the development of the available inputs and desired outputs [3]. For this reason, new concepts, techniques, tools and tactics have emerged that are focused on dealing with huge amounts of diverse data while adhering to time constraints [4]. In conjunction with the data itself, they are amalgamated under the term big data, with the corresponding data investigation for the purpose of gaining knowledge and insights being denoted as big data analytics (BDA). To provide value, those findings have to be converted into actions, which will then, for instance, help in winning customers, finding new product ideas, reducing maintenance costs, increasing the efficiency of an organization or predicting future developments [5], [6]. Correctly harnessing big data can therefore demonstrably increase an organization's performance [7]. However, to warrant those benefits and also create trust among the designated users, BDA applications have to be of high quality, to assure the validity of their outputs and facilitate their actual implementation [8], [9]. As with other products and services, this necessitates rigorous quality assurance as part of the creation process, which is in this case called big data engineering [10]. However, this aspect is often somewhat neglected [11], even though low-quality BDA can severely degrade the results [12].
When it comes to the scientific realm, there is a plethora of different approaches, tools and suggestions on how to facilitate the quality assurance of big data applications. Those include, inter alia, benchmarking suites [13], test data generators [14], ETL testing [15], ontology-based testing [16], metamorphic testing [17], the application of test-driven development [18] or the simulation of databases [19]. However, while this variety appears to be beneficial on the one hand, it can also cause disorientation and turn into a distraction when attempting to get an overview of the larger picture. Yet, such an overview can be a highly valuable foundation for future research endeavors, which can build upon the existing knowledge instead of just involuntarily repeating it and which can also be motivated and guided by research gaps that have been previously identified [20]. To facilitate this process, literature reviews are usually the means of choice. However, those are usually limited to certain aspects of a topic, since the vastness of the available content would otherwise be overwhelming. When attempting to get a more comprehensive view of a domain, it is therefore sensible to conduct a metareview, which brings together the findings of those individual reviews. In doing so, not only can a broader outline of the domain be given, but biases, which might be inherent to certain studies, can also be somewhat compensated for.
To provide such a comprehensive view into the domain of quality assurance in big data engineering, this article seeks to answer the following research question (RQ) by conducting a metareview.

RQ: What is the current state of the art in the domain of quality assurance in big data engineering?
To compile the desired answer, the RQ is divided into several sub research questions (SRQ), each dealing with a certain aspect of the big picture. The remainder of the article is structured as follows. Succeeding this introduction, in Section 2, the topic of big data is delineated in more depth, to provide a clear understanding of the subject of discussion. Afterwards, in Section 3, the review protocol is outlined, followed by a presentation of the findings. Those are subsequently discussed in Section 4. Finally, in Section 5, a conclusion of the presented work is given, including a discussion of the study's limitations and avenues for future research.

Big Data
The domain of big data is highly complex and subject to numerous publications, focusing on a broad variety of aspects [21]. While it is neither possible, nor necessary, to comprehensively describe all of them to answer the RQ, a brief overview of the most relevant big data characteristics, its general application and the corresponding quality assurance is given to provide a solid foundation for the ensuing review process.

Application
Since many organizations across various fields of activity can benefit from improved decision making, the use of BDA has reached a variety of domains with highly diverse requirements and contexts. This list comprises, for instance, civil protection [29], healthcare [30], agriculture [31], manufacturing [32], sports [33] and the steering of urban infrastructure [34].
To turn the relevant occurrences in real life into a useful asset for guiding or even automatically triggering actions, a complex chain of activities is usually required. At first, the data have to be acquired. This could, for instance, happen through sensors, which monitor certain events and statuses, through people inputting data, the repurposing of already existing and available data or the sourcing from external origins. Furthermore, those data have to be stored. While it is not always necessary to keep them permanently, in many use cases historic data can be of value. Another common building block of BDA applications is the pre-processing of the data. Here, it is attempted to mitigate potential data quality issues, as well as to account for the data variety. The former means that, for instance, missing data, outliers, typos or invalid entries are accounted for, by either correcting, interpolating or excluding the respective entries, depending on the specified policy. The latter refers to the fusion of data whose amalgamation makes sense from a content perspective but is hampered by differing structures. An example of this would be the use of varying formats for the date specification, depending on the regarded country. While the data might be otherwise compatible, they can only be appropriately joined when unifying the data representations. Once the data are prepared, the actual analysis can ensue. Here, it can be differentiated between descriptive (what happened in the past), diagnostic (why is something the way it is), predictive (what will the future look like) and prescriptive (what should be done) knowledge [35]. Depending on the use case, a variety of approaches and algorithms can be applied, ranging from the creation of rather simple statistics to highly complex machine learning algorithms.
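The unification of data representations described above can be illustrated with a minimal sketch. The date formats, the sample records and the helper `normalize_date` are illustrative assumptions, not taken from the reviewed literature:

```python
from datetime import datetime

# Hypothetical set of region-dependent date formats encountered in the sources.
KNOWN_FORMATS = ["%d.%m.%Y", "%m/%d/%Y", "%Y-%m-%d"]

def normalize_date(raw: str) -> str:
    """Parse a date given in one of several regional formats and
    return it in a unified ISO 8601 representation (YYYY-MM-DD)."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    # Depending on the specified policy, such entries could instead
    # be corrected, interpolated or excluded.
    raise ValueError(f"Unrecognized date format: {raw!r}")

records = ["24.12.2020", "12/24/2020", "2020-12-24"]
print([normalize_date(r) for r in records])  # all map to '2020-12-24'
```

Only after such a unification step can records from differently formatted sources be appropriately joined.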
Finally, the results of the analysis have to be either directly converted into action, for instance, by automatically reconfiguring a machine, or they have to be visualized to present the underlying information in an adequate way that allows the users to obtain the desired insights.

Quality Assurance
Because the benefits of harnessing big data can only be reaped when the quality of the whole process chain is adequate [8], it is obviously necessary to assure that this is the case [36]. However, BDA is reliant on the interplay of several aspects, rendering it a multidimensional endeavor [37]. Consequently, those facets are also relevant when it comes to the quality assurance. Regarding the BDA application itself, there are two aspects that have to be considered. On the one hand, it is important to test that a system actually provides the desired functionality without the algorithms or the internal communication being erroneous, which might skew the results (e.g., if the data import from certain sources is erroneous, the data might be completely ignored or interpreted incorrectly) or even entirely prohibit their use. On the other hand, it is also important to have sufficient performance to keep up with the incoming workload as well as the corresponding requirements. For the latter, benchmarks are oftentimes used to warrant comparability of the results. They can, for instance, provide insights into the resource consumption, the processing speed or the maximum capacity of a solution. Thus, related issues can be identified and subsequently addressed.
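As a rough illustration of the performance side, the following sketch measures the median wall-clock latency of a placeholder workload over differently sized payloads. The function `micro_benchmark` and the word-count workload are hypothetical stand-ins for what a real benchmark suite would collect:

```python
import statistics
import time

def micro_benchmark(func, payloads, repeats=5):
    """Run func over each named payload several times and report the
    median wall-clock latency in seconds, a simple stand-in for the
    processing-speed metrics a benchmark suite would collect."""
    results = {}
    for name, data in payloads.items():
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            func(data)
            timings.append(time.perf_counter() - start)
        # The median is less sensitive to outliers (e.g., GC pauses).
        results[name] = statistics.median(timings)
    return results

# Illustrative workload: a word count, as a placeholder for a real BDA job.
payloads = {"small": "big data " * 1_000, "large": "big data " * 100_000}
print(micro_benchmark(lambda text: len(text.split()), payloads))
```

Comparing such measurements across payload sizes gives a first impression of how a component scales with the incoming workload.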
Moreover, both approaches usually have to be repeated when the regarded system is changed due to additions or modifications. Checking not only the new part of the system, but also the already existing parts, assures that they are not negatively impacted by the change. This practice goes by the name "regression testing". It is especially important in large and highly complex systems, since these can easily become incomprehensible, which makes it hard to keep track of all dependencies and possible side effects. Besides the application's actual implementation, the quality of the used data plays an important role [38], as does the aptitude of the personnel dealing with the systems or being responsible for the corresponding decision making [12]. If the data that an analysis is based on are improper, wrong, insufficient or incomplete, the results will also be flawed. And even if the analysis itself is generally well implemented, but its focus is not suited to solving the underlying issue of the concerned organization, or if the results of the analysis are ignored or skewed in order to support a certain position or strategy, the benefit is heavily reduced. However, since the RQ is focused on big data engineering, the data quality, the human component with respect to the solution's use and also the monitoring of already running solutions are not relevant here. Therefore, in the following review protocol, only the testing and the benchmarking of BDA applications are considered.
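The regression testing practice described above can be sketched minimally: an expected output from a previously validated version is pinned and re-checked after every change. The aggregation function `mean_per_key` and its expected result are hypothetical examples, not part of any reviewed system:

```python
from collections import defaultdict

def mean_per_key(pairs):
    """Toy aggregation component: compute the mean value per key."""
    sums, counts = defaultdict(float), defaultdict(int)
    for key, value in pairs:
        sums[key] += value
        counts[key] += 1
    return {k: sums[k] / counts[k] for k in sums}

def test_mean_per_key_unchanged():
    # Expected output pinned from a previous, validated release;
    # re-running this after every change surfaces unintended side effects.
    pairs = [("a", 1.0), ("a", 3.0), ("b", 10.0)]
    assert mean_per_key(pairs) == {"a": 2.0, "b": 10.0}

test_mean_per_key_unchanged()
print("regression test passed")
```

In a real system, a suite of such pinned tests would cover both the changed and the unchanged components.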

The Review
To answer the RQ, a structured literature review (SLR), oriented on the methodologies proposed by Webster and Watson [39], Levy and Ellis [40], and Okoli [41], is conducted. Since such reviews are not supposed to be ends in themselves, but valuable tools for the research community as a whole, it is necessary to assure rigor. Furthermore, a comprehensive description of the process that is as detailed as possible should be given to enable others to retrace and evaluate the undertaken steps, so they can judge the review's value for their own work and possibly also build upon it [20].
However, unlike in common SLRs, in a metareview the focus is not on primary studies but on existing reviews concerning a topic, which are collated, analyzed and brought together to generate new insights and provide a broader overview of the underlying subject. Apart from that, the general methodology remains the same.

Protocol
To start the whole search process, which is depicted in Figure 1, a set of promising databases to be considered was determined. Since the goal is to provide a comprehensive overview of the domain, this list should be rather extensive, to reduce the risk of missing out on relevant publications. For the presented review, no distinction was made between meta databases and publisher-bound databases; instead, both types were included, further increasing the likelihood of finding all suitable contributions. Following this line of thought, ACM Digital Library, AISeL, Emerald Insight, IEEE Xplore, Mary Ann Liebert, Sage, Science Direct, SciTePress, Scopus, Springer Link, Taylor & Francis and Wiley were included in the search process. While it was initially also planned to include the repository of IGI publishing, its search engine did not allow for a sufficiently elaborate filtering and it was therefore dropped. Based on the starting date of the literature review and to increase traceability by only considering completed years, the regarded timeframe comprised everything published before 2021 and, as depicted in Table 1, only conference publications and journal articles were considered, since those are peer-reviewed, assuring a certain degree of quality. Since the RQ focuses on getting an overview of quality assurance in big data engineering, this comes down to testing and benchmarking, which is reflected in the search terms that include big data as the regarded domain, relevant synonyms for overview and test, benchmark or quality. Besides the general process, Figure 1 also shows the specific number of findings for each search engine after accounting for the duplicates that resulted from the multiple queries that had to be conducted for SciTePress and Springer Link.
Excerpt from Table 1 (applied search terms):
- Title: "Big Data" AND ("Review" OR "Survey" OR "Overview" OR "State of the Art"); Anywhere: Quality OR Test* OR Benchmark*; limited to journal or conference publications
- IEEE Xplore: ("Document Title": "Big Data") AND ("Document Title": "Review" OR "Document Title": "Survey" OR "Document Title": "Overview" OR "Document Title": "State of the Art") AND ("All Metadata": Test* OR "All Metadata": Quality OR "All Metadata": Benchmark*)
- Title: "Big Data"; Title: "Review" OR "Survey" OR "Overview" OR "State of the Art"; Title/Abstract/Keywords: Quality OR Test* OR Benchmark*

With the included search engines' filters not necessarily being identical, it was not possible to use the exact same search terms and settings for all of them. However, the conditions of the search were kept as similar as possible. The applied mapping of databases and search terms is shown in Table 1. As shown there, due to their specifics, the searches in Springer Link and SciTePress had to be split into several parts, which were later merged again. Furthermore, while most search engines support wildcard symbols (*), Science Direct does not, which leads to a minimal deviation from the Scopus search term.
The obtained papers were subsequently merged into two sets, one containing conference publications and the other one journal articles. This resulted in 1485 entries for the former category and 1365 for the latter. Those numbers already take the previous removal of duplicates within the same search engine (this applies to Springer Link and SciTePress) into account. Afterwards, those collections were cleansed of duplicates in their entirety, leaving 1434 unique conference papers and 1337 articles. Those papers were then filtered using the inclusion and exclusion criteria presented in Table 2. While all the inclusion criteria had to be met for a paper to be accepted, the fulfilment of any exclusion criterion led to its dismissal. In a first step, the papers' titles were read and those deemed unsuitable discarded. This already led to a reduction to 56 conference papers and 32 journal articles. To somewhat validate the comprehensiveness and also potentially add further inputs, DBLP, Google Scholar, ResearchGate and Semantic Scholar were used, which were not considered for the initial search, since they tend to deliver numerous but oftentimes also irrelevant results. For those, the search parameters depicted in Table 3 were applied. While this might not provide full coverage, it should at least give a general idea of whether the previous searches were majorly incomplete or, in case of only limited additions, adequate. For each of those search engines, the first 100 entries were regarded, since the results are already sorted by relevance.

For ResearchGate, the query "Big Data" AND ("Review" OR "Survey" OR "Overview" OR "State of the Art") AND (Quality OR Test* OR Benchmark*) was applied, regarding the first 100 conference papers, articles or book chapters, the latter because ResearchGate sometimes classifies conference papers as book chapters.
Those further searches resulted in the addition of two more journal articles to the regarded body of literature. While this shows that the initial search did not provide complete coverage, the relatively low number also indicates that the search was already very comprehensive. In a next step, the gathered publications' abstracts and keywords were read, leaving 13 conference papers and five articles. After reading the introductions and conclusions, the set was reduced to seven contributions from conferences and five from journals. In a last filtering step, those were skimmed in their entirety, leaving five conference papers and four journal articles as the final set of reviews for answering the RQ, with no additional findings when conducting a backward search in their respective reference sections. Those numbers therefore constitute the first half of the answer to SRQ 1 and are also relevant concerning SRQ 2.

Findings
The set of relevant papers, resulting from the conducted review, is shown in Table 4, which therefore answers SRQ 1. Furthermore, with the added information concerning the publication outlet and the current citation score, SRQ 2 is answered as well. The depicted number of citations for each paper was queried in Google Scholar on February 18, 2021. In the following, each of the nine papers is briefly presented, thereby addressing SRQ 3. In [42], which is the earliest paper considered in this metareview, 18 benchmarks for measuring the performance of varying tools for (big) data management are presented and compared. The focus is on different types of databases and specific technologies like Hadoop or MapReduce. Furthermore, to account for big data's inherent variety, which also heavily influences the corresponding systems, the authors state the necessity to not benchmark just end-to-end or component-based, but to combine both approaches to get a more comprehensive assessment of the evaluated systems. For the same reason, selecting benchmarking data that match real-world workloads, instead of just using SQL queries to gauge a system's performance, is highlighted as a necessity. Additionally, a plea is made for considering scalability, energy efficiency, fault tolerance (by purposefully injecting failures into the target system) and also security aspects on top of the common performance metrics when benchmarking big data systems.
One year later, [43] directed their attention to the requirements and challenges concerning the generation of data used for benchmarking big data applications. They point out the importance of a benchmark being application-specific and, subsequently, the need to identify typical workload behaviors as a foundation for the actual performance evaluation. Besides that, they state several requirements for a successful benchmark, namely the capability to adapt to different data formats, portability to a broad spectrum of representative software stacks, fairness (e.g., not comparing a system that is using the default configuration with a perfectly configured one), extensibility and usability. Furthermore, they give an overview of existing benchmarks, which comprises ten instances and puts special emphasis on their data generation techniques. Regarding the challenges, they especially highlight the controllability of the data velocity, the evaluation of the generated data's veracity and the enrichment of the data with meta information (e.g., arrival rate). Further, they call for the support of heterogeneous hardware platforms, a focus on the adaptability and reusability of existing components as well as an extension of the use cases of the benchmarks, since the respective applicability of the examined ones was rather narrow.
The most extensive overview of big data benchmarking tools was provided by [44], in which 20 benchmarks and two benchmarking platforms are described. Furthermore, it is highlighted that the existing benchmarks are usually rather specific and that, even though it is desirable to have an objective and universal benchmark, taking account of all the relevant aspects renders its creation a highly complex task. Besides that, as in the previous papers, the importance of workload diversity is stressed, along with the possibility to integrate new workloads. It is also pointed out that workloads in a suite should be seen as complementary and redundancy should be avoided. Additionally, an urgent need for new benchmark metrics is expressed, since the ones encountered in the course of the study were deemed insufficient.
An overview of quality assurance techniques was given in [45]. The paper lists and summarizes ten contributions that each deal with one of the following topics: testing, model-driven architecture (MDA), monitoring, fault tolerance, prediction and verification. Further, the functional or nonfunctional properties and big data Vs that are the most relevant with respect to those categories are listed. The authors also discuss three major issues, namely a lack of understanding of big data quality assurance techniques, a lack of quality assurance standards for big data and the challenge of keeping up with technological development. By including MDA and fault tolerance, it is highlighted that quality assurance can already be considered in the early design phases and in the choice of the development approach, and not just after something has been implemented.
In the same year, a systematic mapping study of database management systems (DBMS) assessment was presented in [46]. There, it was analyzed which DBMS were evaluated in the literature and which benchmarks were used for this purpose. During the study, the heterogeneity of the domain was highlighted, which was reflected in a plethora of encountered approaches. One additional finding was that a large proportion of the regarded primary studies consisted of proposals and evaluations in a laboratory context, showcasing a lack of validation in industrial settings.
The first journal article to appear in this collection [47] also targets the domain of big data benchmarking. It proposes three categories: micro benchmarks (for individual components), end-to-end benchmarks (for the entire system) and benchmark suites, which combine the former two to provide comprehensive benchmarking capabilities. Furthermore, the systems are also divided into three types, Hadoop-related systems, data stores and specialized systems, which are further partitioned into subsets. Subsequently, the derived categories are joined to form a matrix, which is then used to categorize the 37 regarded benchmarks. Additionally, this overview is enriched with further information, and the underlying data generation techniques are also specifically discussed. Further, an overview of the encountered evaluation metrics is given. The paper's main contribution is, therefore, to provide a comprehensive summary of the domain's state of the art. The main challenges for big data benchmarking that are pointed out in this article are the assurance of relevancy, portability and scalability as well as the generation of suitable test data and the assessment of their veracity.
In the latest of the collection's papers that are focused on benchmarking [48], six types of big data technologies are determined. Those are NoSQL databases, SQL Systems, batch processing, stream processing, graph processing and deep learning/machine learning. Example technologies for those types are stated, applicable benchmarks shown and corresponding scientific publications are introduced. This contribution once again exhibits that there is no universally applicable solution and benchmarking remains a rather individual and tool specific task. Nevertheless, the authors highlight the lack of collaborative efforts towards the creation of benchmarks and suggest the pursuit of community driven approaches.
While [49] is not a pure literature review on quality assurance in big data, but takes a broader perspective, it still extensively deals with those aspects. In the course of this work, several categories of quality assurance challenges are elaborated. Those are the verification of test results, the resource-intensity of the testing environments, which can make it highly challenging to simulate real-world scenarios, the generation of appropriate test data, error-tracing and the difficulties that come with distributed log files, verification, a weakened notion of data consistency and the assessment of data quality, which is, however, out of this review's scope. Specific to software testing in the big data domain, the paper states three dimensions. The first is the test objective with its three categories: data quality testing, functional testing and non-functional testing. As a second dimension, the granularity level states whether algorithms, components, subsystems or the system as a whole are regarded. Finally, the test execution level states whether a system is evaluated in a static fashion, with dynamic tests or by monitoring the running system. Furthermore, the paper highlights open research challenges such as the abandonment of costly test environments, the generally immature tool support for testing big data applications, scaling issues, the absence of a suitable testing approach that is geared towards velocity as one of big data's characteristics and the oracle problem. To alleviate the latter for the quality assurance of BDA applications that are based on machine learning, the authors highlight metamorphic testing as a promising approach.
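Metamorphic testing, as highlighted above for the oracle problem, can be sketched as follows: instead of checking an exact expected output, a metamorphic relation between transformed inputs is checked. The mean computation and the permutation-invariance relation are illustrative assumptions chosen for brevity:

```python
import random

def mean(values):
    """Component under test: a simple mean computation."""
    return sum(values) / len(values)

def check_permutation_invariance(values, trials=10, tol=1e-9):
    """Metamorphic relation: shuffling the input of a mean computation
    must not change the result, even when the exact expected value
    (the oracle) is unknown."""
    baseline = mean(values)
    for _ in range(trials):
        shuffled = values[:]
        random.shuffle(shuffled)
        if abs(mean(shuffled) - baseline) > tol:
            return False
    return True

data = [0.5, 2.5, 4.0, 7.0]
print(check_permutation_invariance(data))  # prints True
```

Analogous relations (e.g., scaling all inputs by a constant) can be formulated for far more complex analytics where no test oracle is available.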
Finally, the latest contribution [50] can probably be seen as a continuation of [45]. This refers to the content, but also to the role of Pengcheng Zhang, who co-authored both papers. In comparison to traditional applications, for big data ones, it especially highlights the challenges of conducting large-scale statistical computation based on diverse data, the utilization of machine learning techniques, decision-making under uncertain conditions and complex requirements for visualization. Consequently, the corresponding solutions also have to be tested. In total, the paper covers 83 publications, which were found through a systematic literature review, making it the most comprehensive one in the regarded list. The authors identify the most relevant big data quality attributes, illustrate their relations to the big data characteristics, give an overview of the most frequently used big data quality assurance technologies and, as in [45], point out that quality assurance can already take place before the actual implementation. A main result of this work is that novel approaches for testing in the big data domain are still needed, since there is a significant difference between the testing of traditional software and the testing of big data applications. Moreover, a prevailing necessity for crafting individualized quality assurance solutions is emphasized. Further highlighted challenges include a lack of awareness for and understanding of approaches for big data quality assurance and, in line with the findings of [49], a paucity of high-quality tools for its execution.

Discussion
Looking at the found publications, a strong focus on benchmarking becomes apparent, whereas the testing of applications is rather underrepresented. However, there seems to be a growing interest in it, or at least in a more comprehensive view of quality assurance. When regarding only the timeframe starting in 2017, there is parity in the number of publications found in this review. Even though this is only a small sample and, therefore, not representative, it indicates a growing interest in more comprehensive quality assurance aspects that go beyond pure benchmarking. Furthermore, it is very noticeable that there has been a shift from conference publications to journal articles, which can be seen as an indicator of the topic's growing importance.
While the answers to SRQ 1, SRQ 2, and SRQ 3 were rather straightforward, SRQ 4 and SRQ 5 are more complex. One of the common themes of the analyzed publications is an emphasis on the individuality of the quality assurance of a given solution. While there are numerous benchmarks, their applicability is always limited to certain use cases. A universally usable benchmark does not exist and, given the challenges of its creation, it can be assumed that this situation will prevail in the foreseeable future. For testing, the situation is similar. Even though certain patterns have been identified, the actual implementation is still an individual endeavor. A need for novel approaches to testing in the big data domain was explicitly stated, since there is a significant difference between testing traditional software and testing big data applications; and while benchmarking generally seems to be better understood, it is also far from mature. This lack of maturity was highlighted across the board, as was the existence of many big challenges that can mostly be attributed to the properties of big data and its particular characteristics. Therefore, the most promising directions for future research, with the most immediate positive effects, seem to be the development of frameworks and approaches for the testing and benchmarking of big data applications that provide a general structure and at the same time grant a large amount of freedom on the individual level. However, since benchmarking currently seems to be more intensely researched and there are, while not universally applicable, at least numerous solutions for individual tools and techniques, the shift in research interest from benchmarking towards more general studies, but also towards testing specifically, appears to be justified.
Another repeated theme in the regarded publications was the segmentation of the quality assurance into different levels. This is mentioned for the first time in the earliest of the papers [42] and still prevails [49]. Therefore, the creation of tests or benchmarks that act on a solution's component and system level can be seen as a necessity to account for the regarded application's complexity, with at least the testing probably calling for even more granularity, as indicated in [49] but also in other works such as [18]. The latter, proposing the application of test-driven development in big data engineering, also serves as an example of new approaches to the quality assurance in big data, which are being called for, as in [50], to help advance the domain as a whole.

Conclusion
With a continuously increasing amount and complexity of data being produced and captured, traditional ways of dealing with their storing, processing, analysis and presentation are no longer sufficient. This has led to the emergence of the concept of big data as well as countless tools and techniques for the implementation of corresponding applications. Since its harnessing promises sizeable benefits, organizations all over the world are heavily invested in it. However, when striving for those benefits, it is not only necessary to handle this rather new source of insights, but also to assure that the quality of the developed solutions is high, since otherwise the obtained results might be flawed, which can, in turn, lead to their disregard or even detrimental effects. Yet, the domain of quality assurance in big data engineering is far from mature. To facilitate bridging this gap, this article examined the current state of the art in the domain of quality assurance in big data engineering. For this purpose, a comprehensive structured literature metareview on the topic of big data quality assurance was conducted. The results provide interested researchers and practitioners with a solid foundation for their own quality assurance related endeavors and, therefore, help in advancing the cause of quality assurance in big data as well as the domain of big data in general. Furthermore, based on the findings of the review, worthwhile directions for future research were identified, providing prospective authors with some guidance in this rather complex environment.
Those are mainly the development of high-level frameworks and approaches for the testing and benchmarking of big data applications, research concerning the segmentation of the quality assurance into different levels, the exploration of new approaches to the quality assurance in big data, and a general shift away from somewhat neglecting issues outside of benchmarking towards more general and testing related studies.
However, as for any review article and despite the authors' best efforts, a certain degree of subjectivity will always remain in the decisions concerning the inclusion or exclusion of the evaluated publications as well as in the analysis of those that were accepted into the final set. This might, in turn, influence the obtained results. Additionally, while the chosen approach of conducting a metareview allows for a broader overview compared to the use of primary sources, it potentially also reduces the level of detail that is captured and results in a high dependency on the preliminary work of others. Even though the respective publications have all been rigorously evaluated through peer review, had to undergo additional screening in the course of this work and have, except for two of the most recent ones, also been recognized in the form of citations, this still constitutes a threat to validity that has to be taken into account. Moreover, it has to be recognized that the publication at hand constitutes just a snapshot of a rapidly developing environment in which tasks, requirements, tools, techniques and methods constantly evolve. Therefore, the validity of some of its insights and propositions might be limited in time. One possibility to track the domain's development in the future, as well as to further increase the findings' reliability and enhance them, could be to repeat the review at regular intervals and to complement its insights with practitioner interviews.