---
title: Opt-in for quality insights
description: "Collecting data only with user consent has a less obvious implication: the quality of insights from web analytics increases."
date: 2020-10-28
slug: opt-in-quality
url: /blog/opt-in-quality/
sitemap_priority: 0.7
image_url: /theme/images/offen-blog-0120-opt-in-quality.jpg
author: Hendrik Niefeld
must_read: True
bottom_cta: matomo
---
# Opt-in for quality insights
### Fair web analytics
A key feature of our fair and open web analytics tool [Offen](https://www.offen.dev/get-started/) is that data will only be collected after website users have opted in. This is absolutely necessary for a fair data transfer, but also comes with another, not so obvious implication.
Collecting data only with *user consent has a significant impact on the quality of analytics* insights, especially for operators of smaller websites.
### Analyzing our own turf
Our own homepage [offen.dev](https://www.offen.dev/), on which of course an Offen instance is installed, can be described as rather small. It currently has an average of 280 unique users after opt-in and 660 verified page views per month.
We estimate our opt-in rate, meaning the percentage of website users who agree to the data collection, to be about 40%. This figure is a subjective estimate, derived solely from the personal feedback of a relatively small group of test users.
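If that estimate were roughly accurate, our 280 measured unique users would correspond to about 280 / 0.4 = 700 actual human visitors per month before consent. A back-of-the-envelope figure, nothing more.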
We cannot and do not want to measure the actual rate for obvious data protection reasons. What we are sure about is that it depends very much on the content offered. In our particular case, where product and presentation are so closely interwoven, it should be rather high compared to other content.
However, we do not think it really matters to know exactly what your opt-in rate is. But more about this later.
### Collecting data differently
First let's take a look at some numbers provided by our web analytics tool. These are the key metrics for our website from the randomly selected time frame of 22 Aug 2020 to 30 Aug 2020.
<img class="mt3 mb2" alt="Figure A" src="/theme/images/offen-blog-0120-opt-in-quality-A.svg"/>
To get an overview of our total traffic in the same time frame, we use [GoAccess](https://goaccess.io/){: target="_blank"} to analyze our server logs. "Total traffic" is a rather symbolic term here, since the *exact number of visitors can never be determined* by any method. Even if we leave aside all non-human traffic, a combination of ad blockers, privacy tools and bugs reliably prevents an absolutely accurate measurement.
<img class="mt3 mb2" alt="Figure A" src="/theme/images/offen-blog-0120-opt-in-quality-B.svg"/>
Not surprisingly, far more data is generated in our server logs than with our web analytics tool, which is strictly based on user consent. The following reasons explain why the difference is so significant.
Visitors in the server logs are identified on the basis of a single day and may therefore have been counted several times across recurring visits. In addition, our logs count visitors rather than unique users and cover all non-human traffic on our website: search engines indexing the site and all other page views generated by software agents are included.
According to the [7th Annual Bad Bot Report](https://www.imperva.com/resources/resource-library/reports/2020-bad-bot-report/){: target="_blank"} (Imperva Threat Research Lab, 2020), average *non-human traffic on websites has now grown to more than 37%.* Two thirds of this non-human traffic is accounted for by so-called bad bots. This software interacts with your website in the same way a human user would, which makes it more difficult to detect and block.
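To make the difference concrete, here is a minimal sketch of this kind of log-based counting (assuming the widespread Nginx/Apache "combined" log format; an illustration of the principle, not what GoAccess actually does internally). A returning visitor is counted once per day, and only bots that announce themselves in the user agent can be filtered out:

```python
import re
from collections import defaultdict

# Matches the common "combined" log format, e.g.
# 203.0.113.7 - - [22/Aug/2020:14:03:11 +0200] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 ..."
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<day>[^:]+):[^\]]*\] '
    r'"[^"]*" \S+ \S+ "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Only bots that identify themselves can be filtered like this;
# "bad bots" sending browser-like user agents slip through.
OBVIOUS_BOTS = ("bot", "crawler", "spider", "curl")

def daily_visitors(lines):
    visitors = defaultdict(set)
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue
        if any(token in match["agent"].lower() for token in OBVIOUS_BOTS):
            continue
        # A "visitor" is an (IP, user agent) pair seen on one day, so the
        # same person returning the next day is counted again.
        visitors[match["day"]].add((match["ip"], match["agent"]))
    return {day: len(seen) for day, seen in visitors.items()}

with open("access.log") as logfile:
    for day, count in daily_visitors(logfile).items():
        print(day, count)
```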
Let us therefore take a closer look at the quantity and quality of referrer domains collected by both methods.
<img class="mt3 mb2" alt="Figure A" src="/theme/images/offen-blog-0120-opt-in-quality-C.svg"/>
Our server logs collected more than twice as much data over the same period. Unfortunately, one third of it was noise, which we were able to avoid completely with our own tool.
We consider entries that originate from server networks without a useful domain name, as well as entries with obvious marketing content, to be plain spam. Furthermore, all entries without an explicit link that do not come from a search engine are considered questionable.
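As a rough sketch of how such a classification could be automated (the keyword lists and the `links_to_us` parameter below are invented for illustration; we made this assessment by hand):

```python
import re
from urllib.parse import urlparse

# Illustrative lists only; any real deployment would need its own.
SEARCH_ENGINES = ("google.", "bing.", "duckduckgo.", "ecosia.")
MARKETING_WORDS = ("seo", "free-traffic", "share-buttons", "ranking")

def classify_referrer(referrer: str, links_to_us: bool) -> str:
    """Return "spam", "questionable" or "ok" for one referrer entry.

    links_to_us: whether the referring page actually links to our site,
    which in practice has to be verified manually or by crawling.
    """
    host = (urlparse(referrer).hostname or "").lower()
    # Server networks without a useful domain name (e.g. a bare IP
    # address) or hosts with obvious marketing content are plain spam.
    if re.fullmatch(r"[0-9.:]+", host) or any(w in host for w in MARKETING_WORDS):
        return "spam"
    # Entries without an explicit link that do not come from a search
    # engine are questionable.
    if not links_to_us and not any(se in host for se in SEARCH_ENGINES):
        return "questionable"
    return "ok"
```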
Perhaps this interference is irrelevant on websites with very high traffic. However, if your website never has more than a hundred unique users per day, the noise generated by *spam will have a significant impact on your analytics results.*
Common web analytics tools try to solve this problem by blocking individual traffic sources. But all the domains considered questionable, and some of the spam-related ones, would certainly still be included there. In any case, this approach leads to long lists of spam referrers in the respective code, which by definition are always out of date. An arms race that the developers of these tools inevitably lose. Is all this really necessary?
### Real human users
We don't think so. An "opt-in only" policy for data collection, which is necessary anyway for privacy reasons, solves the problem along the way. Bypassing the opt-in banner takes considerable engineering effort, so the users you record are, with very high probability, real humans. The best starting point for optimizing your website.
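One way to picture this principle on the collecting side is an endpoint that silently drops every event not accompanied by an explicit consent token, which only the banner's "accept" handler ever sets. A minimal sketch using Flask; this is not Offen's actual code, and the endpoint and cookie names are invented:

```python
from flask import Flask, request

app = Flask(__name__)

@app.post("/events")  # hypothetical collection endpoint
def collect_event():
    # The cookie is only ever set by the consent banner's "accept"
    # handler. Crawlers and spam bots that never render and operate
    # the banner therefore never get past this check.
    if request.cookies.get("analytics-consent") != "allow":
        return "", 204  # silently drop the event
    store_event(request.get_json(silent=True) or {})
    return "", 201

def store_event(event: dict) -> None:
    """Persist one usage event; storage details are out of scope here."""
```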
Talking about these real users brings us back to the question of whether it is important to know your exact opt-in rate. For an answer, consider which users you want to optimize your website for and what kind of users you want to attract.
*Those who consent are most likely interested in your content.* They support you with their usage data and may therefore be willing to support you in any other way. The exact share of these users is less interesting.
### Deeper insights for optimization
Nevertheless, common web analytics tools that collect data without user consent at least provide better results than an analysis of your server logs. Yet their quality is bound to be lower than that of the results obtained from the smaller amount of data generated by opt-in-only collection. This is because some further issues are simply not manageable without some form of consent banner.
Many users are recorded even though they visited your website with very little or no interest. Some bounce immediately and may just have landed there by mistake. Still, all these data points are included in your analytics and will give you a distorted impression. The resulting false assumptions distract you from the important users and make it difficult to keep the necessary focus.
This is why using all available data is not the way to do better web analytics. Only *a careful selection of the data to be evaluated leads to deeper insights* for optimization. All the better if this can be done in combination with a privacy-friendly approach to data collection.
### Try Offen today
If you are looking for a self-hosted and lightweight alternative to common web analytics tools and want to optimize your website for quality, you should give Offen a try. Why not run it in parallel with your current tool for a while and see how it feels? We are looking forward to your feedback.
Give it a spin with our [demo](https://www.offen.dev/try-demo/) or directly head to our [get started](https://www.offen.dev/get-started/) section.