UPDATE: The problem of referrer spam has grown since we first published this article. The original focus was excluding Semalt, but our current recommendation for removing spam from Google Analytics covers a much broader range of troublemakers, including ghost spammers. Very important note: never, ever visit the site of a suspected spammer that appears in your analytics data.
If you pay attention to your site traffic and dig into Acquisition data in Google Analytics to explore domains that refer traffic to your site, you may have seen one domain in particular making a frequent appearance on your referrer list. You’re not alone – far from it.
That same domain is appearing in the referrer list of websites across the entire internet. Semalt.com, including all of its many subdomain variations, first raised its ugly head around December 2013. Many sites saw a jump in Semalt traffic in February 2014, followed by a drop off and a second but smaller rise in July.
We don’t plan to link to their domain at any point in this article. We do plan to give you a better understanding of what semalt.com is, why it’s a problem, and what you can do about it.
Throughout a discussion of Semalt, it’s important to keep some perspective. Semalt is a nuisance, and for many a big nuisance. But based on its behavior to date, it’s not a security threat. It distorts the analytics data for your site, but we’ll provide ways to correct that distortion under Google Analytics. Even if your site gets a lot of both total and referral traffic, the distortion still exists, and it’s significant. In fact, referral spam has grown so much that if you don’t exclude it, you’re probably not looking at good data. This article is worth a read because you’ll learn how Google Analytics features, including one released very recently, can be used to improve analytics data through better use of filters and segments.
Semalt’s Reputation Around the Web
On its site, Semalt purports to be a “webmaster analytics tool” but we – along with many others – are skeptical. Online Threat Alerts observed that the “service this website provides and is asking you to pay for, is available online for free.” Hello SEO Copywriting discussed the “social media outcry” and included several tweets from Semalt manager @AlexAndrianov1, but his tweets are now protected. Perhaps responding to the complaints was more than a single person could handle.
In Bill Hartzer’s post discussing ways to block Semalt, he mentions reports that claimed Semalt used a trojan botnet to do its crawling and a Semalt-related site served as a malware distributor. His take: “…this is a mess.”
One of the most amusing versions of the Semalt story comes from The Grinning Skull, offering a reply from Semalt in response to an email from that site:
Thanks you for your explicate and frank letter.
To begin with Semalt … is not a phishing or spam site, but a professional keyword ranking monitoring service that has nothing to do with CIA, NSA, Snowden, Masonic conspiracy or the World Evil.
I would like to bring apology on the behalf of our company if you had some troubles using our service.
Note to Semalt: When your own defense includes a denial of “Masonic conspiracy or the World Evil”, you’re probably up to no good.
Why Semalt Is Bad
Whatever its real purpose, Semalt is bad for us as a host and for you as a webmaster.
Semalt Traffic Is Referral Spam That Wastes Bandwidth
Let’s tackle the simpler of those 2 reasons first. Semalt is not a real referrer; it’s referral spam. Referral traffic is traffic where a web surfer clicked a link to your site while visiting the referring site. The visitor’s browser uses the optional Referer field in the HTTP request to tell your server which page the visitor came from, the last step in the visitor’s surfing journey. This referral data is captured by analytics programs such as Google Analytics so you can understand your site’s traffic sources. No web surfer clicked a link to one of your pages on Semalt’s site. Instead, Semalt itself visited your site but stuffed the Referer field with its domain information, creating a false impression and distorting your analytics data.
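To see why the Referer field can’t be trusted, here’s a minimal Python sketch. It only builds request objects and never sends them; the target URL and the "genuine" referring page are made-up placeholders. The point is that whoever makes the request chooses the Referer value, so a spoofed request is indistinguishable from a real click-through on that field alone.

```python
from urllib.request import Request

# Hypothetical target page; example.com is a reserved documentation domain.
TARGET_URL = "https://example.com/some-page/"


def build_request(referer: str) -> Request:
    """Build (but do not send) a GET request with a caller-chosen Referer."""
    return Request(TARGET_URL, headers={"Referer": referer})


# A genuine click-through and a spoofed visit differ only in a value
# that the requester controls completely.
genuine = build_request("https://blog.example.org/post-that-links-to-you/")
spoofed = build_request("http://semalt.com/")

for req in (genuine, spoofed):
    print(req.get_full_url(), "<-", req.get_header("Referer"))
```

Nothing in the request proves that a person ever clicked a link on the page named in the Referer field.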
Since it does not represent real referral traffic but masquerades as such, Semalt wastes bandwidth. For individual sites hosted on wpPERFORM, there’s no meaningful impact on server performance because of Semalt traffic, but that may not be true if your site is hosted on shared hosting with very limited resources. For us as a host, Semalt wastes bandwidth in small amounts at virtually every site we host, but combined, that adds up.
UPDATE: Over time, analytics spammers have developed different techniques to corrupt your analytics data, and many never even visit your site. Instead, spammers use the Google Analytics Measurement Protocol to send data directly to your Analytics account without ever dropping by your site. This is referred to as ghost spam. That makes the “wasting bandwidth” argument less and less accurate over time, but the impact varies by site and the mix of spammers in your analytics data.
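Because ghost spammers never load your pages, they usually don’t know (or bother to set) the hostname that your real traffic reports. The Python sketch below illustrates that idea on Measurement Protocol style payloads, where dh is the document hostname and dr is the document referrer. Google Analytics doesn’t hand you raw hits like this, so treat it purely as an illustration; the property ID, client IDs, and list of valid hostnames are assumptions.

```python
from urllib.parse import parse_qs

# Hostnames that legitimately serve your tracking code (assumption: adjust to your site).
VALID_HOSTNAMES = {"wpperform.com", "www.wpperform.com"}


def looks_like_ghost_hit(payload: str) -> bool:
    """Return True when the hit's hostname (dh) is missing or isn't one of ours."""
    params = parse_qs(payload)
    hostname = params.get("dh", [""])[0].lower()
    return hostname not in VALID_HOSTNAMES


# Illustrative payloads only; every ID below is made up.
real_hit = "v=1&tid=UA-XXXXX-Y&cid=555&t=pageview&dh=wpperform.com&dp=%2Fblog%2F"
ghost_hit = (
    "v=1&tid=UA-XXXXX-Y&cid=777&t=pageview"
    "&dh=spam.example&dp=%2F&dr=http%3A%2F%2Fsemalt.com%2F"
)

for name, hit in (("real", real_hit), ("ghost", ghost_hit)):
    print(name, "-> ghost?", looks_like_ghost_hit(hit))
```

This is the reasoning behind filtering on hostname as well as on source: a spammer can fake a referrer easily, but often doesn’t supply a hostname you recognize.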
For Webmasters, Semalt Is a Disaster Because It Distorts Analytics Data
If you’re not familiar with the Semalt problem and ignore it, you end up with distorted analytics data. Your reported traffic and bounce rate are higher, while your session duration, share of returning visitors, and pages viewed per session are lower than the real values of those metrics without including Semalt traffic. Traffic from Semalt consistently has these characteristics:
- a 100% bounce rate, which inflates your actual bounce rate
- a 0 second session duration, which lowers your average session duration
- every Semalt session is a new session, which lowers your share of returning visitors
- a single page view, which lowers your average pages viewed per session if they’re otherwise in excess of 1
All of those characteristics make your visitors appear less engaged than they really are, and that could lead you to waste time and money trying to fix an engagement problem that isn’t real.
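To make the distortion concrete, here’s a small Python sketch with made-up numbers. It assumes a simplified definition of a bounce (a session with exactly one page view) and models each spam session the way the list above describes it: one page view, 0 seconds, never a returning visitor.

```python
from dataclasses import dataclass


@dataclass
class Session:
    pageviews: int
    duration_seconds: float
    returning: bool


def summarize(sessions):
    """Compute the engagement metrics discussed above for a list of sessions."""
    n = len(sessions)
    return {
        "sessions": n,
        "bounce_rate": sum(s.pageviews == 1 for s in sessions) / n,
        "avg_duration": sum(s.duration_seconds for s in sessions) / n,
        "pages_per_session": sum(s.pageviews for s in sessions) / n,
        "returning_share": sum(s.returning for s in sessions) / n,
    }


# Hypothetical real traffic (8 sessions) and spam traffic (4 sessions).
real = [
    Session(3, 120, True),
    Session(1, 0, False),
    Session(4, 200, True),
    Session(2, 60, False),
] * 2
spam = [Session(1, 0, False)] * 4

clean = summarize(real)
reported = summarize(real + spam)

for metric in clean:
    print(f"{metric:18} real={clean[metric]:.2f}  reported={reported[metric]:.2f}")
```

With these sample numbers, the spam doubles the bounce rate and drags down every other engagement metric, even though no real visitor behaved any differently.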
Semalt inflicts the most pain on low traffic sites, because it introduces the most distortion to their analytics data. Semalt’s traffic represents a bigger share of their total and referrer traffic, so the distortion can be significant. If your site only got 10 total referral visits on a given day and 5 of those were fake Semalt referrals, your data was inflated to 2 times its real value. That’s a real scenario we’ve seen on many wpPERFORM sites. Lower traffic sites are also more likely to be run by small businesses without the technical resources to deal with the problems that Semalt manufactures.
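The arithmetic behind that example is simple enough to sketch in Python. The function name is ours, and the high-traffic figure is a made-up comparison point.

```python
def inflation_factor(real_visits: int, spam_visits: int) -> float:
    """Reported referral visits divided by real referral visits."""
    if real_visits <= 0:
        raise ValueError("real_visits must be positive")
    return (real_visits + spam_visits) / real_visits


# The low-traffic scenario above: 10 reported visits, 5 of them fake.
low_traffic = inflation_factor(real_visits=5, spam_visits=5)

# The same 5 fake visits on a hypothetical site with 500 real referral visits.
high_traffic = inflation_factor(real_visits=500, spam_visits=5)

print(f"low traffic:  {low_traffic:.2f}x")
print(f"high traffic: {high_traffic:.2f}x")
```

The same handful of fake visits doubles the reported figure on the small site and barely moves it on the large one.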
If you’re aware of the Semalt problem, you’ll have the burden to restore accuracy to your traffic monitoring, but Semalt’s tactics mean your extra work might never be finished.
As Dragonfly PR observed just a few days ago:
Once you have excluded one, another simply turns up. These visits are reported to be coming from over 20 countries around the world, including Brazil, Italy Mexico and Croatia, and are all doing the same thing – increasing bounce rates and decreasing page / visits and page views.
We’ll cover some of the extra work that Semalt piled on your desk in a bit and hopefully offer durable techniques to restore accuracy to your analytics data.
We’ve Blocked Semalt
While you gear up to deal with the Semalt problem, we haven’t left you on your own. As of 8/24/14, we took steps to block Semalt from all sites on our network.
IMPORTANT NOTE: Due to our blocking technique, you can’t use the string semalt.com in any permalink hosted on wpPERFORM.
UPDATE: Semalt isn’t the only spam referrer we see across our network. At the beginning of November 2014, we observed spam referral traffic from buttons-for-website.com, so we blocked it. You can exclude referral traffic from this site reported by Google Analytics using the same techniques described in this article.
As others have described, we can’t be certain those steps will work forever, because Semalt is likely to work to try to find ways around them. The outcome will most likely be an unending and completely unproductive succession of adjustments. It’s a shame that those with the ability to build Semalt can’t put those skills to productive use. Since we can’t be sure that our efforts to block Semalt traffic will be 100% successful, you shouldn’t rely on them exclusively going forward. Hence the need for you to exclude Semalt from Google Analytics.
In blocking or making it easy to ignore Semalt, we’re not alone. WordPress.com was among the first to respond. In February 2014, WordPress.com implemented a technique to address referrer spam in its Stats module for sites hosted there. While the technique is a great step forward, it requires action on your part to identify Semalt as referral spam. Stats are part of Jetpack, but the technique described is not available in the Jetpack plugin for self-hosted WordPress sites as of this writing.
Web analytics provider Clicky received complaints of the “spam scam site Semalt” and decided to ignore Semalt traffic as of 3/17/14.
Wordfence, a very popular WordPress security plugin, outlines ways to block Semalt using Apache’s .htaccess and blogger Charlie Harvey blocked Semalt under Varnish.
We’re not using either of those blocking methods. We’ll happily – but tentatively – report that since blocking Semalt across our network, we’ve seen no traffic from them. Here’s hoping that continues. Posting our method here would only make it easier for Semalt to get around it, so we don’t plan to do that. We’ll provide our method to verifiable, legitimate requesters.
How To Exclude Semalt and Other Spammers From Google Analytics
Let’s dig into the possible methods to exclude Semalt and other spammers from Google Analytics. Not all of them are effective, and each method has limitations. You should probably use all of the methods at the same time to correct both past and future Google Analytics data for the distortion that Semalt introduced. All of the methods start from the Google Analytics home menu:
Create a New View Before Applying New Filters
Several techniques discussed below involve creating View filters.
Since View filters alter the data that’s collected by Google Analytics, we recommend you keep 1 view of all traffic and create a new view with any filters applied. That gives you the ability to look at all data should that need ever arise. There’s a limit of 50 views (profiles) per non-premium account and 200 views (profiles) per premium account, so Google provides ample flexibility to view data in a way that suits your needs.
To create a new Google Analytics view, click the dropdown at the top of the View column on the far right of the Admin menu, and then select the choice to Create new view. The view that’s labeled with your domain name (in our case, wpPERFORM.com) represents all website data subject to whatever filters you applied to it already. You switch between views for a property on the Home menu. While you’re creating a new view, it’s a good idea to rename the initial view from your domain name to All Website Data to better remind you what it really represents.
Your new view will start collecting data once you save it; it won’t include older data that’s included in other views.
Bot Filtering
In late July 2014, Google announced bot and spider filtering for Google Analytics.
This is a single checkbox setting found under the Admin menu in Google Analytics. To perform this task, your user permission must include the Edit permission. To reach the setting, click Admin, select your property in the View column on the far right of the Admin screen, and click View Settings.
The screenshot below shows the single checkbox for bot filtering:
In all situations, we recommend that you check this checkbox to turn on bot filtering. It’s off by default.
Google uses the Interactive Advertising Bureau’s IAB/ABC International Spiders & Bots List to apply this filter, and since the list is only available to members, we don’t know if Semalt was on the list when we did our initial testing in early August. However, in our tests, the bot and spider filter didn’t block Semalt referrer spam. Thus, while it’s a good step to take, it’s not enough.
Exclude Future Semalt Traffic Using a View Filter
Google Analytics View Filters are powerful tools to filter data. Like the checkbox for bot filtering, they’re applied at the View for a property.
To create a View filter to exclude Semalt traffic, click Admin, select your property in the View column on the far right of the Admin screen, and click Filters.
Click the button to create a new filter, assign it a Filter Name of Exclude Semalt, and click the radio button for Custom filter. Set the Filter Field to Referral and the Filter Pattern to semalt.com. Keep the Case Sensitive radio button on the No selection.
Once your settings match the above screenshot, click the Save button to save your View filter.
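One detail worth knowing: the Filter Pattern is treated as a regular expression, so an unescaped dot matches any character. The pattern semalt.com still catches Semalt and its subdomains, but escaping the dot (semalt\.com) is slightly stricter. The Python sketch below shows the difference using Python’s re module, which is close to but not identical to the regex dialect Google Analytics uses. All of the sample referrers are hypothetical.

```python
import re

# Case-insensitive, to mirror leaving Case Sensitive set to No.
LOOSE = re.compile(r"semalt.com", re.IGNORECASE)    # dot matches any character
STRICT = re.compile(r"semalt\.com", re.IGNORECASE)  # dot matches a literal dot

# Hypothetical referrer values for illustration.
referrers = [
    "semalt.com",
    "sub.semalt.com",
    "SEMALT.com",
    "semaltxcom.example",
    "example.org",
]

for referrer in referrers:
    loose_hit = bool(LOOSE.search(referrer))
    strict_hit = bool(STRICT.search(referrer))
    print(f"{referrer:22} loose={loose_hit!s:5} strict={strict_hit}")
```

Either pattern matches subdomains automatically, because the pattern only has to appear somewhere in the referral value.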
While View filters are powerful, they have a big limitation: View filters are not applied retroactively. In situations where you’ll make relatively infrequent changes to filters, View filters are a great way to improve data reliability going forward. Since the consensus outlook isn’t likely to change on the need to exclude Semalt from Google Analytics reporting, this circumstance is a great candidate for a View filter. But because it’s only forward-looking, applying a view filter is not a complete solution to the Semalt problem. Now that you understand how Semalt distorted your data, you need to correct analytics reports for prior periods – before you applied new filters.
Adjust Past and Present Google Analytics Reporting Using Custom Segments
Google Analytics Custom Segments enable you to slice and dice the data that Google Analytics has already collected. That is, filters adjust what is collected in the first place; segments adjust how what was already collected is presented in a report.
You can use custom segments to correct past analytics reports by excluding Semalt traffic. In fact, you can build a custom segment to exclude bad traffic that meets several criteria in addition to Semalt spam referrals.
To create a custom segment, click on Reporting in the main Google Analytics menu, and visit Audience Overview in the left side menu. Click the + Add Segment button at the top of your Google Analytics dashboard and then start building your new segment by clicking the + New Segment button.
Give your new segment a name (such as Exclude Semalt) and select Traffic Sources in the segment builder menu. To maximize reporting over time, click the Filter Sessions button. Set the Source to semalt.com and choose does not contain from the dropdown.
There are some limits to what you can do with Google Analytics custom segments. For example, session data covers the full calendar range available in Google Analytics, but user data only spans a 90 day range. For the purposes of excluding Semalt, we recommend you stick to session data so that the 90 day date limitation won’t apply.
When you’re happy with your custom segment, click the Save button to save it and see it applied to the report shown.
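The difference between a View filter and a custom segment can be sketched in a few lines of Python. This is a simplified model, not how Google Analytics is implemented: the session records, dates, and function names are made up, and the view filter is modeled as taking effect on the date it was added.

```python
from datetime import date

# Hypothetical sessions that were already collected.
stored_sessions = [
    {"date": date(2014, 2, 10), "source": "semalt.com"},
    {"date": date(2014, 2, 10), "source": "google"},
    {"date": date(2014, 7, 3), "source": "sub.semalt.com"},
    {"date": date(2014, 9, 1), "source": "twitter.com"},
    {"date": date(2014, 9, 2), "source": "semalt.com"},
]


def segment_exclude_source(sessions, needle):
    """Session-level condition: Source does not contain <needle>."""
    return [s for s in sessions if needle not in s["source"].lower()]


def view_filter_from(sessions, needle, applied_on):
    """Model a view filter: only sessions on or after applied_on are filtered."""
    return [
        s for s in sessions
        if s["date"] < applied_on or needle not in s["source"].lower()
    ]


with_segment = segment_exclude_source(stored_sessions, "semalt.com")
with_filter = view_filter_from(stored_sessions, "semalt.com", date(2014, 8, 24))

print("segment keeps:", [s["source"] for s in with_segment])
print("filter keeps: ", [s["source"] for s in with_filter])
```

The segment cleans up every date range in the report, while the filter leaves the older Semalt sessions in place. That’s why we suggest using both.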
Our Current Recommendation – You Must Block a Lot More Than Semalt
As the number of spammers has grown, the technique of using a custom segment to filter out only Semalt is far too limited. Therefore, we’ve generalized our recommended technique to create a custom segment that includes only selected hostnames and excludes a number of sources we’ve found to represent spam.
To do this, you should start by identifying good hostnames and bad sources.
To identify good hostnames, follow these steps:
- Visit Acquisition->All Traffic->Channels in your Google Analytics dashboard
- Click the button to set the secondary dimension to Hostname (which is in the Behavior group); type Hostname in the search box and Google Analytics will find it for you automatically
- Click Hostname in the search results returned by Google Analytics to select it
- Adjust the time frame covered by the report to a reasonably long period, such as 1 year
- Increase the number of rows shown (see the bottom of all rows) to minimize the number of pages you have to scroll through
- Review the Hostname column to identify fake traffic
Good hostnames include your own domain, as well as hostnames from external services. External services can include those you’ve set up to use on your site, such as payment gateways, or services used by visitors to your site, such as Google’s translate service. For most sites, the list of good hostnames is relatively short.
To identify bad sources, follow these steps:
- Visit Acquisition->All Traffic->Channels in your Google Analytics dashboard
- Click the button to set the secondary dimension to Source (which is in the Acquisition group); type Source in the search box and Google Analytics will find it for you automatically
- Click Source in the search results returned by Google Analytics to select it
- Adjust the time frame covered by the report to a reasonably long period, such as 1 year
- Increase the number of rows shown (see the bottom of all rows) to minimize the number of pages you have to scroll through
- Review the Source column to identify fake traffic
Fake traffic tends to produce extreme values (i.e., very high or very low) in certain metrics such as Bounce Rate, Pages / Session, or Average Session Duration. In a large report, click the headers of selected metrics to re-sort the data and bring extreme values to the top of the table. The names of the spammers will vary somewhat from site to site, so your review of the list is important. For virtually all sites, the list of bad sources is long, and you’re not likely to catch them all even after careful scrutiny. Our example below will filter out the spammers we see regularly.
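If you export the report, a short script can do a first pass at spotting extreme values. The Python sketch below is a rough heuristic only: the rows, field names, and thresholds are assumptions, and anything it flags is a candidate for manual review rather than a confirmed spammer.

```python
# Hypothetical rows from an exported Source report.
rows = [
    {"source": "google", "sessions": 420, "bounce_rate": 0.48, "avg_duration": 95.0},
    {"source": "semalt.com", "sessions": 37, "bounce_rate": 1.0, "avg_duration": 0.0},
    {"source": "buttons-for-website.com", "sessions": 12, "bounce_rate": 1.0, "avg_duration": 0.0},
    {"source": "twitter.com", "sessions": 3, "bounce_rate": 1.0, "avg_duration": 0.0},
    {"source": "newsletter", "sessions": 58, "bounce_rate": 0.35, "avg_duration": 140.0},
]


def flag_suspicious(report_rows, min_sessions=5, bounce_floor=0.99, duration_ceiling=1.0):
    """Return sources with near-total bounce and near-zero duration.

    Sources with very few sessions are skipped because a handful of real
    visitors can produce the same extreme numbers by chance.
    """
    return [
        row["source"]
        for row in report_rows
        if row["sessions"] >= min_sessions
        and row["bounce_rate"] >= bounce_floor
        and row["avg_duration"] <= duration_ceiling
    ]


print(flag_suspicious(rows))
```

Lowering min_sessions catches more spammers on low traffic sites, at the cost of more false alarms to review by hand.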
Follow the directions above to create a custom segment. However, to make our segment flexible, in this case we’ll base it on a regular expression, or regex. A regex will filter hostnames or sources that match a pattern, so subdomains of the hostname or source will automatically match. That means that you won’t need to update your filter every time a spammer (or a good service!) uses a new subdomain of a filtered domain.
The regex should:
- Include good hostnames, especially your own domain name
- Exclude bad sources, such as known spammers
Both filters can be included in a single custom segment by clicking the + Add Filter button after setting up the first filter. It’s always a good idea to preview the results returned by a filter, particularly when using regular expressions.
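As a rough illustration of how the two filters work together, here’s a Python sketch that builds one regex from a list of good hostnames and another from a list of bad sources, then applies both to sample sessions. These are not the expressions we use in our own account. The payment hostname and all of the sessions are hypothetical, and Python’s regex dialect differs slightly from the one in Google Analytics, so preview any pattern there before relying on it.

```python
import re

# Assumptions for illustration: replace with your own lists.
GOOD_HOSTNAMES = ["wpperform.com", "payments.example.com"]
BAD_SOURCES = ["semalt.com", "buttons-for-website.com"]


def domain_regex(domains):
    """Match any listed domain, or any subdomain of it, case-insensitively."""
    alternatives = "|".join(re.escape(d) for d in domains)
    return re.compile(rf"(^|\.)({alternatives})$", re.IGNORECASE)


INCLUDE_HOSTNAMES = domain_regex(GOOD_HOSTNAMES)
EXCLUDE_SOURCES = domain_regex(BAD_SOURCES)


def is_real_traffic(session):
    """Keep a session only if its hostname is good AND its source isn't bad."""
    return bool(INCLUDE_HOSTNAMES.search(session["hostname"])) and not EXCLUDE_SOURCES.search(
        session["source"]
    )


# Hypothetical sessions.
sessions = [
    {"hostname": "wpperform.com", "source": "google"},
    {"hostname": "www.wpperform.com", "source": "semalt.com"},
    {"hostname": "spam.example", "source": "(direct)"},
    {"hostname": "wpperform.com", "source": "site3.buttons-for-website.com"},
    {"hostname": "notwpperform.com", "source": "google"},
]

kept = [s for s in sessions if is_real_traffic(s)]
print(INCLUDE_HOSTNAMES.pattern)
print(EXCLUDE_SOURCES.pattern)
print(kept)
```

Anchoring each pattern on a leading dot or the start of the string is a design choice: it lets subdomains through without also matching lookalike domains that merely end with the same letters.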
We’ve given our custom segment the appropriate name of Real Traffic (Include Hostnames, Exclude Sources). The screenshot below shows the details:
To edit and use them on your site, here are the regular expressions – including a link to easily import them into your Google Analytics account – for both our hostname and source filters:
For reference, adding just the filter to include selected hostnames reduced our session traffic by about 13%. Adding the filter to exclude spam sources reduced our traffic by a further 2%, or about 15% with both filters applied as shown.
Keep in mind that these filters are for our domain, wpperform.com. You’ll need to replace our domain with yours in the first filter that includes hostnames. The filter to include hostnames will be relatively static. It will only change as new services spring up that deliver traffic to your site. The filter to exclude sources inevitably will be more dynamic. Some will become inactive, but new troublemakers will appear, requiring an update to the list. We’ll update the code above based on traffic we observe, but your list should be built based on the spammers you see in your analytics reports. That means you’ll need to do periodic reviews of the filter’s effectiveness at excluding spammers.
A Note On Setting a Referral Exclusion If You’re Using Universal Analytics
Universal Analytics is the next generation of tracking from Google Analytics, but most sites on wpPERFORM continue to use classic analytics. If you’re using Universal Analytics, you can set a referral exclusion, but it’s not a good solution to stop counting traffic from spam referrers. It merely removes the spam referrer from your list of referrers and turns its traffic into direct traffic – which is an undesirable outcome because your analytics are still messed up.
To reach the setting, click Admin, select your property in the Property column in the middle of the Admin screen, and click Referral Exclusion List in the .js Tracking Info menu.
As Google suggests, referral exclusions are best suited to “exclude traffic from a third-party shopping cart to prevent customers from being counted in new session and as a referral when they return to your order confirmation page after checking out on the third-party site.” That’s not the game spam referrers such as Semalt are playing, so it’s not a good use of this tool.
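To show why a referral exclusion doesn’t clean up your data, here’s a simplified Python model of its effect. It assumes the excluded sessions are simply relabeled as direct traffic, which is the outcome described above. The sessions and function name are made up.

```python
def apply_referral_exclusion(sessions, excluded_domain):
    """Relabel referrals from excluded_domain as direct traffic (simplified model)."""
    result = []
    for session in sessions:
        if session["medium"] == "referral" and session["source"].endswith(excluded_domain):
            session = {**session, "source": "(direct)", "medium": "(none)"}
        result.append(session)
    return result


# Hypothetical sessions.
before = [
    {"source": "semalt.com", "medium": "referral"},
    {"source": "sub.semalt.com", "medium": "referral"},
    {"source": "google", "medium": "organic"},
    {"source": "(direct)", "medium": "(none)"},
]

after = apply_referral_exclusion(before, "semalt.com")

print("sessions before:", len(before), "after:", len(after))
print("direct before:", sum(s["source"] == "(direct)" for s in before))
print("direct after: ", sum(s["source"] == "(direct)" for s in after))
```

The total session count doesn’t change. The spam just moves from the referral list into direct traffic, where it’s harder to spot.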
Looking Ahead
Semalt is referrer spam that distorts the analytics data for your site, but it’s not likely to be the last of its kind to make an appearance on your Google Analytics reports. Read those reports with some skepticism, and explore the sites tracked as referrers by Google Analytics.
You should recognize the domains of those who have linked to your content or easily research them if they’re unfamiliar to you. If you can’t readily establish why a site is listed as a referrer, you may have just stumbled on the next Semalt. The techniques described above can be applied to correct Google Analytics for any bad traffic you deem worthy of exclusion.
We’ll continue to monitor our efforts to block Semalt and make adjustments as necessary. If you have questions about how to filter out Semalt from your Google Analytics data, just get in touch with our support team.
couldnt get the filters to apply correctly as they simply removed all traffic from the GA view or property. But we simply excluded Russia from the view and that seems to have worked for now
thanks a lot for the information. our website is logging many unwanted bots. Have done the changes probably these will be blocked.
Thanks so much for this article. My humble blog is absolutely one of the ‘low traffic sites’ you mention above, and I need what little information I get from analytics to be as accurate as possible. Every other ‘fix’ I had found involved adding code to my site, but this is far more convenient and straight-forward.
Thanks again.
We appreciate the feedback, Kevin. For the average webmaster, changes to a server involve downtime risk, and for the Semalts of the world, those changes are easily circumvented. There are a lot of articles on the web that discuss applying IP blocks to operators such as Semalt, and those types of blocks aren’t effective because Semalt has a seemingly endless supply of IP addresses. That’s why we chose to focus on correcting Google Analytics data. Keep sweating the details and in time it will pay off with more traffic.