Import of the URLs that a user wants to check for indexing
It is useful to give the report a recognizable name by clicking on Dataset name. To save the input dataset for later use, a user can check the Save dataset box.
By clicking on this button, a user proceeds to the miner selection.
Step 2
Miner selection and data collection
In the miner selection section, a user selects Fulltext Index Checker. This miner queries search engines for each given URL using the info: operator. In this way it checks whether the URL is indexed and whether the search engine returns the same URL as the one entered (which also serves as a canonicalization check).
The user then clicks on Get data, which moves them to the data processing section. Depending on its volume, the data is processed in the background, and once processing is complete, the results are emailed to the user.
|Keyword/URL||URLs whose indexing was checked|
|Google Index||Whether the URL is indexed by the search engine: either TRUE (indexed) or FALSE (not indexed)|
|URL in results||The URL that the search engine returned for the query with the info: operator|
|Same as input||Whether the output URL is the same as the input URL. It can also help identify how canonicalization works.|
Non-indexed websites check
In the output, a user should be primarily interested in the column Google Index, which indicates whether the given URL is indexed in the given search engine (TRUE/FALSE). The correct procedure is to filter out the list of non-indexed URLs and use it to find out why they are not shown in the search engine index and how to rectify the situation.
In special cases, the URL returned for the info: operator can differ from the input one. This is a sign that the search engine knows about the given URL but uses its canonical URL in search results. To detect these URLs, see the column Same as input, which returns TRUE if the output URL is the same as the input URL and FALSE if it is not.
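The two checks described above can be sketched in Python. This is only an illustration: the rows are made-up examples that reuse the column names from the output table, and the TRUE/FALSE values are assumed to have already been converted to booleans when the file was loaded.

```python
# Hypothetical rows mirroring the output columns described above;
# the URLs and values are invented for illustration only.
rows = [
    {"Keyword/URL": "http://www.example.com/", "Google Index": True,
     "URL in results": "https://www.example.com/", "Same as input": False},
    {"Keyword/URL": "http://www.example.com/author/jane/", "Google Index": False,
     "URL in results": "", "Same as input": False},
    {"Keyword/URL": "http://www.example.com/blog/", "Google Index": True,
     "URL in results": "http://www.example.com/blog/", "Same as input": True},
]


def not_indexed(rows):
    """Return the input URLs that are not indexed (Google Index is FALSE)."""
    return [r["Keyword/URL"] for r in rows if not r["Google Index"]]


def indexed_under_other_url(rows):
    """Return (input, returned) pairs for indexed URLs whose result differs."""
    return [
        (r["Keyword/URL"], r["URL in results"])
        for r in rows
        if r["Google Index"] and not r["Same as input"]
    ]


print(not_indexed(rows))
print(indexed_under_other_url(rows))
```

The first list is the starting point for finding out why URLs are missing from the index; the second list points to URLs for which the search engine prefers another version.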
The output can then be analyzed with any tool that works with XLSX files. We recommend the step-by-step analysis instructions below:
|Excel instructions||Link to download tools|
|OpenRefine instructions||Link to download tool|
|Tableau Public instructions||Link to download tool|
Examples of a use in practice
Below are procedures that show how this miner can be applied in practice.
Web indexing check, identification and problem solving
The most common application of the miner is checking the indexing of an entire site structure. A user can do so by importing the entire sitemap into Marketing Miner (see previous instructions) and checking the miner box Fulltext Index Checker. After collecting the data, it is necessary to analyze the output, summarize which categories and types of pages in the hierarchy are not being indexed, and find out why.
To locate the critical spots of indexing, the output data has to be analyzed, which in this case we can try in the tool OpenRefine. First, import the output file from Marketing Miner in the section Create Project. In the next step, select the worksheet that OpenRefine will extract data from, specifically the worksheet Data, and create the project. The setting before creating the project can look like this:
Example data source can be downloaded at:
After creating the project, a user can focus on analyzing the critical spots of indexing. First, it is necessary to filter out the URLs that are not indexed by one or another search engine. This is done with a text facet (a grouping of rows by identical cell values) on the particular column, which creates, on the left side of the project, a summary of URL indexing in the given search engine.
The image above shows that the facet was created on the column Google Index and that 118 URLs are indexed by Google and 11 are not. By clicking on the relevant facet value, you can see the particular URLs.
Identification of problematic sections
The above process can be done in any tool that works with spreadsheets, for example Excel. OpenRefine was used primarily because it works efficiently with facets, which let a user inspect how non-indexed pages are distributed across sections. The next step is to divide the URLs into categories and sections.
First, the domain has to be cut off from each URL so that it does not interfere with the outputs. To do so, select Edit cells -> Transform on the URL column and insert the following GREL into the transformation field:
In the example shown above, that means entering value.replace("http://www.podstavec.cz/","") and confirming. This change cuts the domain off every URL in the column.
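For readers who prefer to do this step outside OpenRefine, the same transformation can be sketched in Python. The function name strip_domain and the sample URL are illustrative assumptions; only the domain string comes from the example above.

```python
def strip_domain(url, domain="http://www.podstavec.cz/"):
    """Remove the domain prefix from a URL, like value.replace(domain, "") in GREL."""
    # str.replace removes every occurrence of `domain`, not only a leading one.
    return url.replace(domain, "")


# Hypothetical URL assembled for illustration.
print(strip_domain("http://www.podstavec.cz/author/filip-podstavec/"))
```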
A user then uses a text facet to filter out the URLs not indexed by one of the search engines and, with the data filtered, can start analyzing the non-indexed examples. This is done by selecting Facet -> Custom text facet on the column with URLs and inserting the following GREL:
Here, [SEPARATOR] stands for the separator used in the URLs (if there is one). For the web podstavec.cz it is, for example, a slash, so the final GREL for the above example is value.split("/"). OpenRefine cuts the URLs into segments at the separator and counts how often each segment occurs in the filtered view. This shows the user which sections or categories have indexing problems in the search engines (provided, of course, that the view is currently filtered to URLs not indexed by one of the search engines). Anything can be a separator. The most common separators are:
The output of the above facet example, using the separator "/", is the following table (in descending order by number of occurrences):
The above table shows that the most frequent URL fragment is author, followed by %C5%A1t%C3%ADtky (the URL-encoded form of "štítky", Czech for "tags") and filip-podstavec. The problem, then, is primarily in the author section and the tag section. After clicking on a facet segment, a user can look directly at the URLs involved.
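The segment counting can be approximated in Python as follows. This is a sketch, not OpenRefine's implementation: the paths are invented, empty segments are dropped, and each segment is counted at most once per path.

```python
from collections import Counter
from urllib.parse import unquote


def segment_counts(paths, separator="/"):
    """Count in how many paths each segment occurs, similar to a split-based facet."""
    counts = Counter()
    for path in paths:
        # A set drops empty strings and counts a segment only once per path.
        counts.update({seg for seg in path.split(separator) if seg})
    return counts


# Hypothetical non-indexed paths with the domain already removed.
paths = [
    "author/filip-podstavec/",
    "author/filip-podstavec/page/2/",
    "%C5%A1t%C3%ADtky/seo/",
]

for segment, count in segment_counts(paths).most_common():
    # unquote shows the readable form of percent-encoded segments.
    print(count, unquote(segment))
```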
A short analysis of the above examples revealed the problem. The /author/ section carried a robots noindex directive, which raises the question of whether that is the right setting for these pages. In the case of the tags, the problem was only on the side of Seznam, which is probably not able to work with URLs containing diacritics; alternatively, the cause could be duplicate article content on those URLs.
URL fragments cannot always be used to derive the categorization. Another advantage of OpenRefine is therefore the function Edit column -> Add column by fetching URLs, which lets a user download the source code of the imported URLs and then parse, for example, a breadcrumb trail or a different element that determines the type of page. Instructions on how to do so are here:
Canonicalization check and preferred web versions
Marketing Miner checks URLs in search engines with the help of the info: operator. In specific cases, however, the search engine does not return the given URL but its canonical version. An example can be:
In the first case, Google returns a completely different URL. In the second case, it returns the HTTPS version of the URL. That means the given URL is indexed, but its canonical or redirected version is preferred and is the one returned from the index.
To check such cases, check the box Fulltext Index Checker in Marketing Miner and then click on URL output check. The tool then also checks the URLs returned by the search engines and reports whether each one is the same as the input URL.
In the above output from the miner, we can see the two above-mentioned URLs at the input. The columns Google Index and Seznam Index tell the user whether the given URL is indexed (meaning that something was returned at the output). The columns Same as input indicate whether the search engine output is the same as the input or whether the search engine uses a different version of the URL. The column URL in results shows the actual version of the URL.
Thanks to that, a user can easily analyze whether a search engine uses a different version of some URLs than it should, or check how effective the canonical attributes in use are.
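The comparison between the input URL and the URL returned by the search engine can be illustrated with a small Python sketch. It is not Marketing Miner's implementation; the function name, the labels, and the example URLs are assumptions chosen for illustration.

```python
from urllib.parse import urlsplit


def compare_urls(input_url, result_url):
    """Classify how the URL returned by a search engine relates to the input URL."""
    if not result_url:
        return "not indexed"  # nothing was returned for the input URL
    if input_url == result_url:
        return "same as input"
    a, b = urlsplit(input_url), urlsplit(result_url)
    if (a.netloc, a.path, a.query) == (b.netloc, b.path, b.query):
        return "different scheme"  # for example HTTP at input, HTTPS at output
    return "different URL"  # a canonical or redirected version is preferred


# Hypothetical input/output pairs.
print(compare_urls("http://www.example.com/", "https://www.example.com/"))
print(compare_urls("http://www.example.com/a", "http://www.example.com/b"))
```

Separating the scheme-only case from the rest makes it easier to tell a plain HTTP to HTTPS preference apart from a canonical tag or redirect pointing to a different page.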
Indexing self check
Users can also check the indexing of a URL themselves. Different search engines use different methods:
Google indexing check
The most common and easiest way to check URL indexing in Google is the info: operator. If a user enters it into the search box in combination with a URL, the search engine shows whether that URL is indexed. Example:
The second option is to verify the website in the Google Search Console service, which mediates basic data and information between webmasters and the search engine. It contains a section for submitting a sitemap to the search engine's robot. This section then shows the cumulative number of URLs from the given sitemap that the search engine has indexed. This method does not show the user the exact list of URLs that are or are not indexed, but it can be sufficient as a point of orientation.
For a more exact analysis in Search Console, the sitemaps can be divided by type (e.g. a category sitemap), and a user can then watch what share of URLs the search engine has indexed for each type of content.
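The per-type comparison amounts to simple arithmetic, sketched here in Python. The sitemap names and the submitted and indexed counts are invented for illustration; in practice they would be read manually from the sitemap section of Search Console.

```python
def indexed_share(submitted, indexed):
    """Return the fraction of submitted URLs that are indexed (0.0 if none were submitted)."""
    if submitted == 0:
        return 0.0
    return indexed / submitted


# Hypothetical (submitted, indexed) counts per sitemap type.
sitemaps = {
    "categories": (40, 38),
    "products": (500, 350),
    "articles": (60, 60),
}

shares = {name: indexed_share(s, i) for name, (s, i) in sitemaps.items()}
worst = min(shares, key=shares.get)  # content type with the lowest indexed share

for name, share in sorted(shares.items(), key=lambda item: item[1]):
    print(f"{name}: {share:.0%} indexed")
```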