Log Files vs. Next-Gen

Hosted Web Analytics Versus Server Log Files

As the demand for meaningful analytics increases, it continues to prompt new thinking about how data is collected for analysis. For many years, Web server logs, also known as clickstream data, have served as the definitive source of online data. In the beginning, much of their popularity came down to availability: the data was already there.

For today’s needs, “next-generation” hosted analytics and its “snippet” technology offer targeted data collection and superior analysis capabilities when compared to log files. The perceived superiority of snippet technology over log files rests on several key points:

  • Snippets produce much more accurate data than log files;
  • Information collected using snippets has smaller storage requirements; and,
  • Snippet technology facilitates real-time reporting, as opposed to the batch file processing required for log files.

Snippet tracking and log files differ in terms of how they collect data, what specific types of data they collect, data accuracy, and data storage and handling.

Web servers are configured to produce a record, a log, of activity on the sites they serve. These server logs record such things as error and status messages, requests for pages, and transaction details; they were designed to serve the needs of site administrators seeking site metrics. Early on, they were awkwardly re-purposed by marketing researchers who were eager for new sources of site activity information.

Increasingly sophisticated demands for analysis of business (as opposed to site) metrics have continually revealed shortcomings in the use of log files for marketing, promotion and site business needs.

Consider the following quote from the usability-testing firm, User Interface Engineering, describing recent experiments with log files:

“The term ‘data swamp’ can be appropriately applied to Web logs. There’s a lot of data to trudge through before you find the information that can really be of use. We’ve found that Web logs can function as valuable tools for programmers but aren’t really designed to provide information pertinent to a company’s bottom line.”

Source: User Interface Engineering, Eye for Design, February 2001

As the demand for meaningful analytics increased, it prompted new thinking about logs, data, site analysis and analytics. This resulted in a logical progression toward the use of code snippets.

How do these “snippets” work? When a snippet of analytics “tracking” code is read by a visitor’s Web browser, several small JavaScript executables are triggered to run on the visitor’s computer and collect information about the activity of the Website page where the code is embedded. Information collected is non-identifying, meaning it does not share who the visitor is, contact information or anything of that sort. What the information does include are details such as visits, pages viewed, responses to offsite or on-site promotions, and product viewing and purchases.

Snippets also permit the collection of data on user behavior and environment, including what hardware and software are used to view the site. When combined with cookies, tiny text files stored on visitor computers, snippets are able to provide information on pages visited, number of visits, link clicking, and various timing and duration measures.

In contrast, log files collect records of server requests made by visitor browsers. When a visitor comes to a site, each Web page, graphic and some other components are separately requested from the host server and logged as individual requests. It is these requests that make their way into the log files, along with some information about time and other basic factors. Log file analysis will most often take the accumulated log files for a set period of time and process them as a batch. This requires data sorting and often a considerable amount of time to extract the data needed for a specific site or activity analysis.
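The difference between logged requests and actual page views can be sketched in a few lines of Python. This is a minimal illustration only: the article does not name a log format, so the sketch assumes Apache-style Common Log Format lines, and the sample entries, the `summarize` function and the `ASSET_EXTENSIONS` rule are all hypothetical.

```python
import re
from collections import Counter

# Assumed format: Apache-style Common Log Format (the article names no format).
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)

# Hypothetical rule: paths ending in these extensions are page components, not pages.
ASSET_EXTENSIONS = (".gif", ".jpg", ".png", ".css", ".js", ".ico")


def summarize(lines):
    """Batch-process log lines into (total requests, page requests per path)."""
    total_requests = 0
    page_views = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that do not fit the assumed format
        total_requests += 1
        path = match.group("path").split("?", 1)[0]
        if not path.lower().endswith(ASSET_EXTENSIONS):
            page_views[path] += 1
    return total_requests, page_views


# Made-up sample: one visitor loading two pages, plus a graphic and a stylesheet.
SAMPLE_LOG = [
    '192.0.2.10 - - [10/Oct/2001:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '192.0.2.10 - - [10/Oct/2001:13:55:36 -0700] "GET /images/logo.gif HTTP/1.0" 200 4512',
    '192.0.2.10 - - [10/Oct/2001:13:55:37 -0700] "GET /styles/site.css HTTP/1.0" 200 811',
    '192.0.2.10 - - [10/Oct/2001:13:56:02 -0700] "GET /products.html HTTP/1.0" 200 5120',
]

total, pages = summarize(SAMPLE_LOG)
print(total, dict(pages))
```

In this sample the log holds four requests for what a visitor would experience as two page views, which is why raw request counts need sorting and filtering before they say anything about visitor activity.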

The opportunity to track and measure everything that occurs online makes the Internet the marketer’s best friend. Yet the value of any research is questionable if it is based on poor methodology and less-than-accurate data. There are clear signs that log files, which were designed as an administration tool, do not provide the accuracy needed for marketing analysis. Log files have been estimated to miscount site visits by as much as 40%. This is because log files do not distinguish among computer system checks, search engine crawlers (i.e., spiders) and human visitors.

Spiders and “bots” are terms for software programs that “crawl” across the Internet, usually to archive and classify sites and site content for various search engines. Many of the major search tools, and likely many more minor or local engines, use these programs to create search indexes and keyword databases. These types of tools have also become popular among online shoppers to automate price comparisons among Website vendors.

The important thing about these programs is that they often work by requesting Web page information from the server. These requests appear in the server logs as legitimate requests, and log file analysis tools then report them as legitimate visits and page views, even though they are not visits by humans…or customers. Data tags avoid this inflation of visit and visitor counts because most spiders and bots do not “fire” the tags (i.e., execute the JavaScript) embedded in site pages.
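To make the inflation concrete, the sketch below counts requests before and after setting aside those that identify themselves as crawlers. It assumes the log also records a user-agent string (as Apache's Combined Log Format does); the `BOT_MARKERS` list, the function names and the sample records are hypothetical, and matching on substrings is only a rough heuristic, since not every crawler identifies itself.

```python
# Hypothetical substrings; real crawlers vary and some do not identify themselves.
BOT_MARKERS = ("bot", "crawler", "spider")


def looks_like_bot(user_agent):
    """Heuristic check: does the user-agent string contain a known bot marker?"""
    ua = user_agent.lower()
    return any(marker in ua for marker in BOT_MARKERS)


def count_requests(records):
    """Return (all requests, requests not flagged as bots)."""
    total = len(records)
    not_flagged = sum(1 for _, ua in records if not looks_like_bot(ua))
    return total, not_flagged


# Made-up (host, user agent) pairs, as they might be extracted from log lines.
SAMPLE_REQUESTS = [
    ("192.0.2.10", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"),
    ("192.0.2.44", "ExampleBot/1.0 (+http://example.com/bot)"),
    ("192.0.2.11", "Mozilla/5.0 (Windows; U; Windows NT 5.1) Gecko/20020530"),
    ("192.0.2.45", "example-spider/2.1"),
    ("192.0.2.12", "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)"),
]

raw_count, filtered_count = count_requests(SAMPLE_REQUESTS)
print(raw_count, filtered_count)
```

A log analysis tool that skips this kind of filtering would report all five requests as visits, although only three come from ordinary browsers in this made-up sample.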

Another feature of most browsers is the ability to cache visited Web pages. Caching is when Web pages are accessed and then stored in another location, either a user’s hard drive or a third-party server. Caching eliminates the need to re-download pages from the Website server when they are subsequently requested by a visitor. Server log files only record requests that reach the server. If a page is served from a cache, the request will not reach the Web server and will therefore not be recorded. In other words, the log won’t record each time a page is viewed from a cached browser link, bookmark or other means after the original view. The prevalence of any of these caching methods is likely to vary widely depending upon the popularity of a site. Popular sites are more likely to have pages cached and, therefore, to have proportionately more page views uncounted in log files. Even in the best cases, server log files cannot accurately count page views. There are too many variables.

Next-generation snippet analytics has the advantage of being embedded within the page itself, whether the page is accessed directly from the host Website or from a cache. Therefore, the data tag will “fire” (allowing the recording of a visit and other pertinent information) every time the Web page it is on is accessed, whether from a cache or from the host server. Thus, in this regard, counts of page views or visits based on data tags will be significantly more accurate than log file analysis.
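The caching effect can be illustrated with a small simulation. This is a deliberately simplified sketch, not a model of any real browser: it assumes a cached page is never re-requested or revalidated, and that the data tag fires on every view. The `simulate` function and the sample views are hypothetical.

```python
def simulate(view_sequence):
    """Replay page views through a per-visitor cache.

    Returns (tag_hits, server_log_entries).
    Simplifying assumptions: a cached page is never re-requested from the
    server, and the embedded data tag fires on every view.
    """
    cache = set()       # (visitor, page) pairs already stored locally
    server_log = []     # requests that actually reach the Web server
    tag_hits = 0        # views reported by the embedded data tag
    for visitor, page in view_sequence:
        tag_hits += 1   # the tag travels with the page, cached or not
        if (visitor, page) not in cache:
            server_log.append((visitor, page))
            cache.add((visitor, page))
    return tag_hits, len(server_log)


# Made-up browsing session for two visitors.
VIEWS = [
    ("visitor-a", "/index.html"),
    ("visitor-a", "/products.html"),
    ("visitor-a", "/index.html"),   # back button: served from cache
    ("visitor-b", "/index.html"),
    ("visitor-a", "/index.html"),   # bookmark revisit: served from cache
]

tag_count, logged_count = simulate(VIEWS)
print(tag_count, logged_count)
```

Under these assumptions the tag reports all five views while the server log records only three requests; the two cached revisits never reach the server.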

Any discussion of the relative merits of hosted analytics vs. log files has to address back-end issues such as performance and data storage. Log files have performance and collection costs: collecting and storing them demands processing cycles and memory from Web servers.

For large sites with millions of hits a day, the cost of these performance decrements can be considerable. The costs are compounded in the case of large Websites with multiple-server configurations, which create numerous silos of log data that must be aggregated and cleaned before analysis can take place. One should also not forget the effort and resources necessary to analyze this data, which, of course, does not take place in real time. This fact alone raises questions about the usefulness of log file data; for time-sensitive factors such as the launch of new products and campaigns, it seems unreasonable to have to wait to receive reports on “old” data.
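As a rough sketch of the aggregation step, the Python below merges per-server logs into a single time-ordered stream before any analysis runs. It assumes Apache-style bracketed timestamps and that each server's log is already in chronological order; the server names, sample lines and helper functions are hypothetical.

```python
import heapq
import re
from datetime import datetime

# Assumed format: Apache-style timestamp in square brackets.
TIMESTAMP = re.compile(r"\[([^\]]+)\]")


def timestamp_of(line):
    """Parse the bracketed timestamp of a log line into a datetime."""
    raw = TIMESTAMP.search(line).group(1)
    return datetime.strptime(raw, "%d/%b/%Y:%H:%M:%S %z")


def merge_server_logs(*server_logs):
    """Merge logs that are each already in time order into one ordered list."""
    return list(heapq.merge(*server_logs, key=timestamp_of))


# Made-up logs from two hypothetical servers behind the same Website.
WEB1 = [
    '192.0.2.10 - - [10/Oct/2001:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '192.0.2.12 - - [10/Oct/2001:13:57:10 -0700] "GET /cart.html HTTP/1.0" 200 1800',
]
WEB2 = [
    '192.0.2.11 - - [10/Oct/2001:13:56:02 -0700] "GET /products.html HTTP/1.0" 200 5120',
]

merged = merge_server_logs(WEB1, WEB2)
for line in merged:
    print(line)
```

Even this simple merge has to run after the fact, once every server's log for the period is available, which is part of why log-based reports lag behind the activity they describe.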

Hosted analytics eliminates many of the problems associated with storage and performance. For instance, the VisiStat hosted analytics solution eliminates the need to store, organize and prep log files. In addition, because VisiStat captures the data in real time, the processing needed to capture and organize user data is performed on dedicated servers.

The most significant advantage is that business analysts can look at the data whenever they want and see up-to-the-minute metrics. This is a major consideration when dealing with time-sensitive factors such as advertising campaigns, the launch of new promotions, etc., all of which require quick feedback for forecasting purposes.

Accuracy is paramount when dealing with data collection and analysis. The accuracy of data harvested via log files, however, is endangered by a number of factors such as caching, proxy servers and spiders/bots.

Page snippets represent a strategic method of data collection, a fact that reduces storage requirements, facilitates real-time reporting and ultimately delivers more accurate and useful data that is immediately actionable.

Copyright © VisiStat, Inc. All rights reserved. For permission to reprint this article, please send a request to Editor at VisiStat.com
