Letting Google crawl all over you: Examining ACAP

By Jonathan Bailey

Published on April 30, 2009

The Automated Content Access Protocol, or ACAP, has been a hot topic for publishers. In the news it is usually discussed in light of news publishers’ recent conflicts with the search engines. And ACAP recently came into the news again when Google reaffirmed its refusal to adopt the standard.

Many developers recognise that ACAP is about conveying how publishers want their content to be used. What remains less clear, though, is what ACAP does, why news publishers are excited about it and why Google, like other search engines, has been reluctant to implement it.

To better understand ACAP, its goals and its problems, one first has to examine how the web works today and why publishers feel the need to change it.

The current tool: robots.txt and META tags

Right now, when a search engine such as Google or Yahoo! visits a site, it looks for what is known as a robots.txt file. That file tells the search engine’s crawler what it can and cannot index. A sample robots.txt file can be found on Google’s site.

A robots.txt file is essentially a list of allowed and forbidden resources. For example, Google’s robots.txt file allows other search engines to index its “/searchhistory” directory but forbids them from indexing the “/search” directory. A search engine that ignores this request may, depending on the country, be in violation of copyright law.

Robots.txt files do have one basic limitation: they offer only two options, “allow” or “disallow”. While rules can be addressed to individual spiders by name, site owners can do nothing beyond opening or closing the door to the search engines.
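
The allow/disallow behaviour described above can be tried out with the robots.txt parser in Python's standard library. The sketch below is illustrative only: the rules are a hypothetical three-line file modelled on the Google example, not a copy of Google's real robots.txt, and `ExampleBot` is a made-up crawler name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules modelled on the example in the text,
# not Google's actual robots.txt file.
RULES = """\
User-agent: *
Allow: /searchhistory/
Disallow: /search
"""


def build_parser(rules: str) -> RobotFileParser:
    """Load robots.txt rules from a string instead of fetching a URL."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser


parser = build_parser(RULES)

# "ExampleBot" is an invented crawler name; it falls under "User-agent: *".
for path in ("/searchhistory/", "/search", "/about"):
    verdict = "allowed" if parser.can_fetch("ExampleBot", path) else "disallowed"
    print(f"{path}: {verdict}")
```

Python's parser applies the first rule that matches, which is why the more specific Allow line is listed first here; real crawlers may resolve overlapping rules differently. Either way, the only possible answers are "allowed" and "disallowed".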

To supplement this, some sites use META tags, which are hidden tags within a page that can provide further instructions to the search engines. These tags can tell the search engines not to follow links on a page or not to cache it.
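
As a rough illustration of how a crawler might read such a tag, the sketch below pulls the directives out of a robots META tag using Python's standard HTML parser. The class name, function name and sample page are all hypothetical; `nofollow` and `noarchive` are the commonly used values for "do not follow links" and "do not cache".

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            # The content attribute is a comma-separated list of directives.
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def robots_directives(html: str) -> set:
    """Return the set of robots directives declared in a page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Hypothetical page used only for this example.
PAGE = (
    '<html><head><meta name="robots" content="nofollow, noarchive">'
    "</head><body>...</body></html>"
)

print(sorted(robots_directives(PAGE)))
```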

However, even META tags provide quite a limited set of possibilities. At most, there are only a handful of options between fully allowing search engine crawling and blocking it, and most of these options are vague. For example, disallowing search engines could mean preventing spiders from crawling the page, that is, visiting it at all, or it could mean preventing them from indexing the content on it.

What ACAP attempts to do, then, is expand the protocol to provide more fine-tuned technological control over how the search engines can use a publisher’s content.

The advantages of ACAP

ACAP takes the idea behind robots.txt files - explaining to search engines automatically what they can and cannot do with the site - and adds a slew of new instructions, including the following:

     
  • Crawling: Users can clarify their “disallow” statement by forbidding search engines to even visit a page.
  • Follow: Tells the search engine to follow or not follow links on the page/directory. Works like the META tag rule but uses a central file.
  • Index: This can be set to allow or disallow the search engine to index the content on the page. If indexing is disallowed but crawling is still allowed, search engines can visit the page and follow links, but not save any content on the page.
  • Preserve: Allows or disallows search engines to preserve a copy of the exact resource, instead of just indexing the content.
  • Present: This deals with caching and can tell search engines whether or not they can present a copy.
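
One way to picture the list above is as five independent switches per page or directory. The sketch below is a hypothetical Python model of that idea, not ACAP syntax: the class, its field names and the `allowed_actions` helper are invented for illustration, and the rule that nothing is possible once crawling is refused is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UsagePermissions:
    """Hypothetical model of the five usage types listed above."""

    crawl: bool = True
    follow: bool = True
    index: bool = True
    preserve: bool = True
    present: bool = True


def allowed_actions(permissions: UsagePermissions) -> list:
    """List what a compliant search engine could do with a page."""
    if not permissions.crawl:
        # Assumption of this sketch: a page that may not be visited
        # cannot be followed, indexed, preserved or presented either.
        return []
    actions = ["visit"]
    if permissions.follow:
        actions.append("follow links")
    if permissions.index:
        actions.append("index content")
    if permissions.preserve:
        actions.append("preserve copy")
    if permissions.present:
        actions.append("present copy")
    return actions


# Crawling allowed, but nothing on the page may be saved or shown.
visit_only = UsagePermissions(index=False, preserve=False, present=False)
print(allowed_actions(visit_only))
```

With these settings the engine may visit the page and follow its links but keep none of its content, which is the "Index" case described in the list.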

All of the above items can have other details added, including a time limit, a maximum number of characters for any snippet of text displayed and permission to include a thumbnail, among others. Webmasters can also direct search engines to take down their pages from the index if there is an error, or request recrawling at designated times.
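
To make two of these qualifiers concrete, the sketch below shows how a search engine might honour a snippet length limit and a time limit. Both function names and the sample values are hypothetical, and cutting the snippet back to the last whole word is a design choice of this sketch rather than anything ACAP prescribes.

```python
from datetime import date


def make_snippet(text: str, max_chars: int) -> str:
    """Shorten text so the snippet never exceeds max_chars characters."""
    if len(text) <= max_chars:
        return text
    cut = text[:max_chars]
    # Drop the trailing word, which is usually only partly inside the cut.
    head, separator, _ = cut.rpartition(" ")
    return head if separator else cut


def permission_active(today: date, expires: date) -> bool:
    """A time-limited permission holds up to and including its expiry date."""
    return today <= expires


# Hypothetical values used only for this example.
print(make_snippet("The quick brown fox jumps", 12))
print(permission_active(date(2009, 4, 30), date(2009, 5, 31)))
```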

The end result of ACAP is that websites, in particular journalism sites, would have complete control over how their content appears in the search engines.

The problem is that, currently, no search engines support ACAP. With Google’s recent refusals, it seems unlikely any will in the near future.

Why search engines dislike ACAP

No search engine currently accepts or follows the ACAP guidelines. Part of the reason is that, although hundreds of sites have implemented the protocol, adoption is still very limited when you consider the many thousands of journalism-related sites and the billions of other sites on the web.

Where robots.txt is a widely-accepted and implemented standard, one that has also been backed by United States courts in lawsuits against Google in the case of Google Cache, ACAP is largely unknown and unused.

Furthermore, ACAP is much more complicated on the search engine’s end than robots.txt. The extra elements mean that spiders have to work harder to determine the wishes of the site owner and to obey them. When multiplied across millions of sites, the implementation of ACAP could be a very expensive proposition for Google and the other search engines.

Currently, there is little motivation for search engines to take up ACAP. There likely will not be an incentive until more news organisations take up the protocol and begin to actively enforce the terms in it.

In the meantime, as Google points out, robots.txt is adequate for most sites.

This leaves news organisations with the decision of whether or not to adopt ACAP on their own sites.

Implementing ACAP

A simple conundrum faces news organisations. Without widespread adoption of the ACAP standard, it is unlikely search engines will ever agree to use it. Without the search engines agreeing to use it, however, there is no reason to adopt ACAP.

The good news is that a basic ACAP implementation is very simple. The ACAP organisation provides a converter that takes a current robots.txt file and makes it ACAP compliant. But this converter does not add any of the new elements of ACAP to the file. It just translates the existing file into ACAP-compliant language.
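
The sketch below gives a rough idea of what such a one-to-one translation involves. It is not the ACAP organisation's converter: the `translate` function is hypothetical, and the ACAP field names in `FIELD_MAP` are written from memory of the ACAP 1.0 documentation, so they are assumptions that should be checked against the specification before any real use.

```python
# Assumed mapping from robots.txt fields to ACAP fields. The ACAP names
# are recalled from the ACAP 1.0 documentation and are not verified here.
FIELD_MAP = {
    "user-agent": "ACAP-crawler",
    "allow": "ACAP-allow-crawl",
    "disallow": "ACAP-disallow-crawl",
}


def translate(robots_txt: str) -> str:
    """Rewrite each recognised robots.txt line with its assumed ACAP field."""
    translated = []
    for line in robots_txt.splitlines():
        field, separator, value = line.partition(":")
        acap_field = FIELD_MAP.get(field.strip().lower())
        if separator and acap_field and value.strip():
            translated.append(f"{acap_field}: {value.strip()}")
        else:
            # Comments, blank lines, unknown fields and empty values
            # are passed through unchanged.
            translated.append(line)
    return "\n".join(translated)


# Hypothetical input used only for this example.
print(translate("User-agent: *\nDisallow: /search"))
```

As in the text, nothing new is expressed here: the output says exactly what the robots.txt file already said, only in different vocabulary.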

A more advanced ACAP implementation will have to be done by dedicated IT staff capable of turning the implementation guides into a usable system.

These guides are straightforward, but making it happen will likely require some investment of manpower, especially if the end goal is to use ACAP to its full potential.

The question is whether it is worthwhile to do this on the hope that the search engines will someday accept the standard. That may be a tough sell, especially when news organisations are making cutbacks and dealing with both a slowing economy and the erosion of traditional media.

Important note

It is worth noting that ACAP only deals with how search engines and other indexes interact with sites and their content. Most copyright infringers, such as RSS scrapers and human plagiarists, already ignore robots.txt and META tags. ACAP would not do anything to remedy those issues.

ACAP is not and was never designed to be a copyright protection system. It is merely a means to relay additional information to search spiders that follow the protocol.


—-
Flickr images from users parl and jpctalbot



Jonathan Bailey is a writer and webmaster from New Orleans. He graduated with honours from the University of South Carolina with a degree in Journalism and Mass Communications. He is at present an advertising specialist, graphic designer, IT guru and whatever else pays the bills. He became interested in researching and fighting plagiarism after a significant body of his own creative writing was plagiarised. He also runs his own website, Plagiarism Today.


Tags: acap, automated content access protocol, google, internet, search engine
