Letting Google crawl all over you: Examining ACAP
Published on April 30, 2009
The Automated Content Access Protocol, or ACAP, has been a hot topic for publishers. It is usually discussed in light of news publishers’ recent conflicts with the search engines, and it came into the news again recently when Google reaffirmed its refusal to adopt the standard.
Many developers recognise that ACAP is about conveying how publishers want their content to be used. What remains less clear, though, is what ACAP actually does, why news publishers are excited about it and why Google and other search engines have been reluctant to implement it.
To better understand ACAP, its goals and its problems, one first has to examine how the web works today and why publishers feel the need to change it.
The current tool: robots.txt and META tags
Right now, when a search engine such as Google or Yahoo! visits a site, it looks for what is known as a robots.txt file. That file tells the search engine’s crawler what it can and cannot index. A sample robots.txt file can be found on Google’s site.
A robots.txt file is essentially a list of allowed and forbidden resources. For example, Google’s robots.txt file allows other search engines to index its “/searchhistory” directory but forbids them from indexing the “/search” directory. If a search engine were to ignore this request it may, depending on the country, be in violation of copyright law.
Robots.txt files do have one basic limitation: they offer only two options, “allow” or “disallow”. While site owners can single out spiders by name, they can do nothing beyond opening or closing the door to each search engine.
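The allow/disallow model can be seen with Python's standard `urllib.robotparser` module. This is only a minimal sketch: the rules and the crawler names `ExampleBot` and `OtherBot` are invented for illustration and are not taken from any real site's file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only. Python's parser applies the
# first matching rule, so the more specific Allow line is listed first.
ROBOTS_TXT = """\
User-agent: ExampleBot
Allow: /searchhistory
Disallow: /search

User-agent: *
Disallow: /
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt content that is already in memory."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Each answer is a plain yes or no; there is nothing in between.
print(parser.can_fetch("ExampleBot", "https://example.com/searchhistory"))
print(parser.can_fetch("ExampleBot", "https://example.com/search"))
print(parser.can_fetch("OtherBot", "https://example.com/searchhistory"))
```

The only thing a crawler learns here is whether a given path is open to it, which is the limitation described above.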
To supplement this, some sites use META tags, which are hidden tags within a page that can provide further instructions to the search engines. These tags can tell the search engines not to follow links on a page or not to cache it.
However, even META tags offer quite a limited set of possibilities. There are only a handful of options between fully allowing search engine crawling and blocking it, and most of them are vague. For example, “disallowing” a search engine could mean preventing its spider from crawling (that is, visiting) the page, or it could mean permitting the visit but forbidding indexing of the content on it.
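As a rough illustration of how a crawler might read these page-level instructions, the sketch below pulls the directives out of a robots META tag with Python's standard `html.parser`. The sample page and the helper name `robots_directives` are assumptions made for the example; `nofollow`, `noarchive` and `noindex` are the commonly used directive values.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = {name.lower(): (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() != "robots":
            return
        # The content attribute is a comma-separated list of directives.
        for token in attr_map.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_directives(html: str) -> set:
    """Return the set of robots directives found in the page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Hypothetical page used only for this example.
PAGE = (
    '<html><head><meta name="robots" content="nofollow, noarchive">'
    "</head><body>Story</body></html>"
)

directives = robots_directives(PAGE)
print("may follow links:", "nofollow" not in directives)
print("may cache:", "noarchive" not in directives)
print("may index:", "noindex" not in directives)
```

Each directive is still a simple on/off switch, so the tags add a few more doors to open or close rather than any finer control.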
What ACAP attempts to do, then, is expand the protocol to provide more fine-tuned technological control over how the search engines can use a publisher’s content.
The advantages of ACAP
ACAP takes the idea behind robots.txt files - explaining to search engines automatically what they can and cannot do with the site - and adds a slew of new instructions, including the following:
- Crawling: Users can clarify their “disallow” statement by forbidding search engines to even visit a page.
- Follow: Tells the search engine to follow or not follow links on the page/directory. Works like the META tag rule but uses a central file.
- Index: This can be set to allow or disallow the search engine to index the content on the page. If indexing is disallowed but the crawling option is still on, search engines can visit the page and follow links, but not save any content on the page.
- Preserve: Allows or disallows search engines to preserve a copy of the exact resource, instead of just indexing the content.
- Present: This deals with caching and can tell search engines whether or not they can present a copy.
All of the above items can have other details added, including a time limit, a limit on the number of characters in any snippet of text displayed and permission to include a thumbnail, among others. Also, webmasters can direct search engines to take down their own pages from the index if there is an error, or request recrawling at designated times.
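To make the difference from a plain allow/disallow switch concrete, the five usage types and two of the qualifiers can be modelled as a small data structure. This is a hypothetical sketch, not ACAP syntax: the class `UsagePolicy`, its field names, the longest-prefix matching rule and the choice of days as the time unit are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class UsagePolicy:
    """Hypothetical model of the five usage types for one path prefix."""

    path_prefix: str
    crawl: bool = True
    follow: bool = True
    index: bool = True
    preserve: bool = True
    present: bool = True
    time_limit_days: Optional[int] = None    # assumed unit
    max_snippet_chars: Optional[int] = None  # None means no limit


def policy_for(path: str, policies: List[UsagePolicy]) -> UsagePolicy:
    """Pick the policy with the longest matching prefix (an assumed rule)."""
    matches = [p for p in policies if path.startswith(p.path_prefix)]
    if not matches:
        return UsagePolicy(path_prefix="/")  # nothing stated: allow everything
    return max(matches, key=lambda p: len(p.path_prefix))


def snippet(text: str, policy: UsagePolicy) -> str:
    """Text a search engine could show; empty if indexing or presenting is off."""
    if not (policy.index and policy.present):
        return ""
    if policy.max_snippet_chars is None:
        return text
    return text[: policy.max_snippet_chars]


# Hypothetical publisher settings.
POLICIES = [
    UsagePolicy("/news/", preserve=False, max_snippet_chars=20),
    UsagePolicy("/news/archive/", index=False),
]

story = "Publishers want finer control over search results."
print(snippet(story, policy_for("/news/2009/story.html", POLICIES)))
print(repr(snippet(story, policy_for("/news/archive/old.html", POLICIES))))
```

In this sketch a story under `/news/` can be indexed and shown with a short snippet but not preserved as an exact copy, while the archive can be visited but not indexed at all.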
The end result of ACAP is that websites, in particular journalism sites, would have complete control over how their content appears in the search engines.
The problem is that, currently, no search engines support ACAP. With Google’s recent refusals, it seems unlikely any will in the near future.
Why search engines dislike ACAP
No search engine currently accepts or follows the ACAP guidelines. Part of the reason is scale: although hundreds of sites have implemented the protocol, that is still a very small number when set against the many thousands of journalism-related sites and billions of other sites on the web.
Where robots.txt is a widely accepted and implemented standard, one that United States courts have also backed in lawsuits against Google over Google Cache, ACAP is largely unknown and unused.
Furthermore, ACAP is much more complicated on the search engine’s end than robots.txt. The extra elements mean that spiders have to work harder to determine the wishes of the site owner and to obey them. When multiplied across millions of sites, the implementation of ACAP could be a very expensive proposition for Google and the other search engines.
Currently, there is little motivation for search engines to take up ACAP. There likely will not be an incentive until more news organisations take up the protocol and begin to actively enforce the terms in it.
In the meantime, as Google points out, robots.txt is adequate for most sites.
This leaves news organisations with the decision of whether or not to adopt ACAP on their own sites.
Implementing ACAP
A simple conundrum faces news organisations. Without widespread adoption of the ACAP standard, it is unlikely search engines will ever agree to use it. Without the search engines agreeing to use it, however, there is no reason to adopt ACAP.
The good news is that a basic ACAP implementation is straightforward. The ACAP organisation provides a simple converter that takes a current robots.txt file and makes it ACAP compliant. But this converter does not add any of the new elements of ACAP to the file. It just translates the existing file into ACAP-compliant language.
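A translation of this kind can be sketched in a few lines. The example below is not the ACAP organisation's converter, and the field names in `FIELD_MAP` are an assumption recalled from ACAP 1.0 material; they should be checked against the specification before any real use.

```python
# Assumed mapping from robots.txt fields to ACAP-style field names.
# Verify these names against the ACAP specification; they are not confirmed here.
FIELD_MAP = {
    "user-agent": "ACAP-crawler",
    "allow": "ACAP-allow-crawl",
    "disallow": "ACAP-disallow-crawl",
}


def translate_robots_txt(text: str) -> str:
    """Rewrite known robots.txt fields with the assumed ACAP field names."""
    out = []
    for line in text.splitlines():
        stripped = line.strip()
        # Blank lines, comments and lines without a field are passed through.
        if not stripped or stripped.startswith("#") or ":" not in stripped:
            out.append(line)
            continue
        field, _, value = stripped.partition(":")
        acap_field = FIELD_MAP.get(field.strip().lower())
        if acap_field is None:
            out.append(line)  # unknown field such as Sitemap: leave unchanged
        else:
            out.append(f"{acap_field}: {value.strip()}")
    return "\n".join(out)


# Hypothetical input file.
EXAMPLE = "User-agent: *\nDisallow: /search\nSitemap: https://example.com/s.xml"
print(translate_robots_txt(EXAMPLE))
```

As with the converter described above, the output says nothing the original file did not already say; none of the new usage types appear unless someone adds them by hand.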

A more advanced ACAP implementation will have to be done by dedicated IT staff capable of taking the implementation guides and turning them into a usable system.
These guides are straightforward, but making that happen will likely require some investment of manpower, especially if the end goal is to use ACAP to its full potential.
The question is whether it is worthwhile to do this in the hope that the search engines will someday accept the standard. That may be a tough sell, especially when news organisations are making cutbacks and dealing with both a slowing economy and the erosion of traditional media.
Important note
It is worth noting that ACAP only deals with how search engines and other indexes interact with sites and their content. Most copyright infringers, such as RSS scrapers and human plagiarists, already ignore robots.txt and META tags. ACAP would not do anything to remedy those issues.
ACAP is not and was never designed to be a copyright protection system. It is merely a means to relay additional information to search spiders that follow the protocol.
—-
Flickr images from users parl and jpctalbot