Letting Google crawl all over you: Examining ACAP
Published on April 30, 2009
The Automated Content Access Protocol, or ACAP, has been a hot topic for publishers. It is usually discussed in light of news publishers’ recent conflicts with the search engines, and it made headlines again when Google reaffirmed its refusal to adopt the standard.
Many developers recognise that ACAP is about conveying how publishers want their content to be used. What remains less clear, though, is what ACAP actually does, why news publishers are excited about it and why Google - as well as other search engines - has been reluctant to implement it.
To better understand ACAP, its goals and its problems, one first has to examine how the web works today and why publishers feel the need to change it.
The current tool: robots.txt and META tags
Right now, when a search engine such as Google or Yahoo! visits a site, it looks for what is known as a robots.txt file. That file tells the search engine’s crawler what it can and cannot index. A sample robots.txt file can be found on Google’s site.
A robots.txt file is essentially a list of allowed and forbidden resources. For example, Google’s robots.txt file allows other search engines to index its “/searchhistory” directory but forbids them from indexing the “/search” directory. A search engine that ignored this request could, depending on the country, be in violation of copyright law.
Robots.txt files do have one basic limitation: they offer only two options, “allow” or “disallow”. While site owners can address spiders by name, they can do nothing beyond opening or closing the door to the search engines.
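To make the allow/disallow model concrete, here is a minimal sketch using Python's standard `urllib.robotparser` module. The rules are a made-up fragment loosely modelled on the example above, not a copy of Google's actual file, and `ExampleBot` is a hypothetical crawler name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, loosely modelled on the example in the text.
ROBOTS_TXT = """\
User-agent: *
Allow: /searchhistory
Disallow: /search
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text directly, without fetching anything."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# The only question a crawler can ask is "may I fetch this path?".
for path in ("/searchhistory/", "/search/results", "/about"):
    verdict = "allowed" if parser.can_fetch("ExampleBot", path) else "disallowed"
    print(path, "->", verdict)
```

Every answer is a plain yes or no for a given path; there is no way to say "you may visit this page but not keep a copy".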
To supplement this, some sites use META tags, which are hidden tags within a page that can provide further instructions to the search engines. These tags can tell the search engines not to follow links on a page or not to cache it.
However, even META tags provide a quite limited set of possibilities. At most, there are only a handful of options between fully allowing search engine crawling and blocking it, and most of them are vague. For example, “disallowing” a search engine could mean preventing its spider from crawling (that is, visiting) the page, or it could mean only that the content on the page may not be indexed.
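As a rough illustration of how a crawler might read these per-page instructions, the sketch below collects the directives from a robots META tag using Python's standard `html.parser`. The sample page and the `RobotsMetaParser` class are invented for this example; `nofollow` and `noarchive` are the commonly used values for "do not follow links" and "do not cache".

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() != "robots":
            return
        # The content attribute is a comma-separated list of directives.
        for token in attr_map.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_directives(html: str) -> set:
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Hypothetical page: links should not be followed and no cached copy kept.
SAMPLE_PAGE = """
<html><head>
<meta name="robots" content="NOFOLLOW, noarchive">
<title>Example story</title>
</head><body><a href="/next">Next</a></body></html>
"""

print(sorted(robots_directives(SAMPLE_PAGE)))
```

Note that the crawler has to fetch and parse the page before it learns any of this, which is one reason a central file is attractive.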
What ACAP attempts to do, then, is expand the protocol to provide more fine-tuned technological control over how the search engines can use a publisher’s content.
The advantages of ACAP
ACAP takes the idea behind robots.txt files - explaining to search engines automatically what they can and cannot do with the site - and adds a slew of new instructions, including the following:
- Crawling: Users can clarify their “disallow” statement by forbidding search engines to even visit a page.
- Follow: Tells the search engine to follow or not follow links on the page/directory. Works like the META tag rule but uses a central file.
- Index: This can be set to allow or disallow the search engine to index the content on the page. If indexing is disallowed but crawling is still allowed, search engines can visit the page and follow links, but cannot save any content on the page.
- Preserve: Allows or disallows search engines to preserve a copy of the exact resource, instead of just indexing the content.
- Present: This deals with caching and can tell search engines whether or not they can present a copy.
All of the above items can carry further details, including a time limit, a maximum number of characters for any snippet of text displayed and permission to include a thumbnail, among others. Webmasters can also direct search engines to take the webmaster's pages down from the index if there is an error, or request recrawling at designated times.
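One way to picture the difference from a single allow/disallow switch is to model the five permissions as separate flags. The sketch below is purely illustrative: `UsagePolicy`, `permitted_actions` and `make_snippet` are hypothetical names, this is not ACAP syntax, and the rule that a disallowed crawl rules out everything else is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class UsagePolicy:
    """Hypothetical model of the permissions described above (not ACAP syntax)."""
    crawl: bool = True
    follow: bool = True
    index: bool = True
    preserve: bool = True
    present: bool = True
    max_snippet_chars: Optional[int] = None  # None means no limit


def permitted_actions(policy: UsagePolicy) -> List[str]:
    # Assumption: if the page may not even be visited, nothing else is possible.
    if not policy.crawl:
        return []
    actions = ["crawl"]
    for name in ("follow", "index", "preserve", "present"):
        if getattr(policy, name):
            actions.append(name)
    return actions


def make_snippet(text: str, policy: UsagePolicy) -> str:
    """Trim displayed text to the publisher's character limit, if any."""
    if not policy.present:
        return ""
    if policy.max_snippet_chars is None:
        return text
    return text[: policy.max_snippet_chars]


# Visit the page and follow its links, but keep none of its content.
links_only = UsagePolicy(index=False, preserve=False, present=False)
# Full access, but snippets limited to 20 characters.
short_snippets = UsagePolicy(max_snippet_chars=20)

print(permitted_actions(links_only))
print(make_snippet("Publishers want finer control over reuse", short_snippets))
```

The `links_only` policy corresponds to the case described under "Index": the spider may visit and follow links but not save anything.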
The end result of ACAP is that websites, in particular journalism sites, would have complete control over how their content appears in the search engines.
The problem is that, currently, no search engines support ACAP. With Google’s recent refusals, it seems unlikely any will in the near future.
Why search engines dislike ACAP
Currently, there are no search engines accepting or following the ACAP guidelines. Part of the reason is that, although hundreds of sites have implemented the protocol, adoption is still very limited when you consider the many thousands of journalism-related sites and billions of other sites on the web.
Where robots.txt is a widely accepted and implemented standard, one that has also been backed by United States courts in lawsuits against Google over Google Cache, ACAP is largely unknown and unused.
Furthermore, ACAP is much more complicated on the search engine’s end than robots.txt. The extra elements mean that spiders have to work harder to determine the wishes of the site owner and to obey them. When multiplied across millions of sites, the implementation of ACAP could be a very expensive proposition for Google and the other search engines.
Currently, there is little motivation for search engines to take up ACAP. There likely will not be an incentive until more news organisations take up the protocol and begin to actively enforce the terms in it.
In the meantime, as Google points out, robots.txt is adequate for most sites.
This leaves news organisations with the decision of whether or not to adopt ACAP on their own sites.
Implementing ACAP
A simple conundrum faces news organisations. Without widespread adoption of the ACAP standard, it is unlikely search engines will ever agree to use it. Without the search engines agreeing to use it, however, there is no reason to adopt ACAP.
The good news is that a basic ACAP implementation is straightforward. The ACAP organisation provides a simple converter that takes a current robots.txt file and makes it ACAP compliant. But this converter does not add any of the new elements of ACAP to the file. It just translates the existing file into ACAP-compliant language.
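A rough sketch of what such a one-to-one translation might look like is shown below. It is not the ACAP organisation's converter. The target field names (`ACAP-crawler`, `ACAP-allow-crawl`, `ACAP-disallow-crawl`) are given as recalled from the ACAP 1.0 robots.txt extension and should be treated as assumptions to check against the specification; `translate_robots_txt` is a hypothetical helper.

```python
# Assumed mapping from classic robots.txt fields to ACAP field names.
# Verify these names against the ACAP specification before relying on them.
FIELD_MAP = {
    "user-agent": "ACAP-crawler",
    "allow": "ACAP-allow-crawl",
    "disallow": "ACAP-disallow-crawl",
}


def translate_robots_txt(robots_txt: str) -> str:
    """Rewrite known robots.txt fields line by line; leave everything else alone."""
    out = []
    for line in robots_txt.splitlines():
        stripped = line.strip()
        # Blank lines, comments and lines without a field are passed through.
        if not stripped or stripped.startswith("#") or ":" not in stripped:
            out.append(line)
            continue
        field, value = stripped.split(":", 1)
        acap_field = FIELD_MAP.get(field.strip().lower())
        if acap_field is None:
            out.append(line)  # unknown field, e.g. Sitemap
        else:
            out.append(f"{acap_field}: {value.strip()}")
    return "\n".join(out)


# Hypothetical input file.
EXAMPLE = """\
# Rules for all crawlers
User-agent: *
Allow: /searchhistory
Disallow: /search"""

print(translate_robots_txt(EXAMPLE))
```

As the text notes, a translation like this adds nothing new: the output expresses exactly the same allow/disallow decisions as the input, only in different vocabulary.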

A more advanced ACAP implementation will have to be done by dedicated IT staff capable of taking the implementation guides and turning them into a usable system.
These guides are straightforward, but there will likely be some investment of manpower in making that happen, especially if the end goal is to use ACAP to its full potential.
The question is whether it is worthwhile to do this in the hope that the search engines will someday accept the standard. That may be a tough sell, especially when news organisations are making cutbacks and dealing with both a slowing economy and the erosion of traditional media.
Important note
It is worth noting that ACAP only deals with how search engines and other indexes interact with sites and their content. Most copyright infringers, such as RSS scrapers and human plagiarists, already ignore robots.txt and META tags. ACAP would not do anything to remedy those issues.
ACAP is not and was never designed to be a copyright protection system. It is merely a means to relay additional information to search spiders that follow the protocol.
—-
Flickr images from users parl and jpctalbot