There is a lot of bad, old, or inaccurate information out in the world about how to block archive.org, also known as "The Wayback Machine," from scraping your site. This is the most accurate information we could find as of this writing. Spoiler alert: the Internet Archive did remove our site once we asked, but the robots.txt method did not work.
ia_archiver is not the Archive.org bot
ia_archiver is a bot for Alexa. It is apparently no longer a bot for archive.org. How do we know? The screenshot below comes from this Alexa webpage.
That means that if you use robots.txt exclusion like this:
User-agent: ia_archiver
Disallow: /
It will not disallow Archive.org (Wayback Machine) but will instead block Alexa from crawling your site.
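If you want to see for yourself which user agents a given robots.txt actually blocks, here is a minimal sketch using Python's standard-library robots.txt parser. The rules below mirror the ia_archiver exclusion shown above; the site URL is just a placeholder.

```python
from urllib.robotparser import RobotFileParser

# The classic (and now ineffective) archive.org exclusion:
rules = """
User-agent: ia_archiver
Disallow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# Only the named agent is disallowed; any other crawler is still permitted.
print(parser.can_fetch("ia_archiver", "https://www.example.com/"))      # False
print(parser.can_fetch("archive.org_bot", "https://www.example.com/"))  # True
```

As the output shows, this rule blocks only the agent literally named ia_archiver, which today means Alexa, not the Wayback Machine.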
Is there a connection between Archive.org and Alexa?
Yes. They were created by the same guy. According to Wikipedia:
"The Wayback Machine was created as a joint effort between Alexa Internet and the Internet Archive when a three-dimensional index was built to allow for the browsing of archived web content," and that "Brewster Kahle founded the archive in May 1996 at around the same time that he began the for-profit web crawling company Alexa Internet."
Why did Archive.org stop respecting robots.txt?
The folks at Archive.org said that robots.txt files don't serve the purpose of an archive site. You can read their post about it here, but one of the important points they claim is:
"Over time we have observed that the robots.txt files that are geared toward search engine crawlers do not necessarily serve our archival purposes."
While they are clearly keen on fulfilling their own purposes, they seem to have overlooked the wishes of website owners who do not want their intellectual property scraped and displayed.
Why does everyone think ia_archiver is an archive.org bot?
Because it used to be. According to the now defunct archive.org exclusion page:
The Internet Archive is not interested in offering access to web sites or other Internet documents whose authors do not want their materials in the collection. To remove your site from the Wayback Machine, place a robots.txt file at the top level of your site (e.g. www.yourdomain.com/robots.txt).
The robots.txt file will do two things:
To exclude the Internet Archive’s crawler (and remove documents from the Wayback Machine) while allowing all other robots to crawl your site, your robots.txt file should say:
User-agent: ia_archiver
Disallow: /
Ironically, you can still see the defunct exclusion page on the Wayback Machine.
It might be assumed that the people at archive.org have changed their minds, and that the Internet Archive is now interested in offering access to web sites and other Internet documents whose authors do not want their materials in the collection.
ia_archiver used to work
So you see, the correct way to stop archive.org from copying your site used to be adding an ia_archiver disallow to your robots.txt file, but it no longer is. Since only webmasters are supposed to have editing access to a site's robots.txt file, this seemed like a pretty good way to do it. But then archive.org quietly changed things, and everyone's content started being scraped again. Bummer.
If ia_archiver no longer works, what does?
How can I exclude or remove my site's pages from the Wayback Machine?
You can send an email request for us to review to email@example.com with the URL (web address) in the text of your message.
But when you send them an email with the requested information, there is no reply, at least not immediately. We tested it and found that there is in fact no auto-reply, so it feels a bit like shouting into a hole in the ground.
Why archive.org would want to handle this manually, instead of just letting webmasters make their own decisions about copying their content via a robots.txt file, is a mystery. It seems a rather tedious solution, if it even works. Some say it works like a charm; others say they've sent multiple messages to the email address and have gotten no response weeks or months later.
An email to Internet Archive *does* work
We emailed Internet Archive. While we did not receive an automatic response, they did respond to us about a week later. Below is the email they sent.
Some say that archive.org_bot may work
Some users suggest switching out the old ia_archiver disallow for a new archive.org_bot disallow. We haven't been able to verify if this works yet. Many say it doesn't. If you want to try it, here is the robots.txt info you'll need:
User-agent: archive.org_bot
Disallow: /
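If you want to hedge your bets, nothing stops you from disallowing both the old and the new user agents in the same robots.txt file, though, as discussed above, archive.org may simply ignore it:

```
User-agent: ia_archiver
Disallow: /

User-agent: archive.org_bot
Disallow: /
```

This costs nothing to add, but treat it as belt-and-suspenders alongside the email request, not as a guaranteed fix.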
You may be able to use your .htaccess file to block archive.org
The Apache web server can use an .htaccess file to store directives. You can find instructions on how to do it here. You'll need the IP addresses of the archiver bot. You can find the IP addresses of the Archive.org bots here.
We haven't tried this method, and you'll need to be a little bit technical to do it. As with anything at the server level, we counsel people to be aware of their limits and to hire a pro if they can't comfortably work at the server level.
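For the curious, an Apache 2.4 .htaccess block of the general shape described above might look like the following. This is an untested sketch: the IP range shown (203.0.113.0/24) is a documentation placeholder, not a real Archive.org address, so you would need to substitute the actual bot IP ranges from the list linked above.

```apache
# Sketch only: deny requests from the archiver's IP ranges (Apache 2.4+).
# 203.0.113.0/24 is a placeholder range reserved for documentation --
# replace it with the real Archive.org bot addresses.
<RequireAll>
    Require all granted
    Require not ip 203.0.113.0/24
</RequireAll>
```

Note that IP-based blocking only holds as long as the bot keeps using the same addresses, which is one more reason to be cautious with this approach.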
Is it illegal for archive.org to scrape without permission?
According to the Electronic Frontier Foundation it is perfectly legal to scrape publicly available content. They cite a Washington DC case and say:
"[Using] automated tools to access publicly available information on the open web is not a computer crime—even when a website bans automated access in its terms of service."
This even applies if the Terms of Service say explicitly that a user cannot scrape the site. LinkedIn once brought a lawsuit against people scraping its site in violation of its terms of service, and lost. You can find an article about the case here. It says:
[The ruling] holds that federal anti-hacking law isn't triggered by scraping a website, even if the website owner—LinkedIn in this case—explicitly asks for the scraping to stop.
Using a DMCA notice to remove content from archive.org
We haven't finished verifying whether this works or not, but we will update this blog post when we do.
Thanks for reading. If you have anything else to add, please do so in the comments section below.