How to Block Archive.Org and Erase Web History

Archive.org will remove your website content from its system on request, which is good since it is your content, and you should have that right. Sometimes, when performing reputation management for a client, we must remove the client’s site from the Wayback Machine to delete outdated information from the internet.

For example, we have removed sites that had been hacked and injected with defamatory information about people. In other cases, the sites contained personal information that had been published illegally. Removing the site from Archive.org solves the problem.

Here is how it is done.

Why does everyone think ia_archiver is an archive.org bot?

Because it used to be. According to the now-defunct archive.org exclusion page:

The Internet Archive is not interested in offering access to websites or other Internet documents whose authors do not want their materials in the collection. To remove your site from the Wayback Machine, place a robots.txt file at the top level of your site (e.g. www.yourdomain.com/robots.txt).

The robots.txt file will do two things:

  1. It will remove documents from your domain from the Wayback Machine.
  2. It will tell us not to crawl your site in the future.

To exclude the Internet Archive’s crawler (and remove documents from the Wayback Machine) while allowing all other robots to crawl your site, your robots.txt file should say:

User-agent: ia_archiver
Disallow: /


Ironically, you can still see the defunct exclusion page on the Wayback Machine.

[Image: screenshot of the old archive.org exclusion page]

The ia_archiver method used to work

So you see, adding an ia_archiver disallow rule to your robots.txt file used to be the correct way to stop archive.org from copying your site, but it no longer is. Since only webmasters are supposed to have editing access to a site’s robots.txt file, this seemed like a pretty good way to do it. But then archive.org quietly changed things, and everyone’s content started to be scraped again. Bummer.

If ia_archiver no longer works, what does?

According to archive.org, the best way to remove a site is to send an email to info@archive.org and request they remove it. The exact language they use is:

How can I exclude or remove my site’s pages from the Wayback Machine? You can send an email request for us to review to info@archive.org with the URL (web address) in the text of your message.
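If you have several URLs to submit, it can help to draft the request the same way each time. Here is a minimal Python sketch that only builds the message and does not send it. The subject line, the body wording, and the `build_removal_request` helper are our own assumptions; archive.org only says to include the URL in the text of the message.

```python
from email.message import EmailMessage

# Address given on archive.org's FAQ, as quoted above.
ARCHIVE_CONTACT = "info@archive.org"


def build_removal_request(sender: str, urls: list[str]) -> EmailMessage:
    """Draft (but do not send) an exclusion request listing each URL."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = ARCHIVE_CONTACT
    # Subject and wording are assumptions, not text required by archive.org.
    msg["Subject"] = "Wayback Machine exclusion request"
    url_lines = "\n".join(f"  {url}" for url in urls)
    msg.set_content(
        "Hello,\n\n"
        "Please review the following URL(s) for exclusion from the "
        "Wayback Machine:\n\n"
        f"{url_lines}\n\n"
        "Thank you."
    )
    return msg


if __name__ == "__main__":
    draft = build_removal_request(
        "webmaster@yourdomain.com", ["https://www.yourdomain.com/"]
    )
    print(draft)
```

You can paste the printed draft into your mail client, or hand the message to whatever mail setup you already use.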

But when you send them an email with the requested information, there is no reply, at least not immediately. We tested it and found that there is, in fact, no auto-reply, so it seems a bit like shouting into a hole in the ground.

Why archive.org would want to deal with this manually, instead of just letting webmasters use a robots.txt file to decide whether their content gets copied, is a mystery. It seems a rather tedious solution, if it even works. Some say it works like a charm; others say they’ve sent multiple messages to the email address and have gotten no response weeks or months later.

An email to Internet Archive *does* work

We emailed the Internet Archive. They responded to us about a week later. Below is the email they sent. 

[Image: the email reply from the Internet Archive]
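Once you get a reply, you may want to check for yourself whether snapshots are still being served. The sketch below assumes the Internet Archive's public Wayback availability endpoint at `https://archive.org/wayback/available` and the JSON shape it returns at the time of writing; the helper names are ours. An empty result only means the endpoint reports no snapshot, so treat it as a hint and not as proof of removal.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint: the Wayback availability API.
API = "https://archive.org/wayback/available"


def availability_url(site: str) -> str:
    """Build the query URL for a site such as 'www.yourdomain.com'."""
    return API + "?" + urllib.parse.urlencode({"url": site})


def closest_snapshot(payload: dict):
    """Return the closest snapshot URL from a response, or None."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def check(site: str, timeout: float = 10.0):
    """Query the endpoint over the network and return a snapshot URL or None."""
    with urllib.request.urlopen(availability_url(site), timeout=timeout) as resp:
        return closest_snapshot(json.load(resp))


if __name__ == "__main__":
    print(check("www.yourdomain.com"))
```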

Some say that archive.org_bot may work

Some users suggest switching out the old ia_archiver disallow for a new archive.org_bot disallow. We haven’t been able to verify if this works yet. Many say it doesn’t. If you want to try it, here is the robots.txt info you’ll need:

User-agent: archive.org_bot
Disallow: /
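Whichever user agent you try, it is worth confirming that your robots.txt file reads the way you intend. Here is a minimal Python sketch using the standard library's `urllib.robotparser`. The `ROBOTS_TXT` contents, the `is_blocked` helper, and the example URL are assumptions for illustration. It only shows how a compliant parser interprets the file; it says nothing about whether archive.org actually obeys it.

```python
from urllib.robotparser import RobotFileParser

# Example file that disallows both user agents discussed in this post.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: archive.org_bot
Disallow: /
"""


def is_blocked(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a compliant crawler with this user agent may not fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://www.yourdomain.com/page.html"
    for agent in ("ia_archiver", "archive.org_bot", "Googlebot"):
        print(agent, "blocked:", is_blocked(ROBOTS_TXT, agent, page))
```

With the example file, the two archive user agents are reported as blocked while other robots are still allowed, which matches the intent of the original exclusion instructions.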

You may be able to use your .htaccess file to block archive.org

The Apache web server can use an .htaccess file to store directives. You can find instructions on how to do it here. You’ll need the IP addresses of the archiver bot. You can find the IP addresses of the Archive.org bots here.

We haven’t tried this method, and you’ll need to be a little bit technical to do it. As with anything at the server level, we counsel you to be aware of your limits and to hire a pro if you can’t comfortably manipulate things at the server level.
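To give a rough idea of what such a block could look like, here is a Python sketch that turns a list of IP addresses into .htaccess directives. It assumes Apache 2.4 with `mod_authz_core` and a server configuration that lets .htaccess override access rules. The addresses shown are reserved documentation ranges, not real Archive.org bot IPs, and `render_htaccess_block` is a hypothetical helper. We have not tested the result against the archiver.

```python
import ipaddress


def render_htaccess_block(addresses):
    """Render Apache 2.4 access rules that deny the given IPs or CIDR ranges."""
    lines = ["<RequireAll>", "    Require all granted"]
    for raw in addresses:
        # Validate each entry; ip_network raises ValueError on bad input.
        network = ipaddress.ip_network(raw, strict=False)
        lines.append(f"    Require not ip {network}")
    lines.append("</RequireAll>")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder addresses from the documentation ranges (assumption).
    print(render_htaccess_block(["192.0.2.10", "198.51.100.0/24"]))
```

Validating the addresses first is a small safeguard, since a typo in an .htaccess file can take a whole site offline.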

Is it illegal for archive.org to scrape without permission?

According to the Electronic Frontier Foundation, it is perfectly legal to scrape publicly available content. They cite a Washington, D.C. case and say:

automated tools to access publicly available information on the open web is not a computer crime—even when a website bans automated access in its terms of service.

This even applies if the Terms of Service say explicitly that a user cannot scrape the site. LinkedIn once brought a lawsuit against people scraping their site in violation of their terms of service – and lost. You can find an article about the case here. It says:

[The ruling] holds that federal anti-hacking law isn’t triggered by scraping a website, even if the website owner—LinkedIn in this case—explicitly asks for the scraping to stop.

Using a DMCA notice to remove content from archive.org

You may be able to create a DMCA takedown notice using a generator like this one, and then email the notice to the nice folks at info@archive.org.

We haven’t finished verifying whether this works or not, but we will update this blog post when we do. 

Thanks for reading, and good luck!

FAQs

Is there a connection between Archive.org and Alexa?

Yes. Archive.org and Alexa were created by the same person.

Why did Archive.org stop respecting robots.txt?

The folks at Archive.org said that robots.txt files don’t serve the purpose of an archive site.

How do you remove a site from Archive.org?

According to archive.org, the best way to remove a site is to send an email to info@archive.org and request they remove it.

Is it illegal for archive.org to scrape without permission?

According to the Electronic Frontier Foundation, it is perfectly legal to scrape publicly available content.

 
