Does Google (and other companies that scrape

posted by Jason Kottke   Oct 07, 2003

Does Google (and other companies that scrape sites) violate the terms of many weblogs’ Creative Commons licenses? Or for that matter, plain old copyright notices?

Matt Haughey   Oct 07, 2003 at 6:11PM

That’s a noodle scratcher. I would guess that Google would say anyone that wants their content out of Google’s ad-filled search result pages can easily block the googlebot using meta tags.

mark   Oct 07, 2003 at 9:03PM

robots.txt as well…
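For readers unfamiliar with the mechanism, here is a minimal sketch of how a well-behaved crawler might honor a robots.txt file, using Python's standard `urllib.robotparser`. The robots.txt contents, the `example.com` URL, and the `SomeOtherBot` name are made up for illustration; nothing here describes how Google's own crawler is actually implemented.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut out Googlebot, leave every other crawler alone.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /

User-agent: *
Disallow:
"""


def can_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Illustrative URL and bot names, not taken from the discussion.
googlebot_allowed = can_fetch(ROBOTS_TXT, "Googlebot", "http://example.com/post.html")
other_allowed = can_fetch(ROBOTS_TXT, "SomeOtherBot", "http://example.com/post.html")
```

Note that this is purely an opt-out convention: the file only has an effect if the crawler chooses to read and respect it.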

Seldo   Oct 08, 2003 at 6:53AM

Copyright violation is not an opt-out process; you shouldn’t have to ask people not to violate your copyright. Google’s cache is *very* legally dubious, and has been repeatedly questioned. But search engines just can’t work without making copies of pages; all Google does is make its own more visible than most. And search engines are far too useful to make them illegal. So everyone just tiptoes quietly around the issue.

Cory Doctorow   Oct 08, 2003 at 9:42AM

Keep in mind that there’s a lot of copyright that doesn’t belong to the author, but rather to the public: the fair uses, first sale, assistive transcoding rights, etc. A lot of these rights are case-specific. Creative Commons licenses explicitly state that they are a grant of rights in addition to the public’s rights in copyright, not instead of.

The ambiguity in fair use is a feature, not a bug. Before the VCR was invented, no one knew that there was such a thing as a time-shifting use, so no one could possibly have enumerated it on the list of fair uses (but a court was able to legalize the copying of 100 percent of a work in a way that substantially undermined one of the author’s potential revenue sources after a few years of Universal v Sony).

Before the Web, there wasn’t much clue that a Google Cache or an Internet Archive could exist some day, so it isn’t likely that their activities would have been enshrined in fair use before they became a common practice.

Conversely, there’s no good reason to assume that, now that these are common practices, they won’t become enshrined in law, since they provide such a benefit to the general public.

Martin   Oct 08, 2003 at 10:00AM

I posted on The Copydesk about this a long time ago.

As far as I’m concerned, Google indexing my site is fine, but caching content on a server (in the event of my site being down) is a breach of my rights. It’s effectively re-publishing my content without my permission.

However, a simple noarchive in the HEAD should sort them out - but you shouldn’t have to do this - they should ask before they borrow.
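As a rough illustration of the opt-out described above, here is a sketch of how a crawler could check a page for a `noarchive` directive in a robots meta tag before keeping a cached copy. The sample page, the `RobotsMetaParser` class, and the `may_cache` helper are hypothetical names invented for this example; it is not a description of how any real search engine processes pages.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots"> and <meta name="googlebot"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        if name in ("robots", "googlebot"):
            content = attr.get("content") or ""
            # Directives are comma-separated, e.g. "index, follow, noarchive".
            self.directives.update(
                d.strip().lower() for d in content.split(",") if d.strip()
            )


# Hypothetical page whose author has opted out of cached copies.
PAGE = """<html><head>
<title>Example post</title>
<meta name="robots" content="index, follow, noarchive">
</head><body><p>Hello</p></body></html>"""


def may_cache(html: str) -> bool:
    """Return False when the page carries a noarchive directive."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noarchive" not in parser.directives
```

With this sketch, `may_cache(PAGE)` is false, while a page with no robots meta tag is treated as cacheable by default, which is exactly the opt-out arrangement being objected to.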

Cory Doctorow   Oct 08, 2003 at 11:07AM

“Ask before you borrow” is the opposite of fair use. Can you imagine a mechanism whereby something like the Internet Archive or the Google Cache could be built in an ask-first regime?

Martin   Oct 08, 2003 at 11:26AM

It’s not about convenience. It’s about rights. But it’s about power too.

Remember what happened with the Amazon API that looked like Google’s page?

Google went nuts and sent them a cease and desist - they were forced to change the design.

That’s how Google reacts to people republishing their content or re-using their ideas.

Justin   Oct 08, 2003 at 5:15PM

Martin, they can’t index your page without storing it on their computer. I don’t see it as being any different from a browser cache or a proxy server. If you are that worried about your copyrighted material, then perhaps the internet is not the place to be displaying your work.

BTW, your last line is illogical, since that is a different use from a cache.

David Galbraith   Oct 08, 2003 at 5:46PM

Hi, just to clarify - the point was not really to do with copywrite, searching, robots.txt files or the general issue of fair use.
To put it in simple terms: what happens when you read the full text (not headline summary) of articles from a site that makes money from text ads in an aggregator that puts its own text ads alongside? That is, the aggregator takes away the content provider’s revenue and replaces it with its own, even when the content provider says that you can’t use the content for commercial purposes.

David Galbraith   Oct 08, 2003 at 5:47PM

Sorry - copyright - no pun intended

Cory Doctorow   Oct 08, 2003 at 8:26PM

Er, copyright rights *are* about convenience. Or more to the point, about a realpolitik that balances the interests of the public good (things like the general value to us of things like a Google Index) and the interests of rights holders (the moral indignation at the thought of Google making a microcent off of ads on a page that excerpts your work). We rejig copyright all the time, especially when technology and copyright bash into each other — piano rolls, radio, cable TV, VCRs, Internet radio, etc and so forth. If the piano roll compulsory (pay a penny, get the rights to make a recording of any song) isn’t about convenience, then what is it about?

David Galbraith   Oct 08, 2003 at 8:44PM

Sure - I use search engines all the time - and they add value, and I click on text ads because unlike swathes of spam, they are targeted and relevant. But here, fair use and convenience are largely predicated on the fact that search engines drive traffic to content sites, via headlines. The original point about Google was slightly tongue in cheek, i.e. if Google wants to get particular about Adsense then content providers could also get particular.
The point where the balance tips and the relationship doesn’t work is where you have full text content being syndicated and new ads being placed by the aggregator - then there is a problem, and a revenue split between the aggregator and content provider would be logical.

Martin   Oct 09, 2003 at 7:04AM

Cory - my answer about convenience was in relation to your point: “Can you imagine a mechanism whereby something like the Internet Archive or the Google Cache could be built in an ask-first regime?”

I can imagine it - and ultimately, it is the responsibility of people like Google and the Internet Archive to ask before they cache my content, not the other way around.

It’s not their property in the first place.

Web/proxy caching to save on bandwidth is an entirely different matter - that is about convenience, and in the end, such caching aims to deliver web content in the original manner for which it was intended (and usually intact in its original form - free from any bannerage).

It’s not my responsibility to assume that my own material is fair game and up for grabs by big organisations, simply because it’s inconvenient for them to go around asking if they can copy it.

By the very creation and publication of original written material (or otherwise), I have basic legal and moral copyrights to the content I have produced - chief among which is the right for my material not to be reproduced without my permission.

As far as I can see, Google and the Internet Archive breach this right.

Cory Doctorow   Oct 09, 2003 at 8:59AM

“I can imagine it - and ultimately, it is the responsibility of people like Google and the Internet Archive to ask before they cache my content, not the other way around.”

That’s a profoundly ahistorical perspective, and I imagine it will be as effective against today’s changing copyright norms as the Vaudeville artists’ lawsuit against Marconi for his infringing radio was. That is to say: not at all.

Copyright changes whenever a cool new technology makes an old part of copyright law unfeasible. The way that that change takes place is: someone invents a cool new way of infringing — like sound recordings, or cable television, or VCRs, or Internet Radio — and then it becomes wildly popular, and then some body of competent jurisdiction — Congress, a court, the FTC — rewrites copyright law on the strength of the coolness of the new thing.

If you live in America, you have no moral right at all to copyright. There is simply no such thing as a moral right for authors in this country. There never, ever has been. Every court that’s ever been asked to rule on an author’s moral right has ventured an opinion that can be summed up as “Tee hee hee, yeah right, whatever.”

Of *course* they’re infringing your copyright. That’s because it would be impossible to make an Internet Archive or a Google Cache if there was a permission requirement along the way. So they’re doing it in the only way possible, and they’ll face the music when someone who can show actual damages (as opposed to mere indignation) wants to try bringing them to court, and then they’ll fight it up to the Supreme Court and win a new fair use or they’ll have a Congresscritter amend the Copyright statute to explicitly permit archiving.

Or they won’t. And we will have burned down two of the largest, most useful libraries of human endeavor — because of a notional “moral right” that has served, in this instance, to deny posterity to millions of authors by erecting such roadblocks before the creation of an effective Internet library that no one can seriously undertake such an effort.
Unless you are a very old net.hand indeed, your Web-page post-dates the practice of spidering, caching, and copying the Web for archival and search purposes. You knew the rules of the road when you decided to set out on the journey. Why do you think that the rules should be rewritten now to accommodate the norms of some older medium?

