Tuesday, 27 February 2007

Coupons Coupons (1/2)

Let me see if I’ve got this straight, Coupons Coupons.

Trivia: How many bricks are there in Cinderella Castle? Ten thousand? Twenty-eight thousand? A million?

The answer is none. The massive structure is made of a fibreglass body, moulded to look like bricks and strong enough to withstand severe weather and high winds. It was reportedly constructed to last “forever”.

Spoiler (one of three poetic forms jockeying for hegemony at the turn of the century (soundbite, career)) ahead. You’re better served by skimming this potboiler. But I want to see if I’ve got this straight.

1. Search engines arrange their output in order of ‘relevance’ with respect to a particular query. The most ‘relevant’ results appear at the top of the list and are most likely to be clicked on.

2. Seedy web sites exploit the ways in which relevance is formalised in algorithms, to make themselves appear ‘more relevant than they really are.’ Some of these tactics are known as ‘spamdexing’ (because they ‘spam’ a search engine’s ‘index’).

A seedy web site could include text stuffed with certain words, which doesn’t say much with those words: “christian porn xxx christian porno ladies nakie christian jesus porn erotica trumpet christ porno pornography porn for christians song of songs oral bible xxx saviour hardon lesbian lord bible christian sex porn christian porn xxx christian nice porno ladies xxx porn for christ” for example.

Somebody who spots such text might leave the site immediately. I wouldn’t; I think it’s lovely. Moreover, there are ways of making it invisible to human visitors, but not to the digital daemons which build search engine databases.

3. Trivia: What is in the Cinderella Castle dungeon? Nothing. The only castle with a dungeon is the one at Tokyo Disneyland, which features a walk-thru attraction including mass graves.

4. Google doesn’t just consider the content of a page in establishing its ‘relevance.’ It takes into account how different pages link to each other. The algorithm used is called ‘PageRank.’ On the Google web site it’s described like this: “In essence, Google interprets a link from page A to page B as a vote, by page A, for page B. But, Google looks at more than the sheer volume of votes, or links a page receives; it also analyzes the page that casts the vote. Votes cast by pages that are themselves ‘important’ weigh more heavily and help to make other pages ‘important’.”

5. Aside: There’s a kind of chicken-and-egg / hermeneutic circle problem there. I’m not sure how you determine the rating of any one site, since you need first to know the rating of all the sites linking to it. I think it has to do with making some assumptions (say, that all sites have equal ratings), calculating new ratings based on the link architecture, using these results as your new assumptions for another calculation. Maybe a few iterations. I can’t quite get it clear in my head, but it does seem like different initial assumptions would produce subtly and importantly different orders of priority.
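For my own sanity, here is a toy version of the iteration I’m imagining; the graph, the damping factor and the iteration count are all invented, not anything from the article or from Google. If I’ve understood the damping trick correctly, the numbers settle to (near enough) the same place whatever initial assumptions you feed in, which would make my worry about starting points moot:

```python
# A toy version of the iterative calculation described above. The graph,
# the damping factor and the stopping rule are my own assumptions.

def pagerank(links, damping=0.85, iters=50, start=None):
    pages = list(links)
    # the "initial assumption": equal ratings for every page, unless told otherwise
    rank = start or {p: 1.0 / len(pages) for p in pages}
    for _ in range(iters):
        # each page's new rating: a small flat share, plus a damped share of
        # the ratings of every page that links to it
        rank = {
            p: (1 - damping) / len(pages)
               + damping * sum(rank[q] / len(links[q]) for q in pages if p in links[q])
            for p in pages
        }
    return rank

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["A", "C"]}

print(pagerank(links))
# a deliberately lopsided starting guess ends up in essentially the same place
print(pagerank(links, start={"A": 1.0, "B": 0.0, "C": 0.0, "D": 0.0}))
```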

6. “What we’re witnessing here is an ornate set of gears in the spamdexing machinery, a cluster of interlinking web-pages known as a link-farm.” Bearing in mind that PageRank stuff, here’s the little fr33cko!5y5tem that Coupons Coupons uncovered:

a. Somebody hacks onto some web space and stashes a bunch of “link farm” pages there. The main thing these “link farm” pages do is link to each other.

b. Each page in the farm is optimised for a particular search query. Somebody Googles “zanax tablet” and the page http://lesfargues.com/xanax/zanax-tablet.html comes up, because it has a zillion “link farm” pages linking to it. Each of these supporting pages features the phrase “zanax tablet” at least once, and each is linked to by a large number of pages featuring the phrase “zanax tablet.”

(Well, that’s how it could work. Actually, these pages all feature a piece of code which prevents them from turning up in search results at all. We’ll see why later. Maybe).
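To make (b) concrete, here is a toy farm pushed through the same sort of iteration. The farm size, page names and link pattern are my own inventions, not the architecture Coupons Coupons actually documents; the point is just that a modest clutch of mutually linking pages hoists the target well above a typical supporter.

```python
# A made-up miniature link farm: twenty supporting pages, each linking to the
# target and to the next farm page along.

def pagerank(links, damping=0.85, iters=50):
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iters):
        rank = {
            p: (1 - damping) / len(pages)
               + damping * sum(rank[q] / len(links[q]) for q in pages if p in links[q])
            for p in pages
        }
    return rank

farm = [f"farm-{i}" for i in range(20)]       # the supporting pages
links = {"zanax-tablet": [farm[0]]}           # the target page links back out once
for i, page in enumerate(farm):
    links[page] = ["zanax-tablet", farm[(i + 1) % len(farm)]]

ranks = pagerank(links)
print("target:", round(ranks["zanax-tablet"], 3),
      "a mid-pack farm page:", round(ranks[farm[10]], 3))

# Each supporting page would also carry a line in its <head> telling spiders
# not to list it -- something along the lines of the standard
#   <meta name="robots" content="noindex">
# which I take to be the "simple meta-tag addressed to spiders" quoted under e.
```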

c. Aside: Is the link phrasing important, e.g. does zanax tablet look any different from click here for the purposes of Google’s calculations?

If it did matter, the link farm which Coupons Coupons uncovered would be configured to take advantage of it: “PageRank is determined by the number of links to a single page, and the keywords repeated in the URL, the Header and the Body of (“day next ultram”; see Appendix 5) constitute the search query for which this page is intended to receive a high ranking” (Coupons Coupons). So day-next-ultram.html isn’t just linked to by a lot of pages on which “day next ultram” appears, but linked to by those pages through those very words. This probably simply relates to elegance and orderliness, or ease of automation, but, meh. (Worth noting that while the main ingredient of the relevance calculations, PageRank, is public and patented, there may be further manipulations which are classified as trade secrets. I’ll poke around for more information about that).
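Purely by way of illustration (the phrase and the filename are the article’s example; the rest of the page is my mock-up), the arrangement the quoted passage describes seems to be something like this, with the keywords repeated in the URL, the header, the body, and the anchor text of the incoming link:

```python
# A mock-up of one supporting page and its link, "through those very words".
# Everything here is invented for illustration except the phrase and filename,
# which come from the article's example.

phrase = "day next ultram"
target = "day-next-ultram.html"          # keywords repeated in the URL

supporting_page = f"""<html>
<head><title>{phrase}</title></head>     <!-- keywords in the header -->
<body>
  {phrase} {phrase} {phrase}             <!-- keywords in the body -->
  <a href="{target}">{phrase}</a>        <!-- and in the anchor text itself -->
</body>
</html>"""

print(supporting_page)
```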

d. Someone also hacks around and hides similar “link farm”-style hypertext on other existing, “legitimate” pages (like the Bureau of Public Secrets pages). This is one of the bits I particularly want to get straight: why? Messing with someone else’s content, however invisibly, seems riskier than stashing your shit in their spare capacity. Furthermore, these pages don’t redirect (see e), do they? If they do redirect, they’re likely to be spotted by web masters; if they don’t, they only fulfil one of the functions of the other link farm pages. Why have these pages and their hidden spam data sets at all? Why not just have the link farms?

Perhaps Google runs some incest detection software; the link farm pages described in b might work to create a killer PageRank profile, and the ones hybridised with existing pages might assuage the suspicion of such software by bringing into the calculations a variety of domain names, file sizes and keyword densities. (Or perhaps it is an accident of convention or convenience: maybe it is easier or safer to add to existing pages than to upload new ones?).

(Imagine a simple piece of anti-spamdexing code running random spot checks. It could work like this. Every now and then a page is picked from the top ten results, and its PageRank tested against a hypothetical PageRank(2), generated by considering what the page’s PageRank would have been in a randomly-assembled milieu of pages, rather than the milieu of pages on which the search term appears. Though PageRank(2) will certainly be lower, norms would emerge as to how much lower. Supranormal discrepancies between a page’s PageRank and its PageRank(2) could be flagged for human investigation, or more sophisticated automated diagnosis. This gizmo would probably flag the kind of link farm described in b but not with the addition of the outlying spam data sets mentioned in this section).
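For what it’s worth, here is roughly how I picture that spot check. This is entirely hypothetical, with an arbitrary threshold and a crude stand-in for the “randomly-assembled milieu”; it is certainly not a description of anything Google is known to run.

```python
# A sketch of the hypothetical spot check: compare a page's actual PageRank
# with the PageRank(2) it would earn in a randomly rewired version of the
# same set of pages, and flag suspiciously large gaps.
import random

def pagerank(links, damping=0.85, iters=50):
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iters):
        rank = {
            p: (1 - damping) / len(pages)
               + damping * sum(rank[q] / len(links[q]) for q in pages if p in links[q])
            for p in pages
        }
    return rank

def rewired(links):
    """The 'randomly-assembled milieu': same pages, same number of outgoing
    links per page, but every link re-aimed at a random page."""
    pages = list(links)
    return {p: [random.choice(pages) for _ in links[p]] for p in pages}

def spot_check(links, page, threshold=3.0):
    actual = pagerank(links)[page]
    # average PageRank(2) over a few random draws so one unlucky shuffle
    # doesn't decide the verdict
    baseline = sum(pagerank(rewired(links))[page] for _ in range(10)) / 10
    ratio = actual / baseline
    print(f"{page}: PageRank {actual:.3f}, PageRank(2) {baseline:.3f}, ratio {ratio:.1f}")
    if ratio > threshold:                  # the supranormal discrepancy
        print("-> flag for human investigation")

# run it against the toy farm from the earlier sketch
farm = [f"farm-{i}" for i in range(20)]
links = {"zanax-tablet": [farm[0]]}
for i, page in enumerate(farm):
    links[page] = ["zanax-tablet", farm[(i + 1) % len(farm)]]
spot_check(links, "zanax-tablet")
```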

e. The point of all this is to get a Google user to a “scraper site.” The scraper site makes the hacker her money (rather confusingly, by advertising – through AdSense, a scheme also run by Google – “legitimate” vendors of the drugs in question).

But this cloaking and scraping shizzle confuses me. For example: “[…] link-farm pages never show up in search engine results because of a simple meta-tag addressed to spiders […] [Another script, on a link farm page] is intended to redirect the user to […] a pseudo-search engine containing planted ads linking to online companies.” If link farm pages don’t show up in search engine results, why is the user there in the first place?

(& does Coupons Coupons verily mean, “More on cloaking here:
http://gtresearchnews.gatech.edu/newsrelease/spam-data.htm”?)

My best guess is: when a wee automated fiend from Google visits a link farm page, there is a bit of code waiting for him there to kidnap, blindfold and brainwash him. He is told the link farm page doesn’t exist. He is told that he is instead being taken to the address of the scraper site. Then he is shown around the link farm page. When he reports back to Google, he tells of everything he saw at the link farm page (the keywords, the outgoing links to other link farm pages), but claims it’s all located at the scraper site. (This doesn’t ring true though. Does that mean for every link farm page, there’s a corresponding discrete scraper site address, going to Google in its stead? And, scrape goat. And how is the link structure then preserved, PageRank-apt, if the true locations of everything are hidden? No no no no no).

(What is the point of doing this, rather than just having a bit of code in the link farm page to redirect the user to the scraper site? Presumably because link farms would be easier for Google to detect and root out from its searchable cache? Would scraper sites be any more difficult?)
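Here is my working picture of the cloak-and-redirect trick reduced to its textbook form: a server that checks who is asking. The user-agent test, the addresses and the port are my own assumptions, and the article’s farm apparently uses a client-side script for the redirect rather than anything like this, so treat it as a cartoon of the general idea rather than the actual mechanism.

```python
# A cartoon of cloaking: spiders are shown the keyword-stuffed link-farm page,
# human visitors are bounced straight to the scraper site. All names and
# addresses here are invented for illustration.

from http.server import BaseHTTPRequestHandler, HTTPServer

SCRAPER_URL = "http://example.com/pseudo-search-engine"   # hypothetical scraper site

FARM_PAGE = (b"<html><body>zanax tablet "
             b"<a href='/xanax/day-next-ultram.html'>day next ultram</a>"
             b"</body></html>")

class CloakingHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        agent = self.headers.get("User-Agent", "")
        if "Googlebot" in agent:
            # the wee automated fiend gets the link-farm page itself
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.end_headers()
            self.wfile.write(FARM_PAGE)
        else:
            # a human visitor is sent straight on to the scraper site
            self.send_response(302)
            self.send_header("Location", SCRAPER_URL)
            self.end_headers()

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), CloakingHandler).serve_forever()
```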

This is SUPERB WORK; I think it's the most exciting piece of literary theory I've read since Keston Sutherland's unloved Abu Ghraib essay in Quid. I’d like to characterise my having of Coupons Coupons’ back. Hers is a back I have in the same strained and wistful manner in which an elderly upper class Englishwoman considers herself to be rather splendid. Just to be clear: the Englishwoman’s only preternatural power is a kind of totalising good grace which represses almost to nothing her dissatisfaction at the little mix-up that has landed her in a world that for all its sugar one must admit is very beastly and a bit drab.

I'll have another go at understanding the scraper stuff, & then more comments. Soon I hope.

1 comment:

Kismet Jones said...

The way I read it is this: these link farms exist merely to optimize search engine results. I think that they're false sites & as such have no need to surface in legitimate terms. Hence the cloaking. These farms are not a legitimate linking system because they do not want & have no need to be seen in search results (or if they did, they'd only redirect you to another site & would therefore be seen to do so): they create links as a spider spins its 'invisible' web. On p.8 of the Coupons#Coupons article, for example, the HTML code cited is that used by our old blog template to create links. These links show up in Google searches. A code is applied which renders their manifestation in the Google search null and void, & the web page (i.e., the Google page showing the results) doesn't list it, although the underlying technology does register it. A kind of inverse analogue here would be with 'invisible content' (opening discussion in Coupons article), where code is used to have results appear in the search results page, but not in the originating page. As with link farms, originating page(s) are used to optimize search engine results. However, link farms are set up to send surfers, or drug-seekers, packing to the dispensary (scraper site). Again HTML code is used to address or fool the 'robots' & JavaScript code to redirect. The purpose of the farms seems to be to create invisible or hidden search results which direct the surfer to the scraper site(s).