May 1, 2008

blekko is hiring

blekko is building a new search engine from scratch and I'm looking to hire a few more coders.

Search is an absolutely fascinating problem to work on for a bunch of reasons. For one thing, you have to scale the thing before getting the first user. You can't just start with a server or two and add more when the users come. Step 1 is to copy the internet onto your cluster. Step 2 is to analyze it.

The componentry is remarkably deep.

Search is like 7 hard problems wrapped into a stack: distributed systems, HTML analytics, text analytics/semantics, anti-spam, AI/ML, frontend/UI. And scale... Apart from the sexy high-end algos there are also the boring 10-year-old system libraries and off-the-shelf tools that crack under stress and sometimes need a look. You open the hood and wonder how the thing ever worked in the first place...

Plus there is always something fresh and new every day, mining through the vast sordidness of the many billions of pages on the web. You expect to be amazed at the endless varieties of crazy porn domains and new approaches to webspam. But there are equal horrors in the small: pathological charset issues, previously-undiscovered abominable server implementations, psychopathic website owners. The web is a reactive fuzz test.

I know there are some great coders out there reading this blog who would have a blast working on some of the pieces here that need to get built. This is a great opportunity to join an experienced team early, building a big system from the ground up. If you think you might be interested, send me an email and we can chat.

fyi our interviews always have coding tests. Primarily we are looking for folks who love to write code and are good at it. :)

How Fake Luxury Conquered the World

The legend says that once upon a time there was a General Motors. This General Motors, GM for short, had a car and a brand for every need, along the plan developed by the great Alfred Sloan prior to the Second World War. There were Chevrolets for regular folk, Pontiacs for the cautious old people (and, thanks to John Z. Delorean's development of the 1964 GTO, for angry young people as well), Buicks and Oldsmobiles for doctors and successful businessmen, and Cadillacs at the very top, for the most successful men in the land.
...
It would have stayed that way forever, but one day a mysterious yet important man at GM had a mysterious yet important idea: Executives should drive cars from their own division!

Which leads to every division of GM building their own version of the Cadillac.

Read more: How Fake Luxury Conquered The World

(thanks Bryn for the tip)

April 24, 2008

Microsoft bias in MSN search results, surprise

I was looking to see what search sites might have a particular bug that I (ahem) came across and was trying the search for the number 0 in various places. There is a pretty good Wikipedia page about zero. Zero has a rich and interesting history and there are many other potentially reasonable results.

But I was surprised to see MSN search had demoted their good results below some crappy ones from MSDN:

Lame! Falling into an inferior lex position and a lower overall relevance page to boost their own network results...give em credit for being old school. :)

...

I found my bug on Yahoo Search. I had tried a lot of smaller engines first because I didn't think a major would have this bug. You can't search for 0 on Yahoo. You can search for all the other numbers, but not 0 ...

Why? Because 0 is false. It suggests Yahoo is using a scripting language to front their search form, one where the string "0" evaluates as false, and a programmer wrote something like if ( $query ) rather than if ( $query ne '' ).
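To make the suspected bug concrete, here is a small Python sketch. It is only a guess at the shape of the problem, not Yahoo's actual code. Python itself treats the string "0" as true, so the Perl-style rule (the empty string and "0" are both false) is modeled explicitly; the function names are made up for illustration.

```python
def perl_style_truthy(value: str) -> bool:
    """Model Perl's truthiness rule for strings: '' and '0' are false."""
    return value not in ("", "0")


def buggy_has_query(query: str) -> bool:
    # Mirrors `if ( $query )`: rejects "0" along with the empty string.
    return perl_style_truthy(query)


def fixed_has_query(query: str) -> bool:
    # Mirrors `if ( $query ne '' )`: only the empty string is rejected.
    return query != ""


for q in ["", "0", "1", "00", "zero"]:
    print(repr(q), buggy_has_query(q), fixed_has_query(q))
```

The two checks disagree only on the query "0", which would explain why every other number is searchable.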

April 22, 2008

Hypertable architecture talk Wednesday in Palo Alto

Doug Judd will be discussing the internals and architecture of Hypertable tomorrow in Palo Alto at 6:30pm.

Hypertable is an open source, high performance, distributed database modeled after Google's Bigtable. It differs from traditional relational database technology in that the emphasis is on scalability as opposed to transaction support and table joining. Tables in Hypertable are sorted by a single primary key. However, tables can smoothly and cost-effectively scale to petabytes in size by leveraging a large cluster of commodity hardware. Hypertable is designed to run on top of an existing distributed file system such as the Hadoop DFS, GlusterFS, or the Kosmos File System (KFS). One of the top design objectives for this project has been optimum performance. To that end, the system is written almost entirely in C++, which differentiates it from other Bigtable-like efforts, such as HBase. We expect Hypertable to replace MySQL for much of Web 2.0 backend technology. In this presentation, Doug will give an architectural overview of Hypertable. He will describe some of the key design decisions and will highlight some of the places where Hypertable diverges from the system described in the Bigtable paper.
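As a rough illustration of what "sorted by a single primary key" buys you, here is a toy in-memory table in Python. It is not Hypertable's API or storage format; the class and method names are invented. The point is only that keeping keys in sorted order makes a range scan a matter of two binary searches.

```python
import bisect


class SortedTable:
    """Toy single-key table: keys stay sorted, so range scans are cheap."""

    def __init__(self):
        self._keys = []      # all keys, kept in sorted order
        self._values = {}    # key -> value

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, key):
        return self._values.get(key)

    def scan(self, start, end):
        """Return (key, value) pairs with start <= key < end, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        return [(k, self._values[k]) for k in self._keys[lo:hi]]


table = SortedTable()
for k in ["d", "a", "c", "b"]:
    table.put(k, k.upper())
print(table.scan("b", "d"))
```

A sorted key space can also be cut into contiguous ranges and spread across machines, which is the general idea behind scaling a table like this out over a cluster.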

More details.

Starbucks "re" branding

It will be interesting to see how the return of Howard Schultz as Starbucks CEO, and the return to their original plan and ideas, turns out.

He's had a successful stunt with the system closing for 3 hours to retrain workers in how to make coffee, which generated a lot of PR.

Now comes the introduction of the new house blend, named after the original Starbucks store. But also, surprise! - the original logo is back.

Usually logos and identities get vaguer, cleaner and more abstract as the MBAs wash/rinse/repeat. Starbucks is going back to the gritty and vaguely obscene logo they launched with.

 

Deadprogrammer famously detailed the history of the Starbucks logo going back to a 15th century woodcut. The original logo was slightly sanitized, but each corporate revision made it more and more abstract and less recognizable as to what it actually was. My wife said "I had no idea there was even anything inside that circle, I had never looked until you pointed it out to me."

Face logos are great brands but they always seem to get watered down and more cartoony over time. This is the case with a lot of the face logos on food at the grocery store; the original versions were closer to actual faces rather than abstract logos (think Chef Boyardee here).

This happened to KFC with the colonel...he started out as a realistic line drawing of Colonel Sanders with the company name - "Kentucky Fried Chicken." After the waves of rebranding stylists were done with him he was an abstract cartoon. They couldn't stop there and abbreviated the company name. You wouldn't want to realize you're eating FRIED CHICKEN when you're at KFC after all. You probably want to be eating a healthy salad with dressing on the side. That's why you went in there, right??

I bet Dunkin' Donuts wishes they could rename themselves "DD". Hmmm, maybe "empty vessel" names aren't so bad after all... :)

Interesting to think about brand identities that get going because they're a little gritty and different and personal, they don't start out whitewashed / washed out, but after getting successful they put on the bland suit. What would the AOL redesigners do to Drudge's site if they bought it?

April 16, 2008

Microsoft "hits back" at Google with re-launch of 4-year old Newsbot

The memecrowd sure has a short memory... maybe I'm just showing my age here, but still.
CNET: Microsoft hits back at Google with Live Search News
Search Engine Land: Microsoft Launches Live Search News
Search Engine Watch: Windows Live Search Offers Google News Alternative

MSN Newsbot? Anyone? From 2004:

CNET: Google News faces Microsoft rival (Jul 27, 2004)
Wash Post: Microsoft Deploys Newsbot To Track Down Headlines (Aug 1, 2004)
Geeking with Greg: MSN Newsbot review (Jul 27, 2004)

Web robot names considered, and rejected

Google's is "Googlebot"
Yahoo's is "Slurp"
Cuill's is "Twiceler"

It makes sense to have a friendly robot user agent, so nervous webmasters won't ban it. You don't want to call your crawler 'sitejacker' or something. Unfortunately my favorite candidates were:

Crawlhammer
Webraker
Lurchy
Client9

hmmm. :-|

"Oh no! It's CrawlHammer!!"

If even in your heart you hide the urls ... there it shall rake for them...

...

Does anyone know what the purpose of a '+' in front of a URL in the robot's user-agent is? Some sites put in the '+', others don't...

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

Mozilla/5.0 (compatible; Ask Jeeves/Teoma; +http://about.ask.com/en/docs/about/webmasters.shtml)

Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)

Mozilla/5.0 (Twiceler-0.9 http://www.cuill.com/twiceler/robot.html)

Gigabot/3.0 (http://www.gigablast.com/spider.html)

April 14, 2008

Cluster map propagation in Amazon Dynamo

Dynamo is Amazon's scalable key/value storage service. The paper is a good read, but I found the way the cluster node list information was propagated in Dynamo to be a little odd. The algorithm is that every 60 seconds a node will talk to another node in the cluster, chosen at random, and exchange update information. I wondered how fast a change would propagate through the cluster, so I simulated the propagation.

For a 5,000 node cluster it takes about 9 update cycles for a change to reach every other node. Since each update is on a 60 second timer, that's 9 minutes for a change to push out.

I didn't do a very sophisticated time model, plus there is random start and all that. So maybe in practice it's a little different. But 9 minutes seems like a long time to propagate a host change out to the rest of the cluster. Maybe I misinterpreted what they're doing?
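For anyone who wants to repeat the experiment, here is a minimal Python sketch of this kind of simulation. It is a reconstruction under stated assumptions, not the code behind the numbers above and not Dynamo's implementation: all nodes tick in lockstep, each node contacts one other node chosen uniformly at random per cycle, the exchange is two-way, and the change starts at a single node. The function name and parameters are made up.

```python
import random


def rounds_to_propagate(n_nodes, seed=None):
    """Count update cycles until a change at node 0 has reached every node.

    Assumes synchronized cycles: what a node learns in a cycle can only be
    passed on in the next one, so news moves at most one hop per cycle.
    """
    rng = random.Random(seed)
    informed = [False] * n_nodes
    informed[0] = True
    count = 1
    rounds = 0
    while count < n_nodes:
        rounds += 1
        snapshot = informed[:]  # who knew at the start of this cycle
        for node in range(n_nodes):
            # Pick a random peer other than ourselves.
            peer = rng.randrange(n_nodes - 1)
            if peer >= node:
                peer += 1
            # Two-way exchange: if either side knew, both know afterwards.
            if snapshot[node] or snapshot[peer]:
                for x in (node, peer):
                    if not informed[x]:
                        informed[x] = True
                        count += 1
    return rounds


if __name__ == "__main__":
    cycles = rounds_to_propagate(5000)
    print(cycles, "cycles, about", cycles * 60, "seconds at one cycle per 60s")
```

The result varies from run to run since peers are picked at random. Changing the exchange to one-way, or staggering the timers, would be the obvious things to try next.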

I recall some confusion about whether Dynamo was actually providing SimpleDB, or if they were two separate software systems. Does anyone know if this was resolved?

April 9, 2008

AppEngine - Web Hypercard, finally

Google's AppEngine is being compared to Amazon's EC2/S3. But Google deserves credit here for coming up with a pretty differently-positioned product. There may be overlap for many users of course, but it's really operating at a whole different level of the stack.

Folks that want/need more control over the environment, the ability to manually manage their own machine instances, run code other than Python, etc. will stay with EC2. EC2 is a step above RackSpace.

But rather than thinking of AppEngine as a step above EC2, instead I think of it somewhere around Myspace. Or "Ning 1.0", as Zoho points out.

In the beginning was GeoCities... No, even further back, in the beginning was Hypercard. Hypercard was a pre-web application for Macs that let you design a "stack" of pages - a website on a floppy, really. Popular stacks got traded far and wide. Hypercard stacks existed for every imaginable purpose - "Time Table of History", games, crossword puzzles, the Bible, etc.

The thing about Hypercard was that it wasn't just static text and images like base html. It had a scripting language, a database, and the Apple UI built-in, so you could create mini applications.

It feels like the web has been trying to claw its way back to the simple utility of Hypercard ever since Mosaic. GeoCities was the first massive-uptake anyone-can-build-here website haven. But it was all static html.

Sure, you can paste javascript widgets onto your page, and have content driven by external sites. But to make the website a first-class object - on functional parity with a "real" website - it needs to be backed by a database and programmability. But setting up MySQL, renting machine space, configuring Linux, programming all the boilerplate, not to mention the scalability issues if your site gets popular -- this is all a big hurdle.

So to hide all those details behind a platform that's easy to get started with, and lower the bar to entry to writing public application websites... Well that's a big deal. Hats off to Google for bringing this to market.

I'm not alone...somewhat similar thoughts from Nate Westheimer...

April 8, 2008

Cuill is banned on 10,000 sites

Be careful while you debug your crawler...

Webmasters these days get very touchy about letting new spiders walk all over their sites. There are so many scraper bots, email harvesters, exploit probers, students running Nutch on gigabit university pipes, and other ill-behaved new search bots that some site owners nervously huddle in forum bunkers anxiously scanning their logs for suspect new visitors, so they can quickly issue bot and IP bans.

Cuill, the search startup from ex-Googlers that is anticipated to launch soon, seems to have run a rather high-rate crawl when they were getting started, which generated a large number of robots.txt bans. Here is a list of sites which have banned Cuill's user-agent "Twiceler".

A well-behaved crawler needs to follow a set of loosely-defined behaviors to be 'polite' - don't crawl a site too fast, don't crawl any single IP address too fast, don't pull too much bandwidth from small sites by e.g. downloading tons of full-res media that will never be indexed, meticulously obey robots.txt, identify itself with a user-agent string that points to a detailed web page explaining the purpose of the bot, etc.
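Two of those rules are easy to sketch in Python with the standard library: checking robots.txt before fetching, and spacing out requests to the same host. This is a minimal sketch, not anyone's production crawler. The robots.txt content, the "examplebot" agent name, the example.com URLs and the HostThrottle class are all made up for illustration, and it throttles per hostname only, not per IP address.

```python
from urllib import robotparser
from urllib.parse import urlsplit

# Made-up robots.txt for illustration: bans one bot outright,
# and keeps everyone else out of /private/.
ROBOTS_TXT = """\
User-agent: Twiceler
Disallow: /

User-agent: *
Disallow: /private/
"""


def load_robots(text):
    """Parse robots.txt text that has already been downloaded."""
    parser = robotparser.RobotFileParser()
    parser.parse(text.splitlines())
    return parser


class HostThrottle:
    """Tracks the earliest time each host may be fetched again."""

    def __init__(self, min_delay=1.0):
        self.min_delay = min_delay
        self.next_ok = {}  # host -> earliest allowed fetch time (seconds)

    def wait_time(self, url, now):
        """Return how long to sleep before fetching url, and book the slot."""
        host = urlsplit(url).netloc
        wait = max(0.0, self.next_ok.get(host, 0.0) - now)
        self.next_ok[host] = now + wait + self.min_delay
        return wait


robots = load_robots(ROBOTS_TXT)
print(robots.can_fetch("Twiceler", "http://example.com/page.html"))
print(robots.can_fetch("examplebot", "http://example.com/page.html"))
print(robots.can_fetch("examplebot", "http://example.com/private/x.html"))
```

A real crawler would call wait_time with the current clock before every fetch and sleep for the returned number of seconds.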

Apart from the widely recognized challenges of building a new search engine, sites like del.icio.us and compete.com that ban all new robots aside from the big 4 (Google, Yahoo, MSN and Ask) make it that much harder for a new entrant to gain a footing. However, the web is so bloody vast that even tens of thousands of site bans are unlikely to make a significant impact in the aggregate perceived quality of a major new engine.

My initial take was that this had to be annoying for Cuill. As a crawler author, I can attest that getting each new site rejection personally hurts. :) But now I'm not so sure. Looking over the list, aside from a few major sites like Yelp, you could argue that getting all the forum SEOs to robots-exclude your new engine might actually help improve your index quality. Perhaps a Cuill robots ban is a quality signal? :)

April 7, 2008

Did Powerset outsource their crawl?

I've been seeing Zermelo, Powerset's crawler, hitting my pages. Sort-of:

ec2-67-202-8-249.compute-1.amazonaws.com - - [28/Mar/2008:23:31:06 -0700] "GET /2006/12/scale_limits_design.html HTTP/1.0" 200 11526 "http://www.skrenta.com/2006/12/i_took_a_ukulele_lesson_once.html" "zermelo Mozilla/5.0 compatible; heritrix/1.12.1 (+http://www.powerset.com) [email:crawl@powerset.com,email:paul@page-store.com]"
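If you want to pull the same clues out of your own logs, a short Python sketch follows. It assumes the line is in the common Apache "combined" layout (host, identity, user, time, request, status, size, referer, user-agent), which is what the entry above looks like; the regex and function names are illustrative, not a complete log parser.

```python
import re

LOG_LINE = (
    'ec2-67-202-8-249.compute-1.amazonaws.com - - [28/Mar/2008:23:31:06 -0700] '
    '"GET /2006/12/scale_limits_design.html HTTP/1.0" 200 11526 '
    '"http://www.skrenta.com/2006/12/i_took_a_ukulele_lesson_once.html" '
    '"zermelo Mozilla/5.0 compatible; heritrix/1.12.1 (+http://www.powerset.com) '
    '[email:crawl@powerset.com,email:paul@page-store.com]"'
)

# Assumed combined-log layout; fields in quotes may not contain a quote.
COMBINED = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return the named fields of one log line, or None if it doesn't match."""
    m = COMBINED.match(line)
    return m.groupdict() if m else None


def contact_emails(agent):
    """Extract addresses written as email:someone@host in a user-agent."""
    return re.findall(r'email:([^\s,\]]+)', agent)


fields = parse_line(LOG_LINE)
print(fields["host"])
print(contact_emails(fields["agent"]))
```

The client hostname gives away the hosting provider, and the contact addresses in the user-agent are where the second domain shows up.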

They're using the open-source Heritrix crawler, running out of Amazon Web Services. But who is page-store.com? From their site:

Vertical search sites are relatively costly to operate. A single vertical search engine may need to sweep all or a large part of the web selecting the pages pertinent to a small set of topics. Startup and operating costs are proportional to the input page set size, but revenue may be only proportional to the size of the selected subset.

Page-store positions itself as a web wholesaler, supplying page and link information to vertical search engine companies on a per-use basis. The effect is to level the playing field between vertical search and general horizontal internet search.

Page-store can provide

  • selected page feeds based on deep web crawls
  • page metadata
  • black-box filters
  • anchor text results
  • link information

Did Powerset outsource their crawl?

March 12, 2008

NFS server %s not responding still trying

:)

March 6, 2008

Who will stop Google from going to 90% market share?

Jason predicts Google going to 90% market share. He makes a solid argument and covers the bases. Referred traffic today suggests Google is at about 85%. Ask just quit the game, and MSN/Yahoo put themselves into a tarpit. So the field is Google's...

The only thing that can change this is new players. A string of uninteresting search attempts and lackluster competition has convinced people that it's impossible to stop Google's ascent.

Google may have a network effect on ads, but the switching costs for the search app itself are small. Easier than switching free email providers. It's just another content site, and users are willing to try new search engines. There just haven't been any interesting new ones to try in a long time.

I was hopeful that Wikia would launch something interesting and break the n-game losing streak of the upstarts, but sadly it was another shallow effort.

I'm rooting for Cuill next. They have a very credible team. Anna built the current version of Google, and now she's working on the next gen. If they launch something interesting in any dimension, they'll show the market that you don't need a million servers and half of the PhDs in the field to build a search app. It takes 20 people and $5M of hardware...if you know what you're doing.

February 27, 2008

The real reason Google's clicks are flat

From SEO Black Hat:

Google reduced the clickable area on Adsense text ads ... Before, a user could click anywhere on the ad and be brought to the destination. After the changes, users have to click on something that looks like a hyperlink.

"The CTR on text ads declined about 60% in the last 2 months with Googles changes, Image ads on the other hand stayed the same."
- January 4th, 2008 Marcus of Plentyoffish.com

4 months later, that little back and forth in the Google Rec Room shaved about $85 Billion (with a B) in market capitalization.

But it wasn't as stupid an idea as it might seem. You see, Adsense works in a Quasi-market place environment. The market will bid up the cost per click once the adjustment for accidental clicks is readjusted. Right now, marketers should be getting a better value per click as a higher percentage of the clicks are "real" or intentional. That will lead to higher bids per click and ultimately should be close to a break even for GOOGs bottom line.

Is the Sky Really Falling?

The problem is that in the interim, GOOG gives almost no guidance to the stock market. Mutual Fund types are really too thick to grasp exactly what's going on, so they think that this "slowing" in the growth has to do with the potential recession affecting GOOG.

Meanwhile, the real story is that Online Advertising Spending will continue to grow at about 30% per year for at least the next 3 years and GOOG is poised to take a disproportionate amount of that growth even if nothing else they do is even marginally successful.
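The break-even argument in the quote is easy to check with arithmetic. Treat revenue as impressions x click-through rate x cost per click. If CTR falls by a fraction d, the price per click has to rise by a factor of 1 / (1 - d) for revenue to stay put. In the Python sketch below only the 60% drop comes from the quote; the impression count, starting CTR and starting CPC are made-up numbers for illustration.

```python
def revenue(impressions, ctr, cpc):
    """Ad revenue: impressions x click-through rate x cost per click."""
    return impressions * ctr * cpc


def breakeven_cpc_multiplier(ctr_drop):
    """Factor by which CPC must rise to offset a fractional drop in CTR."""
    return 1.0 / (1.0 - ctr_drop)


# Hypothetical starting point (not from the quote).
impressions = 1_000_000
ctr_before = 0.02
cpc_before = 0.50

ctr_drop = 0.60  # the decline reported for text ads in the quote
ctr_after = ctr_before * (1.0 - ctr_drop)
cpc_after = cpc_before * breakeven_cpc_multiplier(ctr_drop)

print("before:", revenue(impressions, ctr_before, cpc_before))
print("after: ", revenue(impressions, ctr_after, cpc_after))
print("CPC multiplier needed:", breakeven_cpc_multiplier(ctr_drop))
```

So for a 60% CTR drop, bids would have to go up about 2.5x to fully make up the difference, which is a lot to ask of the market.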

February 22, 2008

Lamport's Bakery Algorithm

This paper describes the bakery algorithm for implementing mutual exclusion. I have invented many concurrent algorithms. I feel that I did not invent the bakery algorithm, I discovered it. Like all shared-memory synchronization algorithms, the bakery algorithm requires that one process be able to read a word of memory while another process is writing it. (Each memory location is written by only one process, so concurrent writing never occurs.) Unlike any previous algorithm, and almost all subsequent algorithms, the bakery algorithm works regardless of what value is obtained by a read that overlaps a write. If the write changes the value from 0 to 1, a concurrent read could obtain the value 7456 (assuming that 7456 is a value that could be in the memory location). The algorithm still works. I didn't try to devise an algorithm with this property. I discovered that the bakery algorithm had this property after writing a proof of its correctness and noticing that the proof did not depend on what value is returned by a read that overlaps a write.

I don't know how many people realize how remarkable this algorithm is. Perhaps the person who realized it better than anyone is Anatol Holt, a former colleague at Massachusetts Computer Associates. When I showed him the algorithm and its proof and pointed out its amazing property, he was shocked. He refused to believe it could be true. He could find nothing wrong with my proof, but he was certain there must be a flaw. He left that night determined to find it. I don't know when he finally reconciled himself to the algorithm's correctness.

...

What is significant about the bakery algorithm is that it implements mutual exclusion without relying on any lower-level mutual exclusion. Assuming that reads and writes of a memory location are atomic actions, as previous mutual exclusion algorithms had done, is tantamount to assuming mutually exclusive access to the location. So a mutual exclusion algorithm that assumes atomic reads and writes is assuming lower-level mutual exclusion. Such an algorithm cannot really be said to solve the mutual exclusion problem. Before the bakery algorithm, people believed that the mutual exclusion problem was unsolvable--that you could implement mutual exclusion only by using lower-level mutual exclusion. Brinch Hansen said exactly this in a 1972 paper. Many people apparently still believe it.

...

For a couple of years after my discovery of the bakery algorithm, everything I learned about concurrency came from studying it. ... The bakery algorithm marked the beginning of my study of distributed algorithms.
    -- Leslie Lamport

I find this story fascinating. Lamport has invented a bunch of cool algorithms. But here he describes having "discovered" the Bakery algorithm, and then spent years studying the algorithm that he had written afterwards.

How many of us find a solution to a problem, and then spend years studying the solution, learning from it? Actually I think I've learned more from studying bugs in my code than algorithms. If I could just avoid ever coding any bugs...

Lamport has done a bunch of other stuff, including inventing Paxos, the distributed consensus algorithm behind Google's distributed lock manager Chubby.
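For the curious, here is roughly what the bakery algorithm looks like, as a Python sketch for a fixed number of threads. Each thread takes a ticket one larger than any it can see, then waits for every thread holding a smaller ticket (ties go to the lower thread index). Each slot in the two arrays is written by only one thread, as in the quote. Caveats: the class and function names are mine, the sketch leans on CPython performing these list reads and writes in program order, and because Python's list reads are already atomic it does not exercise the garbage-read tolerance that Lamport found so remarkable. On real hardware with a weaker memory model an implementation would need memory fences.

```python
import threading
import time


class BakeryLock:
    """Lamport's bakery lock for a fixed set of threads, indexed 0..n-1."""

    def __init__(self, n):
        self.n = n
        self.choosing = [False] * n  # choosing[i]: thread i is picking a ticket
        self.number = [0] * n        # number[i]: thread i's ticket, 0 = not waiting

    def acquire(self, i):
        # Take a ticket one larger than any ticket currently visible.
        self.choosing[i] = True
        self.number[i] = 1 + max(self.number)
        self.choosing[i] = False
        for j in range(self.n):
            if j == i:
                continue
            # Wait until thread j has finished picking its ticket.
            while self.choosing[j]:
                time.sleep(0)
            # Wait while j holds a ticket that sorts before ours;
            # ties on the number are broken by thread index.
            while True:
                nj = self.number[j]
                if nj == 0 or (nj, j) > (self.number[i], i):
                    break
                time.sleep(0)

    def release(self, i):
        self.number[i] = 0


def run_demo(n_threads=3, iterations=100):
    """Hammer a shared counter under the lock; return (counter, max_inside)."""
    lock = BakeryLock(n_threads)
    state = {"counter": 0, "inside": 0, "max_inside": 0}

    def worker(i):
        for _ in range(iterations):
            lock.acquire(i)
            state["inside"] += 1
            state["max_inside"] = max(state["max_inside"], state["inside"])
            current = state["counter"]
            time.sleep(0)  # invite a thread switch in the middle of the update
            state["counter"] = current + 1
            state["inside"] -= 1
            lock.release(i)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["counter"], state["max_inside"]
```

Calling run_demo() should return a counter equal to n_threads x iterations and a max_inside of 1 if the lock is doing its job. Note that nothing in BakeryLock uses a lower-level lock or an atomic read-modify-write, which is the whole point.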

February 21, 2008

Nobody is really smart enough to program computers

Fully understanding an average program requires an almost limitless capacity to absorb details and an equal capacity to comprehend them all at the same time. The way you focus your intelligence is more important than how much intelligence you have.

At the 1972 Turing Award lecture, Edsger Dijkstra delivered a paper titled "The Humble Programmer." He argued that most of programming is an attempt to compensate for the strictly limited size of our skulls. The people who are best at programming are the people who realize how small their brains are. They are humble. The people who are the worst at programming are the people who refuse to accept the fact that their brains aren't equal to the task.

The purpose of many good programming practices is to reduce the load on your gray cells. You might think that the high road would be to develop better mental abilities so you wouldn't need these programming crutches. You might think that a programmer who uses mental crutches is taking the low road. Empirically, however, it's been shown that humble programmers who compensate for their fallibilities write code that's easier for themselves and others to understand and that has fewer errors.
      -- Jeff Atwood, Coding Horror

February 20, 2008

Leak proof

So for now, my advice is this: don't start a new project without at least one architect with several years of solid experience in the language, classes, APIs, and platforms you're building on. If you have a choice of platforms, use the one your team has the most skills with, even if it's not the trendiest or nominally the most productive. And when you're designing abstractions or programming tools, go the extra mile to make them leak proof.
    -- Joel on Software

February 19, 2008

Code must be nurtured

Here's a theory of software quality for you: software must be nurtured. The existence of bugs isn't mysterious to any honest programmer. They are the product of neglect. Finding a bug in one's code isn't so much a surprise as a feeling of deja vu. Ohhhh yesssss, I remember thinking I should check that condition. Programmers have complete control over the quality of their code and, when working on code they care about, tend to produce things that work. The secret is to care for the programmers, so that they take good care of the software.
      -- Coderspiel

February 17, 2008

Quote week

"There's something deep in software development that not everyone gets but the people at Bell Labs did. It's the undercurrent of "the New Jersey Style", "Worse is Better", and "the Unix philosophy" - and it's not just a feature of Bell Labs software either. You see it in the original Ethernet specification where packet collision was considered normal.. and the same sort of idea is deep in the internet protocol. It's deep awareness of design ramification - a willingness to live with a little less to avoid the bigger mess and a willingness to see elegance in the real rather than the vision."
      -- Michael Feathers, Beautiful Code blog

February 12, 2008

Amazon is the Google of buying stuff

I went into a little corner non-chain convenience store by my house (the "Devonshire Little Store") for some milk and noticed a big plastic tub of Dubble Bubble at the cash register.

Folks I've worked with know that I have a thing for gum.

I was doing the math at $0.10/piece, but then figured "what the heck" and asked if I could buy the whole bucket. That seemed to piss off the store owner.

He said he would only sell me $5 worth at $0.10/piece. The bucket said "180ct" and was about 1/3 down. I tried to chat him up. "You can't get this at Costco, can you?" "No, not at Costco." He wouldn't tell me where he got the Dubble Bubble buckets wholesale.

An hour later, I'd chewed through half of my stash and was thinking there had to be a better way to get quantity gum.

Enter Amazon. I'm happy to say that 1,260 pieces of Dubble Bubble (in various-sized plastic buckets) are now on their way to me. I'll have them tomorrow.

Recently I've found that my online purchasing has increased, and consolidated, through Amazon. I did 80% of my Christmas shopping through Amazon. I've bought scissors, wall thermometers, toys, video games, a camera, a bunch of DVDs, and of course books...

A couple of things I've had to go outside to get: MREs (eBay), and a coffee machine for the office (amzn didn't have the model we were looking for). But if amzn has it, I use them to buy it.

Give credit to Bezos ... he's built the best ecommerce fulfillment platform in the business. One-click purchasing, Amazon Prime, reviews, "Where's my stuff?", multiple credit cards and shipping addresses on file in my account... it all just works.

And with their merchants, they offer just about everything.

When I go somewhere else on the web to buy stuff it's invariably a rude shock. Basic gaps in the checkout process. Delayed or missing order confirmation emails. Bozo shipping policies. Stuff that I don't have to worry about with Amazon.

When I want to know something, I go to Google.

But when I want to buy something, I go to Amazon.

