I recently read Programming Collective Intelligence by Toby Segaran, published by O’Reilly. Not only does the book detail how some of the algorithms work, but also how they can be applied to real data. The examples span various channels, from online dating to stock market trends, from auction price setting to spam filtering, and from document classification to search rankings. These examples make it clear how the algorithms can be applied, and they keep a read about statistics, data, and algorithms interesting.
The examples in the book are written in Python. The advantage of Python is that its dynamic nature lets the ideas be demonstrated in very little code, and it makes rapid changes to the way the data is processed easy. Even readers not familiar with Python can follow the examples.
Some of the algorithms require heavy processing every time a comparison is made. While these computations can be expensive to execute, the benefit is that little or no training of the system is needed: every new relationship is computed against the whole of the existing data set. One can imagine the sheer amount of processing this eventually requires as the data sets become enormous. Other algorithms require less processing per query but need large training sets to produce meaningful relationships, and some, though not all, must be retrained every time new data is added.
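A user-based recommender of the kind the book's collaborative-filtering chapter covers illustrates the first category: there is no training step, but every query re-scans the whole data set. Here is a minimal sketch using Pearson correlation; the data and function names are illustrative, not the book's code:

```python
from math import sqrt

# Illustrative preference data: user -> {item: rating}
prefs = {
    "Ann":  {"Dune": 5.0, "Tron": 3.0, "Heat": 4.0},
    "Bob":  {"Dune": 4.0, "Tron": 2.0, "Heat": 5.0},
    "Cara": {"Dune": 1.0, "Tron": 5.0, "Heat": 2.0},
}

def pearson(p1, p2):
    """Pearson correlation between two users over the items both rated."""
    shared = [item for item in p1 if item in p2]
    n = len(shared)
    if n == 0:
        return 0.0
    sum1 = sum(p1[i] for i in shared)
    sum2 = sum(p2[i] for i in shared)
    sum1sq = sum(p1[i] ** 2 for i in shared)
    sum2sq = sum(p2[i] ** 2 for i in shared)
    psum = sum(p1[i] * p2[i] for i in shared)
    num = psum - (sum1 * sum2 / n)
    den = sqrt((sum1sq - sum1 ** 2 / n) * (sum2sq - sum2 ** 2 / n))
    return num / den if den else 0.0

def most_similar(user):
    """No training: every call re-compares the user against the whole set."""
    others = [(pearson(prefs[user], prefs[o]), o) for o in prefs if o != user]
    return max(others)

print(most_similar("Ann"))
```

Adding a new user costs nothing up front, which is the appeal, but each lookup grows with the data set, which is the expense the chapter warns about.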
Besides detailing how the data is stored and where the trade-offs between processing and storage lie, Toby speaks to the transparency of how the algorithms function, sometimes even demonstrating how to visualize them with graphics libraries. The last chapter of the book is an overview that provides a great summary of all the topics and algorithms detailed in the earlier chapters.
The chapter on Neural Networks is quite compelling: training builds weighted connections that are combined to arrive at a solution. Producing accuracy requires lots of training, so either plenty of test data is needed or early results should be scrutinized; the chapter also shows how hard it can be to judge a network’s accuracy and to trace the steps it took to arrive at a solution. Another interesting chapter covers Evolutionary Intelligence, which models a survival-of-the-fittest approach: the best parts of one solution are bred and mutated with those of another until the “best” solution for the entire data set is found. While this model can spin through many permutations, it is heavy in processing.
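The breed-and-mutate loop behind the Evolutionary Intelligence idea fits in a few lines of Python. This sketch evolves a string toward a target; the goal, population size, and mutation rate are illustrative assumptions, not the book's example:

```python
import random

random.seed(42)

TARGET = "collective"  # illustrative goal: evolve this string
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def fitness(candidate):
    """Count of characters matching the target (higher is fitter)."""
    return sum(a == b for a, b in zip(candidate, TARGET))

def mutate(candidate, rate=0.1):
    """Randomly replace characters, mimicking mutation."""
    return "".join(random.choice(ALPHABET) if random.random() < rate else c
                   for c in candidate)

def crossover(a, b):
    """Breed two candidates by splicing them at a random point."""
    cut = random.randint(1, len(a) - 1)
    return a[:cut] + b[cut:]

def evolve(pop_size=100, generations=200, elite=0.2):
    """Keep the fittest fraction each generation; fill the rest by breeding."""
    pop = ["".join(random.choice(ALPHABET) for _ in TARGET)
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        if pop[0] == TARGET:
            break
        keep = int(pop_size * elite)
        survivors = pop[:keep]  # survival of the fittest
        children = [mutate(crossover(random.choice(survivors),
                                     random.choice(survivors)))
                    for _ in range(pop_size - keep)]
        pop = survivors + children
    return pop[0]

print(evolve())
```

The processing cost the chapter flags is visible here: every generation re-scores and re-sorts the whole population, and real problems need far larger populations and costlier fitness functions than this toy.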
This book is a great resource for thinking about the possibilities of collecting data and mining it for meaningful information. As the web continues to personalize for better experiences, it’s the mined data that can set one property ahead of another. I’d recommend it for anyone looking to mine specific data and for those looking for an overview of bridging statistics and computation. Readers can then dive deeper into the specifics of an individual algorithm through other channels.
While this is a great overview, readers who choose to implement these algorithms will have to weigh how best to optimize for the correlations they aim to find, the amount of data involved, and the hardware they have allocated to support it.
Nate is currently a Senior Presentation Layer Architect at Razorfish Chicago. As an SPLA, Nate participates in the technology leadership team and resource allocation, manages full-time and contractor resources, represents technology for groups of brands across multiple clients, furthers the development of standards within the office, architects project implementations, and fosters community and mentoring.