Do not buy a car if you cannot drive (or what to avoid when selecting a data mining partner)

You arrive in a foreign city for a couple of days, with some errands to run and meetings to attend. You have no idea how to get around, you don’t speak the language, and you cannot drive. Yet people approach you to sell you a car, competing on the features and capacity of their offerings, and then offer you driving lessons so that you can take ‘full advantage’ of it. You end up paying a lot of money for the high-end car and wasting your stay in driving seminars; you still cannot navigate the city, and you miss those meetings. In the end, you are told that the city is to blame.

compass - a photo by flickr and user maldita la hora

The above story probably sounds peculiar, and the approach plainly wrong; after all, it doesn’t take much intelligence to simply get a taxi or hire a driver for those two days. Still, to our surprise, this is how the enterprise world works when it comes to selecting a data mining solution or tackling a data-intensive problem, and we’d like to do our fair share to shed some light on it, given our very limited experience thus far. Here’s the original version of the story:

A big enterprise wants to leverage the large amount of data it collects to get actionable recommendations on some pressing topics in its operations. The company has neither the talent nor the experience and expertise to handle that, so it starts looking for external partners. Typically, these end up being big software vendors and their certified partners, providing ‘holistic’ and ‘customizable’ solutions to every specific problem on earth. The managers get excited; these solutions seem to provide everything they need to feel comfortable, including a complete stack of algorithms, full integration with their CRM and existing intra-enterprise infrastructure, a well-known brand and plenty of phone support staff. So they sign the really fat check and hand the rest over to their colleagues.

It takes a few weeks for the vendor to set everything up. The various modules then turn out to be quite complicated and the manuals unreadable, so the company also pays specialized training partners to teach selected staff how to import data and what to click to run an algorithm. Now that they know how to operate the tools, the company can finally get back to the original problem and try to come up with a working solution. But guess what: they own the Ferrari and they know how to drive it, but they have absolutely no clue how to navigate the chaotic city or interpret the various signals to get the job done!

Data mining software is not the equivalent of, say, Microsoft Office. If you have some basic Word skills, you can probably write and edit a text document; however, knowing where to find the Support Vector Machines button in a data mining suite still leaves you far from a working solution to your problem. Trial and error, which stands at the core of anything related to data mining, requires a deep understanding of the processes underlying each of the numerous algorithms. That understanding is what lets you follow best practices at every step, pick the right instruments, and interpret the results so as to move in the right direction and improve from naive results to really insightful ones. After all, in experimental problems the tools are not what matters most; actually, they do not matter at all if there are no experts to put them to good use.

At the end of the day, experts are needed to tackle such sophisticated problems, and our humble recommendation is to start with them and then decide which tools you need to acquire, if any at all. Consider also that, in most cases, data mining is not a task you need to pursue on a daily basis; capable people are required to solve a problem once, and then you just replicate the solution over time. Moreover, once you have good results, there’s typically no need for an expert to apply them to new data sets; you may just run an already available black box of code, or implement some fully transparent recommendations using simple tools, like advanced filtering. After all, it typically makes much more sense to take a taxi than to hire a full-time driver and have him wait until the next time you happen to be in town.
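To make the ‘advanced filtering’ idea a bit more concrete, here is a minimal sketch in Python of what a fully transparent recommendation might look like once the experts have done their part. Everything in it is hypothetical and for illustration only: the `Customer` fields, the `flag_for_follow_up` function, and the thresholds (6 idle months, 3 orders) are our own assumptions, not results from a real project.

```python
from dataclasses import dataclass


@dataclass
class Customer:
    # Hypothetical fields, chosen only for this illustration.
    customer_id: str
    months_since_last_order: int
    orders_last_year: int


def flag_for_follow_up(customers, max_idle_months=6, min_orders=3):
    """Return the IDs of customers that match a fixed, human-readable rule.

    The thresholds stand in for whatever an expert analysis would have
    produced once; applying them to new data needs no data mining suite.
    """
    return [
        c.customer_id
        for c in customers
        if c.months_since_last_order > max_idle_months
        and c.orders_last_year >= min_orders
    ]


# A new batch of (made-up) records, filtered with the same rule.
new_batch = [
    Customer("A-001", months_since_last_order=8, orders_last_year=5),
    Customer("A-002", months_since_last_order=2, orders_last_year=7),
    Customer("A-003", months_since_last_order=9, orders_last_year=1),
]
flagged = flag_for_follow_up(new_batch)
print(flagged)
```

The point of the sketch is that the rule is readable by anyone in the company and can be rerun on each new batch of data; the hard part, deciding which rule and which thresholds are worth applying, is where the experts were needed.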

We believe that the above may help you get a better understanding of how things work, or actually should work, in the industry, and which dead-end solutions to avoid. In any case, we’d be happy to show you around and help you find your way the next time you get lost with data or happen to land in the far-off area of data mining. Rest assured that there is really great value waiting to be discovered!

2 Responses So Far...

  1. Agree – building data mining solutions requires skills and training, not just tools.
    Avinash Kaushik has a 90/10 rule for web analytics – 90% of resources go to people, 10% to tools.
    The percentage of spend on people in data mining solutions is probably less than 90%, but it should be more than 50%.


  2. rob says:

    real world example?
