Not that Sane: Open-source software for data mining

What's the world coming to? The New York Times has an article on open-source software used for data mining. In this case, it's a hagiography of R:

Some people familiar with R describe it as a supercharged version of Microsoft’s Excel spreadsheet software that can help illuminate data trends more clearly than is possible by entering information into rows and columns. What makes R so useful — and helps explain its quick acceptance — is that statisticians, engineers and scientists can improve the software’s code or write variations for specific tasks. Packages written for R add advanced algorithms, colored and textured graphs and mining techniques to dig deeper into databases.

But R also has quite a learning curve. The easy things are easy, yes, but they are probably even easier in Excel. For the more complex things, you do need to know how to program. I really learned to use R by reading the book (what a quaint concept) by Venables and Ripley.

What other open-source software do I use in my day-to-day data mining work? I use Weka for quick analysis of data sets (the corresponding book by Witten and Frank is useful for understanding what these tools are doing). For neural networks, I use SNNS, even though it's quite old and no longer actively maintained.

The main reason I use these tools? They're scriptable, so I can combine UNIX shell scripts with statistical analysis on large data sets and automate the whole damned thing. That's hugely important for real-world data mining, and it's something that closed-source software makes hard. An often overlooked point in the case for open-source tools is that they are typically easier to incorporate into larger, more complex systems.
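
To make the scripting point concrete, here is a minimal sketch of the kind of small statistical step that slots into a shell pipeline. It is written in Python purely for illustration; it is not one of the tools named above, and the function name summarize and the file names in the usage note are hypothetical. It reads whitespace-separated text on standard input and prints the count, mean and sample standard deviation of one column.

```python
import statistics
import sys


def summarize(lines, column=0, sep=None):
    """Return count, mean and sample standard deviation of one column.

    Blank lines, short lines and non-numeric entries (such as a header
    row) are skipped rather than treated as errors.
    """
    values = []
    for line in lines:
        fields = line.split(sep)
        if len(fields) <= column:
            continue  # blank or short line
        try:
            values.append(float(fields[column]))
        except ValueError:
            continue  # header or other non-numeric entry
    if not values:
        return {"n": 0, "mean": float("nan"), "sd": float("nan")}
    # The sample standard deviation needs at least two observations.
    sd = statistics.stdev(values) if len(values) > 1 else 0.0
    return {"n": len(values), "mean": statistics.fmean(values), "sd": sd}


# Only run as a filter when data is actually piped in on stdin.
if __name__ == "__main__" and not sys.stdin.isatty():
    col = int(sys.argv[1]) if len(sys.argv) > 1 else 0
    result = summarize(sys.stdin, column=col)
    print(f"{result['n']}\t{result['mean']:.4f}\t{result['sd']:.4f}")
```

Saved as, say, summarize.py, it could be chained with ordinary UNIX tools, for example `grep -v '^#' measurements.txt | python summarize.py 2`, and the tab-separated output can feed the next stage of the script.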
