The upward spiral of innovation

Editor's note: Steven Struhl is senior vice president, senior methodologist, at Total Research/Harris Interactive.

SPSS has continued its fast-paced schedule of releasing both new products and major updates to its growing catalog of software titles. This article will review three updates which we trust will have wide applicability to the needs of data analysts. We promise to do our best to keep you on the edges of your seats.

We will start with a new version of the base SPSS package, Version 11. We also will look at an excellent charting package, DeltaGraph Version 5, and examine a new iteration of the SPSS classification tree (CHAID and CART) package, AnswerTree 3.

SPSS Version 11

Although the list of new features in SPSS Version 11 would extend across several pages, it seems more like an incremental upgrade than a major revision which effectively changes the program's operations or greatly expands its capabilities. The program still retains its basic structure (a base program with added modules that do more specialized or advanced tasks). To get the full range of capabilities in SPSS, you would need - in addition to the base - the advanced and regression models modules, conjoint, trends (for time series analysis), categories (for correspondence analysis and related procedures), and perhaps the special module for missing values analysis. If you work with small samples, you might also want the SPSS "exact tests" module, which returns incredibly precise statistical test results with limited amounts of data.

SPSS retains its basic "tree and output window" structure for organizing the results of statistical procedures. Unfortunately, it still does not put any titles you specify directly into the tree window, where they can be seen easily, but rather refers to them as "page titles" there (see Figure 1). So the tree still does not allow really rapid navigation to a spot of interest to you in a long output file, unless you alter the text in it by typing over the default labels.

Figure 1

In addition, the output from SPSS still can be read only by SPSS itself or its companion Smart Viewer. However, you can paste chart-based output from SPSS directly into spreadsheet programs such as Microsoft Excel, and the small amount of text-based output it produces into a word processor such as Microsoft Word. You can retain all the formatting in SPSS tables by exporting them in HTML format (using the SPSS file menu), and then opening them in a spreadsheet. This works particularly well with Microsoft Excel, which now includes HTML as one of its native languages.

In keeping with recent versions of SPSS, tables are very nicely formatted, and certainly meet or exceed the requirements for any scientific publication. These tables may need a little simplification for use in reports for more general audiences, though - which is where the ability to export to Excel becomes very handy.

Graphs created in SPSS still cannot be manipulated as "live" objects in other programs. Unfortunately, SPSS still gives less control over many charting options than does a program such as Excel. For instance, efforts to change the starting and stopping values on the axis of a chart produced in SPSS gave your reviewer only frustration. To get complex charts to come out very much as you would like them, you will find the SPSS companion product, DeltaGraph, to be a much better choice - as we discuss in the next section of the review.

SPSS retains its ability to accept most commands either through an extensive series of menus and dialog boxes, or by typing and running them in a special syntax command window. The syntax window is just a regular text editor, something like the notepad program that comes with Windows, and all SPSS command files are made up of plain text. SPSS syntax files remain the one item related to the program that can be easily modified with other software.

SPSS also retains many of the eccentricities of earlier versions. Some useful commands remain unavailable from its menus, and so must be typed into the syntax window. One example of a missing menu command is the option to rotate discriminant analysis solutions. Rotation of these solutions has much the same effect as rotation of factor analysis solutions, leading to clearer, more easily explainable results. To get this done you must perform some careful surgery on the commands available from the menus, or just type everything from scratch. Similarly, the entire conjoint analysis procedure still requires use of the syntax window, with no menu equivalents.

The typical old-timer (or as your reviewer prefers to think of himself, a very experienced user) mostly won't mind the syntax window. After all, early versions of SPSS accepted only typed commands - and at one time (just after Roman gladiator days) those commands even needed to be on computer punch cards. However, newer users most likely will find the need to go to the syntax window a little vexatious. Getting used to SPSS syntax and some areas where the program can be very picky about it - such as the use of periods - can be a challenge to those just starting with the program.

The menus themselves can be somewhat unsettling to neophytes as well. Nearly all of the program's analytical procedures are crowded into one menu, called Analyze. The grouping of the commands there is not entirely intuitive (both clustering and discriminant analysis are grouped under the entry "classify," for instance). Some procedures are never named directly in the menus (for instance, MANOVA is run by selecting "general linear model" and then "multivariate"). In any event, your reviewer would like to give SPSS a gentle hint: the program's interface isn't quite where it needs to be yet, especially for newer users. We can only hope that in upcoming versions some enhancements like those discussed here will find their way into the program.

Dramatic change

Perhaps the most dramatic change in SPSS is a new ability to rearrange data files. You can convert data between the so-called univariate layout (several records per respondent) and the multivariate layout (one record per respondent). This capability can be really handy if you have data arranged in a way that makes it impossible to do certain forms of analysis. For example, repeated measures analysis of variance requires data in the multivariate layout, with the repeated measurements all recorded on one line per respondent.
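To make the two layouts concrete, here is a minimal sketch in Python (not SPSS syntax, and not how SPSS does it internally) that turns several records per respondent into one record per respondent. The field names id, visit and score, and the function name, are invented for illustration.

```python
from collections import defaultdict

def long_to_wide(records, id_key="id", index_key="visit", value_key="score"):
    """Turn several records per respondent into one record per respondent."""
    wide = defaultdict(dict)
    for rec in records:
        # Each repeated measurement becomes its own column, e.g. score_1, score_2.
        column = f"{value_key}_{rec[index_key]}"
        wide[rec[id_key]][column] = rec[value_key]
    return [{id_key: rid, **cols} for rid, cols in sorted(wide.items())]

# Hypothetical data in the "univariate" layout: one line per measurement.
long_rows = [
    {"id": 1, "visit": 1, "score": 7},
    {"id": 1, "visit": 2, "score": 9},
    {"id": 2, "visit": 1, "score": 4},
    {"id": 2, "visit": 2, "score": 6},
]

wide_rows = long_to_wide(long_rows)
for row in wide_rows:
    print(row)
```

Each respondent ends up on a single line with columns score_1 and score_2, which is the arrangement a repeated measures analysis expects.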

In the more advanced analytics, perhaps the most salient improvement comes in linear mixed models. A new procedure allows you to construct models to analyze data that fall into a nested structure. One particularly nice feature of this is that you can do incomplete repeated measures, in which the number of observations varies across subjects. This can prove to be highly useful if you need to analyze, for instance, a patient record study where patients have different numbers of visits or observations.

The number of formal ANOVA models you can specify has expanded somewhat as well, which should be helpful to more advanced data analysts. To get these new capabilities you will need to get the Advanced Models option.

SPSS also has improved the performance of several of its procedures - although with a 1GHz machine most of these seemed fairly speedy already. General linear models, proximities, hierarchical cluster analysis (in the SPSS base system), and multinomial logistic regression (in regression models module) now run much faster than in earlier releases.

Improvements also have been made to the SPSS database wizard. This now allows you to recode categorical "string" (or alphabetic) variables automatically into numeric variables, and to retain the original character-based values as value labels. You now also can extract random samples from large data sources.
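The recoding idea can be sketched in a few lines of Python. This is only an illustration of the concept, not the wizard's actual logic: the function name is invented, and assigning codes in alphabetical order starting at 1 is an assumption made for the example.

```python
def autorecode(values):
    """Map each distinct string to a numeric code, keeping the string as its label."""
    # Assumption for this sketch: codes follow alphabetical order, starting at 1.
    labels = {code: text for code, text in enumerate(sorted(set(values)), start=1)}
    codes_by_text = {text: code for code, text in labels.items()}
    return [codes_by_text[v] for v in values], labels

regions = ["West", "East", "West", "North"]  # hypothetical string variable
codes, value_labels = autorecode(regions)
print(codes)
print(value_labels)
```

The numeric codes can go into an analysis, while the dictionary of labels keeps the original text available for output.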

Some moderate inconveniences found in earlier versions have been ameliorated. The text wizard, which reads in ASCII and delimited data, now is approaching the level of the excellent one found in Excel. The SPSS wizard now allows you to read CSV-format text data that contains text qualifiers (such as, "1,000," "2,000," etc.), and to specify a wider variety of text separators for delimited data. In another welcome enhancement, SPSS also no longer forces you to use scientific notation for small numbers in your output. You can choose not to see this notation at all, showing just decimal values if you prefer. The insistence on scientific notation for small values in earlier versions could sometimes require reformatting of charts before they could be used.
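For readers unsure what a text qualifier does, the following sketch uses Python's standard csv module (not the SPSS text wizard) on a made-up snippet of data. The quotation marks tell the reader that the comma inside "1,000" belongs to the value and is not a separator.

```python
import csv
import io

# Hypothetical CSV text: the quoted fields contain commas that are part of the value.
raw = 'region,sales\nEast,"1,000"\nWest,"2,000"\n'

reader = csv.reader(io.StringIO(raw), delimiter=",", quotechar='"')
rows = list(reader)
print(rows)

# Strip the thousands separator to recover numbers from the data rows.
sales = [int(value.replace(",", "")) for _, value in rows[1:]]
print(sales)
```

Without the qualifier, each sales figure would be split into two fields at the comma.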

There are many other enhancements to the product, some small and others doubtless important to various readers. A full list can be found on the SPSS Web site (www.spss.com).

Less is less

One unfortunate change in SPSS is the elimination of manuals for all but the base product in the package you receive from SPSS, even if you buy the base with many optional modules. Documentation for these now is provided in PDF form on the installation CD-ROM. You also can install the manuals on your computer's hard drive. However, for some users, the paper manuals are still indispensable. If you want these, you now need to order them and pay for them separately. No doubt this saves something in production costs for SPSS, but your reviewer wonders if this is much of a service to the user. The extremely useful syntax guide has long been an extra cost item, so perhaps this change for the manuals was inevitable. (By the way, I highly recommend the syntax guide, even as an extra purchase. At times, it will provide an answer for a problem that does not seem to be addressed anyplace else, and is often handier than the corresponding pop-up screens in the help system of the SPSS program.)

At the very least, SPSS could have devoted some special sections of the help system to the new procedures instituted in this release and how to use them. Your reviewer did not find full descriptions of these and how they work in one location.

Version 11 and Windows XP

The documentation that comes with SPSS Version 11 states quite plainly that the product has not been tested with the new Microsoft operating system, Windows XP. This is quite odd since working late-stage beta releases of the XP system have been available for some months. Perhaps SPSS is just hedging its bets, since Microsoft has been known to put some last-minute surprises - apparently intentional and not - into many of its products. However, your reviewer's preliminary trials of SPSS on a Windows XP machine reveal absolutely no problems so far. In the spirit of SPSS, though, let me be quick to add that this is no guarantee that everything will run just as smoothly on your PC. Please check the SPSS site for updates on anything that might possibly not work.

Extend and refine

Grumbles about the manuals aside, this release continues to extend and refine the basic SPSS product. As stated earlier, it does not change much about the basic functioning of the program, except for a new ability to rearrange data. It is best characterized as mainly an incremental upgrade, certainly of great interest to those who want to stay up to date, but not crucial to those who do not need any of the improvements in this release. As has been usual for SPSS releases, everything seemed to be in perfect running order in the first shipping version of the product. That is something many software makers could use as a shining example.

DeltaGraph 5.0

DeltaGraph started life, a good number of years ago, as one of those Macintosh programs which users of IBM-compatible PCs strongly coveted. It has long been a top-notch charting and graphing program, boasting both an extremely wide range of graph types and the ability to customize nearly everything in any graph it created.

For the new version, SPSS has entirely revamped the program. It is much quicker in general and more responsive to commands. There no longer is any hesitation as the program loads - it is so quick on a 1GHz-class machine that the usual opening "splash" screen (with the program name and a cute graphic) never appears at all - if it even exists. SPSS also has cleaned up the operations and menus of the older versions, which used to be confusing even if you used the program often. (In older versions, you needed to search up to three menu locations and right-click on the object in question to see if you could change many aspects of the chart.) Now the program is much more intuitive, although so many things can be changed on a chart that you still need to learn which ones are specified someplace on the screen and which ones can be reached with a simple right-click of the mouse. Moving from the old menu system to the new one, though, was generally painless.

DeltaGraph always made an incredible variety of charts (well over 100), and now has managed to add 11 new types to the list. It is very likely that you will find precisely what you need from its selections, or be able to customize a chart it produces so that it closely matches your needs. The variety of options it presents may even start you thinking about new ways to present your data.

For instance, one DeltaGraph chart type that is rarely seen but seems quite useful is called an x/y bar chart (and this can be expanded into an even more complex variant called the segmentation bar chart). In the x/y bar chart, both the heights of the bars and their widths have meaning. You could, for instance, specify that the height of each bar represent that group's level of agreement with an idea, and the width represent the size of group, as in the example shown here.

Figure 1a

Perhaps one of the program's most intriguing features is that it works from within Microsoft Office. DeltaGraph can place a button on the toolbar in a Microsoft application (such as Excel or PowerPoint). If you click on this button in the Microsoft program's toolbar, it calls up DeltaGraph's charting abilities without leaving the Microsoft application. When you work in Excel, updates to your data are reflected automatically in the DeltaGraph chart. As long as you have both programs on your PC, then, DeltaGraph charts can behave like "live" objects in Excel. (Of course, if you move the spreadsheet to another computer without DeltaGraph, the chart will become a completely static - if still pretty - picture.)

Beyond the variety of charts, DeltaGraph has many other excellent features. A real favorite of your reviewer's is that, unlike Excel and PowerPoint, it really creates charts the way you want them. So if you want a horizontal bar chart with 28 bars and you want them all labeled in 11-point type, DeltaGraph will do what you ask. (In this case, PowerPoint and Excel both will insist that they know better than you, and skip at least every other label, unless you make the chart something like 20 inches tall. You can safely try this at home and see exactly what your reviewer means.)

DeltaGraph now can import files from SPSS, the major spreadsheets and databases, and from various types of delimited ASCII (or plain) text. It can import graphics in a wide range of formats, and export its charts in still more formats. The imported graphics can be put to some fairly exotic uses, such as filling bar charts with custom pictorial symbols.

DeltaGraph retains its helpful chart gallery, which provides visual cues and descriptions that aid you in finding and choosing the chart or graph you need. If you are not sure how to set up a chart, you can display one with sample data provided by DeltaGraph and then examine the underlying spreadsheet to learn how it is done.

A particularly useful feature of DeltaGraph is that it allows you to save charts with all the custom features you specify in a "library," and further, to integrate your custom charts with its standard selections when you look in the chart gallery. In this way, you can quickly review the chart types you have modified and see how these compare with each other and the program's preset choices.

The program also boasts a chart wizard which it claims will turn new users "into pros in seconds." I cannot really speak to this, since wizards tend to slow me down in programs with which I am already familiar. However, the wizards did seem logical and generally seemed to give good advice.

In keeping with its serious scientific side, the program also includes a sophisticated equation editor. This gives you the ability to create publication-quality equations along with your charts and graphs, should you ever need these. You can edit these equations directly on the DeltaGraph page. DeltaGraph also includes a healthy variety of statistical functions, and allows you to manipulate and transform data without leaving the program.

Too new for its own good?

When SPSS revised this program, they made one exceptionally questionable decision: they did not make it compatible with earlier releases. This means that you cannot import chart templates from earlier versions of this program or even open charts from earlier versions. Your reviewer has to wonder just what the folks in programming were thinking when they did this. Not only could work done even in recent weeks become completely inaccessible, but also all the many hours spent building up an extensive chart library in earlier versions are rendered useless.

You can rest assured that your reviewer both made an irate phone call and sent a stern note to the people at SPSS about this. So far, their response has been a lukewarm assertion that it is a sort of shame that nothing from old versions works with the new one - but don't worry, you can keep both versions on your PC. This raises a somewhat obvious question: if the old version is good enough to keep around indefinitely, why bother with the new version anyhow?

SPSS really needs to do some serious work here, and write import routines for both charts and chart libraries created in older versions of the program.

As good as anything

If you are looking for an advanced charting package for the first time, DeltaGraph seems as good as anything you will find anywhere. It is flexible and allows you to do what you want with charts and graphs. Once you have a graph just the way you want it, it can go into the "library" and serve as a template for later graphs you make. The program works within Microsoft Office, so you finally can get Excel and PowerPoint to stop acting smarter than you and create the charts you want.

If you are thinking about upgrading DeltaGraph, fire off a note to SPSS and ask them when they are planning to make the product compatible with earlier versions. Then consider whether you want to go through the rather frustrating process of recreating everything you did in earlier versions of this program. The new upgrade indeed is excellent, as you likely would expect from using earlier versions, so balance this carefully against the inconvenience that will be caused by a lack of compatibility with the last iteration of this product.

AnswerTree Version 3

SPSS bills the latest version of AnswerTree, its classification tree software package, as data mining software. Since the term data mining may not be entirely clear to many readers, and classification tree analysis also may remain somewhat vague, let us start by trying to delineate what "mining" of this type is all about.

Data mining, aside from being viewed as part of the good, the true and the beautiful, in practical terms usually involves sifting through large heaps of data. Also, the data sets being mined typically have not been collected or structured for the purpose of being analyzed.

The phrase data mining also often serves as a kind of code for digging through the entirety of a huge database, whether this runs to gigabytes (billions of bytes) or terabytes (trillions of bytes). Therefore, just the fact that software can tackle an entire database of any size may, in the minds of some, transform it into a data-mining tool.

The larger question is what extra value we can find in analyzing a terabyte of data rather than a "sample" as large as 40,000 to 200,000 records - which a relatively high-powered PC can handle without much strain. The working idea behind poking through all the data, regardless of how much, seems to be that if you have enough data, it is practically inevitable that something of value will emerge from it.

In case you had any doubts, this belief is not true. In fact, very large data sets can cause problems of their own, leading some statisticians to suggest there may be such a thing as too much data. With large enough samples, nearly all (or all) variables can look like significant predictors - and with their significance often at astronomical levels. Finding what is truly important can become quite difficult if you have (say) 100 variables significant at a level of 10^-20 or better. This can happen quite easily with a very large data set.
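A small worked example shows how sample size alone can manufacture significance. The sketch below is in Python and uses invented figures: a predictor that separates two equal-sized groups by only 51 percent versus 49 percent. It computes the ordinary Pearson Chi-square for a 2x2 table (without continuity correction) and the matching probability for one degree of freedom.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def p_value_1df(chi2):
    """Upper-tail probability of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))

# Hypothetical predictor with a tiny effect: 51% vs. 49% "yes" in two equal groups.
results = []
for per_group in (1_000, 100_000, 1_000_000):
    yes_a = per_group * 51 // 100
    yes_b = per_group * 49 // 100
    chi2 = chi_square_2x2(yes_a, per_group - yes_a, yes_b, per_group - yes_b)
    results.append((per_group, chi2, p_value_1df(chi2)))
    print(per_group, chi2, p_value_1df(chi2))
```

The difference between the groups never changes, yet the Chi-square climbs from 0.8 to 80 to 800 as the groups grow, so a trivial effect moves from nowhere near significant to an astronomically small probability.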

Nonetheless, the idea that you will benefit by sifting through everything in a database has become popular. The new version of AnswerTree indeed will allow you to do this, as it now comes in both "client" and "server" versions, just as the main SPSS program has done since its version 10. The server version is designed to attack huge masses of data, with the PC program sending commands to the server, which in turn does all the heavy data manipulation. This ability to reach right into a database, and get the larger computer to do the actual calculation, has obvious advantages compared with trying to move a huge amount of information to a PC and process it there.

What classification tree analysis does

We don't have space in this review to do a complete summary of classification tree analysis and all its remarkable powers but here are a few high points.

Basically, the method creates its tree structures by splitting the sample (or database) repeatedly. More specifically, it finds ways to split and re-split a sample to create groups with relatively high and relatively low incidences of some important variable. If, for instance, we want to find the demographic characteristics most and least associated with, for example, sticking with a diet, classification trees are (in your reviewer's opinion) one of the best ways to do this. Classification trees can simultaneously handle all types of data - nominal level, ordinal level, and continuous - in one analysis. A small classification tree, labeled and arranged as your reviewer likes them, appears in Figure 2.

Figure 2
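The core of a single splitting step can be sketched in Python. This is a deliberately stripped-down illustration, not the algorithm AnswerTree runs: it simply picks the predictor with the largest Chi-square against the outcome, and it leaves out the category merging and significance adjustments that real classification tree programs perform. The variable names and data are hypothetical, and both predictors have two categories so that the raw Chi-squares are comparable.

```python
from collections import Counter

def chi_square(rows, predictor, outcome):
    """Pearson chi-square for the cross-tabulation of predictor by outcome."""
    cells = Counter((r[predictor], r[outcome]) for r in rows)
    row_totals = Counter(r[predictor] for r in rows)
    col_totals = Counter(r[outcome] for r in rows)
    n = len(rows)
    stat = 0.0
    for p_val, p_count in row_totals.items():
        for o_val, o_count in col_totals.items():
            expected = p_count * o_count / n
            stat += (cells[(p_val, o_val)] - expected) ** 2 / expected
    return stat

def best_split(rows, predictors, outcome):
    """Return the predictor whose categories differ most on the outcome."""
    return max(predictors, key=lambda p: chi_square(rows, p, outcome))

def make_rows(carbs, gender, quit, count):
    """Build `count` identical hypothetical respondents."""
    return [{"limits_carbs": carbs, "gender": gender, "quit": quit} for _ in range(count)]

# Hypothetical diet study: limiting carbohydrates matters, gender does not.
sample = []
for gender in ("F", "M"):
    sample += make_rows("often", gender, "no", 30) + make_rows("often", gender, "yes", 10)
    sample += make_rows("rarely", gender, "yes", 30) + make_rows("rarely", gender, "no", 10)

print(best_split(sample, ["gender", "limits_carbs"], "quit"))
```

A full tree would repeat this search inside each of the groups the winning split creates, stopping when no remaining predictor is significant.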

Classification trees also have special capabilities for handling missing data. Rather than dropping any individual with a missing response, as most multivariate procedures do by default, or substituting an average, classification trees treat "missing" as just another response. The most sophisticated classification tree programs can in fact handle missing data in several ways and allow the user to choose one. AnswerTree does not offer options in the ways it handles missing data.
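Treating "missing" as just another response is easy to picture with a short Python sketch. The data, the variable names and the "(missing)" label are all invented for the example, and nothing here is drawn from AnswerTree itself.

```python
from collections import Counter

MISSING = "(missing)"  # label chosen for this sketch

def incidence_by_category(rows, predictor, outcome, target):
    """Share of each predictor category, missing included, that has the target outcome."""
    totals, hits = Counter(), Counter()
    for r in rows:
        category = r.get(predictor)
        if category is None:
            category = MISSING  # keep the respondent; do not drop or impute
        totals[category] += 1
        hits[category] += r[outcome] == target
    return {c: hits[c] / totals[c] for c in totals}

# Hypothetical respondents; two of them did not answer the income question.
people = (
    [{"income": "high", "quit": "yes"}] + [{"income": "high", "quit": "no"}] * 3
    + [{"income": "low", "quit": "yes"}] * 2 + [{"income": "low", "quit": "no"}] * 2
    + [{"income": None, "quit": "yes"}] * 2
)

rates = incidence_by_category(people, "income", "quit", "yes")
print(rates)
```

In this made-up case the people who skipped the question turn out to be the most distinctive group, information that would be lost by dropping them or substituting an average.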

Finally, these methods allow interactions in data to appear in levels of detail and complexity not possible with any other method. Suppose you create a split in the sample and below it different predictors appear for each of the values found in the first split - just as we have in our sample tree diagram. This means that different values of the first predictor in the model (how often the person limits carbohydrates) lead to different variables emerging as the most significant predictor.

This is nothing other than an interaction between the variables in question. That is, the ways the predictors in the bottom row of the model behave depend upon the values of the predictor in the row above them. Without knowing the precise state of both the first and second predictor in each "branch" of the tree, we cannot estimate the value of the dependent variable. The complexity of this interaction most likely could not be captured by most other methods - note that low values for how often the person limits carbohydrates lead to one predictor as most significant, middling values lead to another, and high values lead to a third predictor.

The ability of this method to ferret out complex interactions led to one of its earliest names: CHAID, which stands for Chi-square automatic interaction detection. (The Chi-square is the test that the method uses to determine significance.)

Other extensions of classification trees

Like everything else in the world of data analysis, classification trees have added complexity upon complexity, and so have grown nearly impossible to explain in detail to any but the fully initiated. However, we need at least to make a start on some of the alternatives available, since AnswerTree offers them. The CHAID algorithm - the first method to develop classification trees that actually works - has since been joined by several others. Two of these (and a more advanced variant of CHAID) are offered by the AnswerTree program.

The advanced variant of CHAID is called the "exhaustive" method, and is a substantial advance over earlier analytical procedures for creating classification trees. Basically, ordinary CHAID sometimes would stop before it found the most powerful way to split a sample. That is, it would stop testing ways to split the sample as soon as it found a way to make all the groups statistically different. Exhaustive CHAID goes on and continues to test all possible ways of splitting the sample. It usually finds more possible predictors and stronger levels of statistical significance than the garden-variety method, and so is preferable.

AnswerTree, of course, offers users the choice of the preferable (exhaustive) and the non-preferable (ordinary) methods, just in case you would really like to use something inferior. (This is doubtless the same spirit that impels SPSS to offer over 20 methods for comparing groups in ANOVA, including several that have been thoroughly discredited. In any event, if you really want to do something wrong, the helpful statistics program is there to aid you in doing it.)

AnswerTree also offers the ability to perform analyses using a different strategy. Classification tree analyses of this type are called CART. In many cases, the method is called C&RT, since one software maker managed to get a registered-trademark symbol (®) on the acronym "CART" itself. CART or C&RT stands for "classification and regression trees." This nomenclature seems a little like allowing one manufacturer to put an ® on a term like "dog food" so that everybody else in the business needs to call their product something like "dog f_d." However, ours is not to question the wisdom of regulatory agencies.

The acronym CART (or C&RT) really is proper usage for any classification tree analysis where the dependent variable is continuous. You cannot use Chi-squares to test significance unless the dependent variable is nominal or ordinal. (And it would seem that without the Chi-square involved, the first part of the acronym CHAID no longer applies.) However, CART or C&RT now means something else entirely. It is applied to a method that can split a sample only into two groups at any point in the tree diagram - or more formally, a method that always bifurcates the sample.

The advantage of CART/C&RT is that it can not only build a tree by going forward and looking for significant predictors, but it also can tear the tree back down again. Tearing the tree back down can lead to very economical models (with just a few important predictors). This can happen because a variable that looks significant at some point in a classification tree may not add to the overall predictive power of the model. That is, the incremental gain from adding another predictor - even one that looks significant - can become insignificant after some point in the analysis.

CART/C&RT also can do some fancy types of model validation not available with CHAID. With CHAID, model validation is done as it is in most other methods, by having a big sample and subdividing it into two portions, or partitions. One of these partitions, often called the "main" or "learning" sample, is used to develop the model. The model then gets tried on the other partition, often called the test or hold-out sample. How well the model works in both partitions is then compared.

In most cases, the model will perform somewhat worse in the test partition because the model tends to fit the peculiarities of the main or learning partition. Therefore, how the model works in the test partition is said to give an estimate of how well it would function if applied to an entirely new data set. This really isn't true, but it does tend to tone down overly optimistic estimates of how the model will perform if used again in the outside world.
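The learning and hold-out routine can be sketched in Python. Everything here is an assumption made for illustration: the 70/30 division, the randomly generated data, and the toy "model," which merely predicts the most common outcome within each category of one predictor and stands in for a real tree.

```python
import random
from collections import Counter, defaultdict

def split_sample(rows, test_share=0.3, seed=0):
    """Randomly divide rows into a learning partition and a hold-out partition."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_share)
    cut = len(shuffled) - n_test
    return shuffled[:cut], shuffled[cut:]

def fit_majority_rule(rows, predictor, outcome):
    """Toy 'model': predict the most common outcome within each predictor category."""
    counts = defaultdict(Counter)
    for r in rows:
        counts[r[predictor]][r[outcome]] += 1
    return {cat: c.most_common(1)[0][0] for cat, c in counts.items()}

def accuracy(model, rows, predictor, outcome):
    """Share of rows whose outcome the model predicts correctly."""
    correct = sum(model.get(r[predictor]) == r[outcome] for r in rows)
    return correct / len(rows)

# Hypothetical data: the outcome follows the predictor only loosely.
rng = random.Random(1)
data = []
for _ in range(200):
    x = rng.choice(["a", "b", "c"])
    y = "yes" if rng.random() < {"a": 0.7, "b": 0.5, "c": 0.3}[x] else "no"
    data.append({"x": x, "y": y})

learning, holdout = split_sample(data)
model = fit_majority_rule(learning, "x", "y")
print("learning accuracy:", accuracy(model, learning, "x", "y"))
print("hold-out accuracy:", accuracy(model, holdout, "x", "y"))
```

The model is built only on the learning partition and then scored on both, so the two accuracy figures can be set side by side.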

CART/C&RT can do something called cross-fold validation, which repeatedly takes random subsets of the sample and sees how the model performs in those. (Usually 90 percent of the sample is drawn 10 different times, and results in each of these 10 subsets are compared with results in the entire sample.) This turns out to have about the same effect as using a main and a hold-out sample, in that the model seems to perform somewhat worse than in the total sample when results are averaged across the randomly drawn subsets. So this method also can tone down overly optimistic estimates of how well a model will perform with other data.
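The drawing of subsets can be sketched as follows. This Python example follows the common "k-fold" form of the procedure, in which the sample is cut into 10 parts and each 90 percent subset is paired with the 10 percent left out of it; whether AnswerTree divides the sample in exactly this way is an assumption, and the function name is invented.

```python
import random

def k_fold_indices(n, k=10, seed=0):
    """Yield (build, check) index lists; each case is left out exactly once."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]
    for i, check in enumerate(folds):
        # The build subset is everything outside the current fold (90% when k=10).
        build = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield build, check

folds = list(k_fold_indices(n=100, k=10))
for build, check in folds:
    print(len(build), len(check))
```

A model would be fitted on each build subset and its results averaged across the 10 draws, which is where the more sober performance estimate comes from.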

Finally, AnswerTree provides yet another analytical algorithm, called QUEST. This method is something like CART (or C&RT), but has a speedier algorithm. Whether it produces results that are as good is another matter, and still open to some debate, since QUEST is a relatively new method (c. 1997). With AnswerTree, it is there if you want it.

Analytically capable, not flexible in output

As the section above suggests, AnswerTree is analytically versatile. Also, as a reminder, it offers the option, with its server version, of tearing into a huge database in its entirety.

AnswerTree seems very capable analytically, but not flexible in its output. It offers the user a lot of information, but does not allow much customization in presentation. This may be more of a problem to your reviewer, who wants all his results to look just so, than to many users. AnswerTree has advanced substantially over its last version, in that its tree diagrams now paste nicely into programs like PowerPoint, where you can edit them. (In earlier versions, the tree diagrams could be pasted only as bitmaps, which are just collections of dots, and so not readily edited.)

Figure 3

As Figure 3 shows, the basic interface in AnswerTree is rich in information. This shows the first split in a classification tree (CHAID) analysis. In the main (largest) window, you see detailed information about the significance of the split, and how the distribution of responses shifts in each of the groups formed.

Somewhat less clear is the indication of just exactly how the groups are defined. For instance, you have to know something about the data to figure out just what a response of “<=Yes” means: you would not know whether this category includes “No” unless you knew the number codes involved. Unfortunately, the level of detail you see in the diagram is all you can get. You cannot get the specific categories included in “<=Yes” to be shown. This is a really salient shortcoming in the display of results. In fact, the way that the specifics appear in each box (or “node”) in the tree also is set. You have a choice of a display like this, or this display with a small chart added below the distribution of responses. Asking for a more compact display, or one that omits some of the information shown, just isn’t possible.

If you look back at the first tree diagram sample in this section, you will see that only one category of response (the percentage who quit) appears in each box. If AnswerTree produced that diagram, it would print out both the percentage who quit and the essentially redundant information about the percent who did not quit. (Percentage who did not quit equals 100 percent minus the percentage who did quit.) With an AnswerTree diagram, you would need to trim out the information you did not want from each node in the tree.

Looking at the smaller windows in Figure 3, you also can see that a chart of the node highlighted in the larger window (it has a dashed line around it) is produced automatically (see the lower left corner). Above that, an overall diagram of the tree appears. There is some very strange numbering appearing in that diagram. At the top there is a box or node numbered “0” and then two nodes below it numbered “14” and “15” respectively. You might think, as your reviewer does, that the numbers 1 and 2 naturally should follow from 0, but this is not necessarily the case in AnswerTree. The reason for the high numbers on these nodes is that I tried several other possible predictors in that location before settling on the one you see there as most useful for presentation. Unfortunately, AnswerTree insisted on giving new numbers to all the nodes tried and discarded, which seems like (to put it mildly) a non-optimal approach. A program that can do all the intensive calculations required in thousands or millions of comparisons ought to be able to start counting with the number 1.

At the bottom of the diagram, you see some tabs that point to further information about the tree. The gains analysis can be a very helpful supplement to the tree diagram itself. The gains chart shows a great deal of detail about the groups at the end of the tree (the so-called “terminal nodes”), but not about the sequence of events that led to the groups’ formation. The gains chart gives another way to organize and augment the information in the tree diagram. As trees get larger, there usually is not a straightforward progression from (say) very high incidence of the group being studied to very low incidence of that group, as you move across the groups at the bottom of the tree diagram. The gains chart shows these groups in order of incidence, either from low to high or from high to low, and gives many other statistics about each group.

Figure 4

Ideally, a gains chart should give the defining characteristics of the group along with all the numbers associated with that group. As the small sample gains chart from AnswerTree in Figure 4 shows, the program does not do this. Rather, it just shows the number that the program has given the group, and then all incidence figures for the group. To determine who actually is in each group, you would need to append the group descriptions to the AnswerTree gains chart yourself. This is a tedious process to do by hand, and something that a computer program can automate easily. However, this has not yet happened with AnswerTree.
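As a rough illustration of what such an augmented gains chart might contain, here is a minimal Python sketch. It is not AnswerTree output or code: the node numbers, group definitions, and counts are all invented for the example, and the "index" column (group incidence relative to overall incidence, times 100) is one common convention assumed here rather than taken from the program.

```python
from dataclasses import dataclass


@dataclass
class TerminalNode:
    node_id: int       # number the tree program assigned to the group
    definition: str    # hypothetical text describing who is in the group
    n: int             # cases in the group
    n_target: int      # cases in the category being studied (e.g., "quit")


def gains_table(nodes, descending=True):
    """Order terminal nodes by incidence and attach each group's definition."""
    total_n = sum(nd.n for nd in nodes)
    overall_rate = sum(nd.n_target for nd in nodes) / total_n
    ordered = sorted(nodes, key=lambda nd: nd.n_target / nd.n, reverse=descending)
    rows = []
    for nd in ordered:
        rate = nd.n_target / nd.n
        rows.append({
            "node": nd.node_id,
            "definition": nd.definition,
            "n": nd.n,
            "pct_of_sample": 100 * nd.n / total_n,
            "incidence_pct": 100 * rate,
            # Assumed convention: 100 means the group matches the overall rate.
            "index": 100 * rate / overall_rate,
        })
    return rows


# Invented example data: three end groups from an imaginary tree.
nodes = [
    TerminalNode(14, "Used support line = Yes", 400, 60),
    TerminalNode(17, "Used support line = No; tenure under 1 year", 250, 150),
    TerminalNode(18, "Used support line = No; tenure 1 year or more", 350, 70),
]

for row in gains_table(nodes):
    print(row)
```

The point of the sketch is simply that the group definition travels with the numbers, so nobody has to look up what "node 17" means.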

AnswerTree also produces classification rules, which are a simple set of if-then statements that describe the various groups formed. There may still be a slight bug in the program in displaying the rules. I could highlight several of the nodes on the tree diagram and get the program to show the rules for them at the same time. However, if I tried to highlight all the ending nodes, and so get rules corresponding to all the groups formed at the end of the tree diagram, the program just returned a one-line statement. This seems to be something that SPSS needs to investigate.

However, such rules as are generated can appear in ordinary English, in SPSS syntax, and in a format that can be used with an SQL database. Once the problem with the rules gets cleared up, the program should be very handy to use with SPSS itself or with database programs for “scoring” people - that is, identifying the group in the tree diagram to which they most likely would belong.
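To make the idea of scoring with if-then rules concrete, here is a small Python sketch. It does not reproduce the rule formats AnswerTree exports; the variable names (`used_support`, `tenure_years`), node numbers, and rules are hypothetical, chosen only to show how a set of mutually exclusive rules assigns each person to one end group.

```python
# Hypothetical rules: (node number, plain-English rule, test on one record).
RULES = [
    (14, "IF used_support = Yes",
     lambda r: r["used_support"] == "Yes"),
    (17, "IF used_support = No AND tenure_years < 1",
     lambda r: r["used_support"] == "No" and r["tenure_years"] < 1),
    (18, "IF used_support = No AND tenure_years >= 1",
     lambda r: r["used_support"] == "No" and r["tenure_years"] >= 1),
]


def assign_group(record):
    """Return the node number of the first rule the record satisfies.

    Returns None if no rule matches (for example, an unexpected code).
    """
    for node_id, _description, test in RULES:
        if test(record):
            return node_id
    return None


# Invented records to score.
people = [
    {"id": "A", "used_support": "Yes", "tenure_years": 3.0},
    {"id": "B", "used_support": "No", "tenure_years": 0.5},
    {"id": "C", "used_support": "No", "tenure_years": 4.0},
]

for person in people:
    print(person["id"], assign_group(person))
```

Keeping the plain-English text alongside each test mirrors the idea of having the same rule available in more than one format.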

AnswerTree also does not allow the user to do anything with the data inside the program, other than identify whether each variable is categorical, ordinal, or continuous. For instance, you cannot “map” data - doing operations such as collapsing the categories of a variable if this proves to be valuable during an analysis, or changing descriptive labels to something that proves to be more intelligible. Rather, you must exit AnswerTree, do the needed transformations in SPSS, and restart the program.

Unfortunately, AnswerTree apparently does not allow you to “inherit” an analysis - that is, to use the settings from another project and apply them to the one at hand. So it appears that once you exit AnswerTree, even if this is just to do a simple re-labeling of variables, you must set all the required parameters for the project again.

To conclude this section with something positive (always good manners according to some more influential family members), AnswerTree has become substantially faster than the last version. In the last release, you would set up the project, tell the program to start analyzing, and then go out and make a cup of coffee, hoping it would be done before you returned. This version still hesitates for a short time at the same point in the analysis, but seems many, many times faster there and overall. The improvement in speed is most welcome.

Costs

If you have been following these reviews and have an incredibly retentive memory - or if you have looked at enough software yourself - you will already suspect that the server version of AnswerTree is in the enterprise class of software. In ordinary English, this means the program can handle larger problems than mere PC software, and that it costs far, far more. Expect enterprise class software to cost somewhere between 10 and 100 times the price of its PC counterpart. The server version of AnswerTree is no exception.

Commercial classification tree software, as a class, tends to be quite expensive, so the AnswerTree server version is a major investment. AnswerTree, at about $2,000 for the client or PC version, costs less than at least two main competitors, CART from Salford Systems and KnowledgeSeeker from Angoss, Inc. The leading “bargain” in classification tree software appears to be the routine included in the Systat program. All of Systat, which does a remarkably wide range of statistical routines, can be had for under $1,500. It includes a full program that does CART analysis (or, thanks to Salford putting an ® on the term CART, C&RT analysis).

AnswerTree is available as a product from a group at SPSS called SPSS BI, which presumably refers to the fact that they help produce business information rather than to any lifestyle preferences.

Range of capabilities, inflexibility of output

AnswerTree has as wide a range of capabilities as any other classification tree analysis program, doing two forms of CHAID analysis, along with CART (or C&RT) and QUEST. We can have little doubt that it is an analytically capable program, especially given its provenance from SPSS. Its server version gives the user the ability to reach directly into a huge database housed on a large computer and get the large computer to do the analytical heavy lifting. AnswerTree is most seriously compromised by the inflexibility of its output, which, while generally attractive, may not meet users’ needs in a variety of situations. It also is limited by an inability to do even the simplest touch-up or transformation operations in the program, and by its inability to inherit settings from another analysis that you already have done. Also, as mentioned earlier, the gains charts have a serious shortcoming in not providing listings of the sets of variables that describe the groups emerging from the analysis.

Users who are not terribly choosy about how their output looks most likely will find a great deal to like in AnswerTree. Also, users of the last version of this program are likely to find this iteration an improvement. Those who are more exacting about their output, though, may find the program very frustrating. The basic analytical tools you need for this form of analysis are all there, but getting the displays to convey the desired information may require an undue amount of work.

A three-paragraph summary

Of the three SPSS products, DeltaGraph 5 appears to have changed the most from its last version. It is faster, sleeker, and better organized. If you are considering a charting and graphing package, and especially if you want nearly complete control over how those charts and graphs look, it is doubtful you could do better than this package. For those who already own the product, let’s hope that SPSS responds to your reviewer’s temperate entreaties, gets on the stick, and provides the needed import functions for charts and chart templates from earlier versions.

The upgrades to AnswerTree also have been major. The program boasts a wide range of analytical methods, and seems to have become more speedy and efficient since the last release. However, its lack of flexibility in output may prove to be frustrating to those users who - like your reviewer - want their findings to convey very specific types of information. Still, compared with the major commercial alternatives, AnswerTree does appear to offer more analytical power at a less extreme cost. It certainly appears competent in all that it does, and like all classification tree programs, provides analytical power you cannot find in other methods.

The base SPSS product, again, seems more like an incremental upgrade than anything else. Many users will find its new ability to rearrange data files quite helpful, and it is filled with many enhancements and performance improvements. Those of you who want to keep up with the latest developments definitely will want this release.

To close with what may be the most controversial assertion in this entire review, SPSS appears to remain the statistical analysis program that best combines depth of features and ease of use. Those of you who are not worn down to indifference by now, and who have contrasting points of view, are welcome to send rejoinders - very politely worded of course - to the reviewer at quirks.com.