This paper presents research on predicting National Basketball Association game attendance using a random forest approach. Attendance and other data obtained for the 2009 through 2013 basketball seasons are used. Predictor variables include home team popularity, popularity of opponent, match type (regular season or playoff), day of the week on which the match occurs, home team winning percentage, home city's total personal income, capacity of home venue, conference of the home team, lagged variables on attendance and on winning percentage, and others. A random forest approach, implemented in the R statistical modeling language, was selected so that numerous predictor variables could be used without first deselecting variables and without over-fitting the data. The random forest prediction compares favorably with that of a multiple linear regression. Additional results indicate that some variables suggested by sports writers do not contribute much to the prediction and that a better measure of a team's popularity is needed.
Keywords: ensemble method, R, regression

Professional basketball team managers need to forecast attendance at matches to plan staffing, decide on promotions, and estimate revenues. Numerous predictor variables come to mind from both the academic literature and the popular press. Because the set of predictor variables is large, it is easy to over-fit the data. One way to avoid over-fitting is to use a prediction technique that minimizes it. Random forest satisfies this requirement and is employed in this research specifically to avoid over-fitting the data. In a random forest, each node is split using the best among a subset of predictors randomly chosen at that node.
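The node-splitting rule described above can be illustrated with a minimal sketch. It is written in Python with only the standard library rather than in R, and it is not the implementation used in this research. The function names, the subset-size parameter `mtry`, the three predictors, and the attendance figures are all hypothetical and invented for illustration. The sketch assumes a regression tree that scores a split by the sum of squared errors of the two resulting groups.

```python
import random


def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split_on_random_subset(rows, targets, mtry, rng):
    """Draw `mtry` predictors at random and return the best split among them.

    Returns ((feature_index, threshold, total_sse), candidate_features).
    The first element is None if no candidate predictor can be split.
    """
    n_features = len(rows[0])
    candidates = rng.sample(range(n_features), mtry)
    best = None
    for j in candidates:
        # Candidate thresholds lie midway between adjacent distinct values.
        values = sorted(set(row[j] for row in rows))
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left = [y for row, y in zip(rows, targets) if row[j] <= threshold]
            right = [y for row, y in zip(rows, targets) if row[j] > threshold]
            score = sse(left) + sse(right)
            if best is None or score < best[2]:
                best = (j, threshold, score)
    return best, candidates


# Invented toy data (NOT from the study). Columns are hypothetical:
# [opponent popularity, weekend game (0/1), home winning percentage]
rows = [
    [0.2, 0, 0.40],
    [0.8, 1, 0.55],
    [0.5, 0, 0.60],
    [0.9, 1, 0.45],
    [0.3, 1, 0.50],
    [0.7, 0, 0.65],
]
# Invented attendance values, in thousands.
targets = [14.0, 19.5, 16.0, 20.0, 15.0, 18.5]

rng = random.Random(0)
split, considered = best_split_on_random_subset(rows, targets, mtry=2, rng=rng)
print("predictors considered at this node:", considered)
print("chosen split (feature, threshold, SSE):", split)
```

Because only a random subset of predictors competes at each node, the individual trees in the forest differ from one another, and averaging their predictions is what limits over-fitting. Setting `mtry` equal to the total number of predictors would reduce each node to an ordinary best-split search over all variables.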