【Speaker】Mochen Yang (楊漠塵), Assistant Professor, Kelley School of Business, Indiana University
【Topic】Generating Instrumental Variables via Random Forest to Address Endogeneity due to Prediction (Measurement) Error in Data-Mined Variables
【Time】Thursday, Nov. 29, 2018, 15:00-16:30
【Venue】Room 513, Weilun Building, Tsinghua SEM
【Language】English
【Organizer】Department of Management Science and Engineering
【Abstract】Combining machine learning with econometric analysis is increasingly common in both research and practice. In this work, we address one common example: the use of predictive modeling techniques to "mine" variables of interest from unstructured data, e.g., predicting sentiment from text, followed by the inclusion of those variables in an econometric framework, with the objective of making statistical inferences. We consider recent work highlighting that, because the predictions from machine learning models are inevitably imperfect, econometric analyses involving the predicted variables will suffer from biases and endogeneity deriving from measurement error. We propose a novel approach that mitigates these biases, leveraging instrumental variables generated from an ensemble learning technique known as the random forest. The random forest algorithm performs best when it is composed of trees that are individually accurate in their predictions and that make "different" mistakes, i.e., have weakly correlated prediction errors. A key observation is that these properties are close analogs of the relevance and exclusion requirements for a valid instrumental variable. We design a data-driven procedure to select tuples of individual trees from a random forest, in which one tree serves as the endogenous covariate and the other trees serve as its instruments. Simulation experiments demonstrate the efficacy of the proposed approach in mitigating estimation biases, and its superior performance relative to an alternative method (simulation-extrapolation) proposed in prior work for addressing this problem.
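The intuition behind the abstract's key observation can be illustrated with a small simulation. The sketch below is not the speaker's procedure: it trains no random forest and selects no tree tuples. It assumes two hypothetical "tree predictions", `pred_a` and `pred_b`, each equal to the true variable plus independent classical noise, and compares ordinary least squares on one noisy prediction with an instrumental-variable estimate that uses the other prediction as the instrument. All names and parameter values are illustrative assumptions.

```python
import random


def cov(a, b):
    """Sample covariance of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (n - 1)


def simulate(n=20000, beta=2.0, noise_sd=1.0, seed=0):
    """Simulate an outcome and two imperfect predictions of the same variable.

    Assumption: the prediction errors are independent of each other, of the
    true variable, and of the outcome noise (classical measurement error).
    """
    rng = random.Random(seed)
    x_true = [rng.gauss(0, 1) for _ in range(n)]
    y = [beta * x + rng.gauss(0, 1) for x in x_true]
    pred_a = [x + rng.gauss(0, noise_sd) for x in x_true]  # stand-in for tree A
    pred_b = [x + rng.gauss(0, noise_sd) for x in x_true]  # stand-in for tree B
    return y, pred_a, pred_b


def ols_slope(y, x):
    """Slope from a simple regression of y on x."""
    return cov(x, y) / cov(x, x)


def iv_slope(y, x, z):
    """Single-instrument IV (Wald) estimate: cov(z, y) / cov(z, x)."""
    return cov(z, y) / cov(z, x)


y, pred_a, pred_b = simulate()
beta_ols = ols_slope(y, pred_a)         # biased toward zero by the noise in pred_a
beta_iv = iv_slope(y, pred_a, pred_b)   # pred_b instruments for pred_a
print(f"OLS on noisy prediction: {beta_ols:.3f}")
print(f"IV with second prediction: {beta_iv:.3f}")
```

Under these assumptions the OLS slope is attenuated toward zero (in theory to about half of the true coefficient of 2.0, because the noise variance equals the signal variance), while the IV estimate should land close to 2.0. The second prediction works as an instrument here because it is correlated with the first through the true variable (relevance) while its error is unrelated to the first prediction's error (exclusion). Errors of real machine learning predictions need not be classical or independent across trees, which is presumably why the talk proposes a data-driven selection step.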