Harkness Hall, Rochester, NY, 14611

Join the Goergen Institute for Data Science and the Department of Political Science for "Statistical Analysis with Machine Learning Predicted Variables," a seminar with Soichiro Yamauchi, Data Scientist at Google.

Abstract: Scholars in the social sciences are increasingly relying on machine learning (ML) techniques to construct data from large corpora of text and images. The ML-generated variables are subsequently utilized in statistical analysis to address substantive questions through regression and hypothesis testing. However, this approach can introduce substantial bias and lead to incorrect inferences due to prediction errors during the machine learning stage. In this paper, we present an approach that incorporates ML-generated variables into regression analysis while ensuring consistency and asymptotic normality. The proposed approach leverages a small-scale human-coded sample to capture the bias in the naive estimator, without the need for strict assumptions about the structure of prediction errors. Furthermore, we have developed diagnostic tools to assess whether additional human coding can further reduce variance in the main analysis. We illustrate the effectiveness of our method by revisiting a study on the sources of election fraud with ballot image data and regression analysis.

Bio: Soichiro Yamauchi is a political methodologist. He is currently a Data Scientist at Google. He received his PhD from the Department of Government at Harvard University in 2022. His research interests include causal inference with panel data, survey methodology, and statistical inference with machine-learning-generated data.

This seminar is part of a faculty search for a tenure-track Assistant Professor of Computational Social Science, led by the Goergen Institute for Data Science in collaboration with the Departments of Political Science, Linguistics, and Economics.

Event Details

Please click the link below to join the webinar:

Meeting ID: 991 9881 8526
