CEMFI Summer School

Unstructured Data in Empirical Economics


4-8 September 2023


15:30 to 18:30 CEST



Intended for

Academic researchers and policy analysts who are using, or who wish to use, unstructured data sources such as text, detailed surveys, and financial transaction data in their work.


Participants should have a basic familiarity with probability and statistics at an advanced undergraduate level. The hands-on classes will require students to work through Python notebooks that will be prepared in advance. Extension problems will involve modifying these notebooks, which requires familiarity with the basics of Python.


Over the past decade, the use of unstructured data like text and images has been growing steadily in economics and related disciplines, with a rapid acceleration in the wake of COVID-19. Recent algorithmic advances have also yielded new tools like ChatGPT and GPT-4, whose ability to extract information from unstructured data creates exciting new possibilities for research. This course will provide an overview of the last ten years of research in unstructured data analysis, and bring students to the research frontier so that they can use these tools in their own work.

The course will combine three important elements. The first is an introduction to statistical and machine learning methods relevant for studying unstructured data. These include basic ideas in Bayesian statistics, matrix factorization, and neural networks. The goal is not a formal treatment of these topics but rather some guidance on the foundations of modern algorithms.

The second element is a discussion of applications in economics. For each class of algorithm, we will provide examples of recent applications in the empirical literature. We will also discuss the factors that might guide the choice of one algorithm over another.

The third element is the implementation of algorithms on real data. Students will learn this via hands-on practical classes in which we will use the Python programming language. These coding exercises will be detailed enough to allow students to replicate frontier papers and to begin using modern algorithms in their own work.


The bag-of-words model
Probability models for discrete data: multinomial inverse regression; latent Dirichlet allocation
Word embedding models: word2vec, GloVe
Large language models: BERT, GPT-3/4, ChatGPT
Convolutional neural networks for image classification
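
To give a flavour of the first topic, the following is a minimal sketch of a bag-of-words representation in plain Python. It is not taken from the course notebooks: the function names `tokenize` and `bag_of_words` and the two-sentence corpus are hypothetical, chosen only for illustration, and the tokenizer is deliberately crude (lowercase alphabetic strings only). The idea is that each document is reduced to a vector of term counts, discarding word order.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(documents):
    """Return (vocabulary, matrix), where matrix[d][v] is the number of
    times vocabulary term v appears in document d."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    # The vocabulary is the sorted union of all terms seen in any document.
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix


# Hypothetical toy corpus, for illustration only.
docs = [
    "Inflation rose and rates rose",
    "Rates fell as inflation fell",
]
vocab, dtm = bag_of_words(docs)

for term, column in zip(vocab, zip(*dtm)):
    print(f"{term:10s} {column}")
```

Each row of `dtm` is one document and each column one vocabulary term. In practice one would typically add preprocessing steps such as stop-word removal or stemming before counting.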

Stephen Hansen is Professor of Economics at University College London and a Fellow of the Centre for Economic Policy Research and CESifo. He previously held positions at Imperial College London, the University of Oxford, and Pompeu Fabra University. He received his PhD in Economics from the London School of Economics in 2009. He previously served as an academic consultant at the Bank of England and as a Fellow at the Alan Turing Institute. He currently sits on the scientific advisory board of the Ifo Institute and holds European Research Council Consolidator and Proof-of-Concept Grants. His research has been published in leading international journals, including the Quarterly Journal of Economics, Journal of Political Economy, Review of Economic Studies, and Journal of Monetary Economics.