Unstructured Data in Empirical Economics


5-9 September 2022


15:30 to 18:30 CEST



Intended for

Academic researchers and policy analysts who are using, or who wish to use, unstructured data sources such as text, detailed surveys, and financial transaction data in their work.


Participants should have a basic familiarity with probability and statistics at advanced undergraduate level. The hands-on classes will require students to work through Python notebooks that will be prepared in advance. Extension problems will involve modifying these notebooks, which requires familiarity with the basics of Python.


Over the past decade, the use of unstructured data has been growing steadily in economics and related disciplines, with a rapid acceleration in the wake of COVID-19. This course will begin with an overview of types of unstructured data and recent applications. We will then cover the basics of processing and filtering such data to produce reliable measurements of economic phenomena.

One of the key properties of many unstructured datasets is the vast number of features available for each observation. The remainder of the course will introduce strategies for exploiting this richness, most of which come from the machine learning (ML) literature.

We will first introduce “off-the-shelf” ML methods that empirical economists have productively used in recent research, including matrix factorization, word embeddings, and topic models. The motivating examples for these methods will largely come from text data, but we will also discuss other applications, including survey data.

The course will conclude by introducing methods for building and estimating new models that link unstructured data to the economic environment of interest more closely than is possible with off-the-shelf models. This will draw on recently introduced probabilistic programming languages that make complex inference problems feasible.

The core ideas from lectures will be complemented by hands-on classes in which students will apply the above techniques to real datasets.


Preprocessing and filtering unstructured data
Basic index construction
Embedding models: latent semantic analysis, word2vec, BERT
Probability models for discrete data: Dirichlet and multinomial distributions; multinomial inverse regression model; latent Dirichlet allocation
Automatic inference for probability models
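
As a rough illustration of the first two topics above, the following minimal Python sketch preprocesses raw text and builds a simple dictionary-based index, namely the share of tokens in each document that appear on a term list. It is not taken from the course notebooks: the toy documents, the stopword list, the term list, and the function names preprocess and term_share are all hypothetical and chosen only for illustration.

```python
import re
from collections import Counter

# Hypothetical toy corpus, stopword list, and term list (illustration only).
DOCUMENTS = [
    "Uncertainty about trade policy remains high; uncertain firms delay investment.",
    "Output grew steadily and inflation stayed close to target.",
]
STOPWORDS = {"about", "and", "the", "to", "a", "of"}
UNCERTAINTY_TERMS = {"uncertainty", "uncertain", "risk"}


def preprocess(text):
    """Lowercase, split into alphabetic tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def term_share(tokens, terms):
    """Share of tokens that belong to the term list (0.0 for an empty document)."""
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    return sum(counts[t] for t in terms) / len(tokens)


# One index value per document: higher means more term-list words per token.
index = [term_share(preprocess(doc), UNCERTAINTY_TERMS) for doc in DOCUMENTS]
print(index)
```

Normalizing by document length keeps long and short documents comparable. In applied work the term list and the filtering steps would need to be chosen and validated for the dataset at hand.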

Stephen Hansen is Associate Professor of Economics at Imperial College Business School, and a Fellow of the Centre for Economic Policy Research and CESifo. He previously held positions at the University of Oxford and Pompeu Fabra University, and received his PhD in Economics from the London School of Economics in 2009. He has also served as an academic consultant at the Bank of England and as a Fellow at the Alan Turing Institute. He currently sits on the scientific advisory board of the Ifo Institute and holds European Research Council Consolidator and Proof-of-Concept Grants. His research has been published in leading international journals, including the Quarterly Journal of Economics, Journal of Political Economy, Review of Economic Studies, and Journal of Monetary Economics.