Data Science for Economics: Mastering Unstructured Data

Dates

2-6 September 2024

Hours

15:00 to 18:30 CEST

Format

In person

Intended for

Academic researchers, policy analysts, data analysts, and consultants who are using, or who wish to use, unstructured data sources such as text, detailed surveys, images, or speech data in their work.

Prerequisites

Basic familiarity with probability and statistics at an advanced undergraduate level. The hands-on classes require students to work through Python notebooks prepared in advance. Extension problems involve modifying these notebooks, which requires familiarity with the basics of Python. A teaching assistant will provide an introductory session on Python, so previous programming experience in other languages is sufficient.

Overview

Over the last decade, the use of unstructured data such as text and images has grown significantly in economics and related disciplines. With the advent of technologies like ChatGPT and other large language models, our capacity to extract insights from such data has expanded, opening up many new opportunities. This course aims to give participants a thorough understanding of how to analyze unstructured data, so that they can integrate these tools into their research and projects.

The curriculum is structured around four key components:

1. Introduction to analytical techniques: The course begins with statistical and machine learning techniques relevant to the analysis of unstructured data. Topics include Bayesian updating, matrix factorization, and predictive modeling with neural networks. The focus is on demystifying the principles behind contemporary algorithms rather than on exhaustive theoretical detail.

2. Economic applications: We will examine how these algorithms are applied in economics, using examples from recent empirical studies. Discussions will also cover how to select an appropriate method for a given research need.

3. Practical implementation: A significant portion of the course involves hands-on training, where participants will engage with real datasets using Python. Through detailed coding sessions, students will gain the skills necessary to not only understand and replicate leading-edge research but also to incorporate advanced methods into their own projects.

4. Data collection and preparation: Lastly, the course will introduce methodologies for collecting unstructured data and converting it into formats amenable to analysis with the learned techniques.

By focusing on practical skills and building intuition for the underlying theory, this course aims to bring participants to the forefront of unstructured data analysis, enabling them to make full use of modern analytical tools in their research.

Topics

  • Descriptive text analysis: The bag-of-words model, tf-idf, dictionary methods
  • Probability models for discrete data: latent Dirichlet allocation
  • Word embedding models: word2vec, GloVe
  • Large language models and fine-tuning
  • How to evaluate predictions
  • Neural networks for image classification
  • Web scraping
  • Speech to text; analysis of speech tone
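
As a flavor of the first topic, the bag-of-words model and tf-idf weighting can be sketched in a few lines of plain Python. This is a minimal illustration and not taken from the course notebooks: the function names, the simple whitespace tokenizer, the three example sentences, and the particular tf-idf variant (term frequency as a share of document length, idf as log(N / df)) are all assumptions made here for the example.

```python
import math
from collections import Counter

PUNCTUATION = ".,;:!?()\"'"

def tokenize(text):
    """Lowercase, split on whitespace, and strip basic punctuation (illustrative only)."""
    tokens = (word.strip(PUNCTUATION).lower() for word in text.split())
    return [t for t in tokens if t]

def bag_of_words(docs):
    """Represent each document as a Counter of term counts, ignoring word order."""
    return [Counter(tokenize(doc)) for doc in docs]

def tf_idf(docs):
    """Weight each term by (count / document length) * log(N / document frequency)."""
    bows = bag_of_words(docs)
    n_docs = len(bows)
    # Document frequency: number of documents in which each term appears.
    df = Counter()
    for bow in bows:
        df.update(bow.keys())
    weights = []
    for bow in bows:
        total = sum(bow.values())
        weights.append(
            {term: (count / total) * math.log(n_docs / df[term])
             for term, count in bow.items()}
        )
    return weights

# Hypothetical example sentences, invented for this sketch.
docs = [
    "Inflation rose as energy prices rose",
    "The central bank raised interest rates",
    "Energy prices fell and inflation eased",
]
bows = bag_of_words(docs)
weights = tf_idf(docs)
```

Under this weighting, a term that is frequent in one document but rare across the collection (here "rose" in the first sentence) receives a higher weight than a term shared by several documents (such as "inflation").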

Practical Classes

This course will have voluntary practical classes taught by a teaching assistant. These classes will cover the setup of relevant software and how to run some of the code used in the course.

Christopher Rauh is Professor of Economics and Data Science at the University of Cambridge, a Fellow of Trinity College Cambridge, and a Research Affiliate at CEPR, HCEO, IZA, and PRIO. He is a co-founder of conflictforecast.org, a website providing monthly updates of conflict risk across the globe. He has conducted multiple projects with the FCDO, German Foreign Office, and IMF, and has published in a wide range of journals in Economics and Political Science.
