Lessons learned from Eurostat's Deduplication Challenge

This webinar will present the work of two teams that participated in the challenge.

By ESSnet Web Intelligence Network Project

Date and time

Mon, 13 May 2024 06:00 - 08:00 PDT

Location

Online

About this event

  • 2 hours

Webinar Content

In December 2022, Eurostat launched its first deduplication challenge as part of the European Statistics Awards Programme Web Intelligence competitions. It revolved around identifying potential duplicates within job postings sourced from the web. Duplicates are a significant concern when producing high-quality European statistics from online job advertisements, as companies often publish the same job offer on several web portals.

This webinar will present the work of two teams that participated in the challenge: Spub.Fr, a collaboration between Insee and Dares, and Nins, which received the second prize for reproducibility. It will focus on the methodological lessons learned and the difficulties encountered during this experience.

On the program:

  • Different methods to identify duplicates in a multilingual dataset, using job advertisements as the use case. These include Named-Entity Recognition, transformer-based approaches that compare the similarity of the offers' vector embeddings, and experiments with MinHash.
  • Examples of best practices for conducting a data science project, illustrated by the deduplication challenge, such as using the Kedro framework for Python, along with a presentation of the Onyxia Datalab.
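
For readers unfamiliar with MinHash, here is a minimal sketch of the general idea, not the method used by either team. It assumes each advertisement is split into overlapping three-word shingles and that seeded BLAKE2b hashes stand in for random permutations. The function names (shingles, minhash_signature, estimated_jaccard) and the sample advertisements are hypothetical and chosen only for illustration.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles of a lower-cased text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(items, num_perm=64):
    """Keep the minimum hash value of the set under each seeded hash function."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "big")  # a different salt per hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """The share of matching signature positions estimates the Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


# Hypothetical advertisements: the second repeats the first with two extra words.
ad_a = ("Data scientist wanted in Paris for a statistics team "
        "working on natural language processing")
ad_b = ad_a + " apply now"
ad_c = "Experienced truck driver needed for regional deliveries around Lisbon"

sig_a = minhash_signature(shingles(ad_a))
sig_b = minhash_signature(shingles(ad_b))
sig_c = minhash_signature(shingles(ad_c))

print("near-duplicate pair:", estimated_jaccard(sig_a, sig_b))
print("unrelated pair:", estimated_jaccard(sig_a, sig_c))
```

A pair whose estimated similarity exceeds a chosen threshold would then be flagged as a potential duplicate. The threshold, the shingle size, and the number of hash functions are tuning choices, and word shingles only capture duplicates written in the same language.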

Learning Objectives

  • An overview of deduplication approaches, notably for multilingual datasets.
  • Examples of best practices for a data science project.

Who will benefit from the webinar?

Statisticians (or similar) with some knowledge of natural language processing (NLP) and Python who are interested in deduplication.

Trusted Smart Statistics – Web Intelligence Network

Background

The Web Intelligence Network (WIN) project is creating an environment where an array of non-traditional data sources can be accessed by members of the European Statistical System (ESS) and beyond. This project works hand in hand with the Web Intelligence Hub (WIH), a platform where data is accessed.

The Web Intelligence Hub (WIH) is designed to provide National Statistical Institutes (NSIs) exposure to new data sets, enabling data sharing within a trusted, secure environment. New data sets collated from modern technologies can help produce statistics more cost-effectively by harnessing the power of the digital era.

The WIH will harness technology and, in turn, allow collaboration to benefit the modernisation of official statistics. The WIH will create a community of collaboration to:

  • Learn together
  • Share knowledge, data, and methods
  • Build a partnership between NSIs
  • Enable cooperation and sharing of statistical solutions for NSIs

Presenter Biographies:

Antoine Palazzolo is a data scientist in the innovation lab of Insee (French National Statistics Institute), where he mainly works on projects related to Natural Language Processing, such as the automatic coding of professions in national surveys. He is also a member of Work Package 4 of the WIN, within the Architecture group.

Privacy notice

The data you provide will be used by the Office for National Statistics to contact you in connection with the above event and to understand the attendance level and interest in the event. ONS will store your data securely on our IT infrastructure and your data will not be shared with other organisations. Personal data will be kept until the end of the project (March 2025) and then deleted. From time to time, we will contact you to keep you up to date with new events we have planned.

If you have a question about how we process your personal data and you can’t find the answer on our website, you can contact our Data Protection Officer at DPO@statistics.gov.uk or 0845 6013034. To find out more about your rights under data protection legislation, or how to raise a concern with the Information Commissioner, see our website www.ons.gov.uk/dataprotection or the Information Commissioner's Office at https://ico.org.uk.

Your data will be processed by Eventbrite – please read their privacy policy.

Organised by

Welcome to the European Statistical System Collaborative Network (ESSnet) Web Intelligence Network project, which began in April 2020 and will run until March 2025. This is a continuation of the previous ESSnet Big Data and ESSnet on Big Data II projects, which commenced in 2016. The aim of the network is to help National Statistical Institutes (NSIs) understand the need to change and advance the production of national statistics. New technologies and data sources have tremendous potential to improve statistical production. They offer a way to generate statistics in an efficient, accurate, and cost-effective manner.