Webinar Content
In December 2022, Eurostat launched its first deduplication challenge as part of the European Statistics Awards Programme Web Intelligence competitions. The challenge revolved around identifying potential duplicates among job postings sourced from the web. Duplicates are a significant concern when producing high-quality European statistics from online job advertisements, as companies often publish the same job offer on several web portals.
This webinar will present the work of two teams that participated in the challenge: Spub.Fr, a collaboration between Insee and Dares, and Nins, which received the second prize for reproducibility. It will focus on the methodological lessons learned from this experience and the difficulties encountered along the way.
On the program:
- Different methods to identify duplicates in a multilingual dataset, using job advertisements as the use case. These will include Named-Entity Recognition, transformer-based approaches that compare the similarity of the offers' vector embeddings, and experiments with MinHash.
- Examples of best practices for conducting a data science project (illustrated by the deduplication challenge), such as using the Kedro framework for Python, and a presentation of the Onyxia Datalab.
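For readers unfamiliar with MinHash, one of the methods listed above, the following is a minimal, illustrative sketch of the general technique in plain Python. It is not the code of either team: the function names, the word-shingle size and the number of hash functions are assumptions chosen for the example. The idea is to summarise each advertisement by a short signature whose agreement rate approximates the Jaccard similarity of the underlying shingle sets.

```python
import hashlib


def shingles(text, k=2):
    """Return the set of k-word shingles of a lowercased text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(seed, token):
    """Deterministic 64-bit hash of a token, salted with a seed."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(tokens, num_perm=128):
    """For each of num_perm salted hash functions, keep the minimum hash value."""
    return [min(_hash(seed, t) for t in tokens) for seed in range(num_perm)]


def estimate_jaccard(sig_a, sig_b):
    """The share of matching signature positions estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def jaccard(a, b):
    """Exact Jaccard similarity of two sets, for comparison."""
    return len(a & b) / len(a | b)


# Hypothetical job advertisements, invented for this illustration.
ad_a = "Data scientist position in Paris with Python and NLP experience required"
ad_b = "Data scientist position in Paris with Python and NLP experience preferred"
ad_c = "Truck driver wanted for night shifts in northern Finland"

sig_a = minhash_signature(shingles(ad_a))
sig_b = minhash_signature(shingles(ad_b))
sig_c = minhash_signature(shingles(ad_c))

print("a vs b:", estimate_jaccard(sig_a, sig_b), "exact:", jaccard(shingles(ad_a), shingles(ad_b)))
print("a vs c:", estimate_jaccard(sig_a, sig_c), "exact:", jaccard(shingles(ad_a), shingles(ad_c)))
```

Because this sketch compares word shingles, it can only detect near-duplicates written in the same language; pairing an advertisement with its translation would presumably require a different representation, such as the multilingual embeddings mentioned above.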
Learning Objectives
- An overview of deduplication approaches, notably for multilingual datasets.
- Examples of best practices for a data science project.
Who will benefit from the webinar?
Statisticians (or similar) with some knowledge of natural language processing (NLP) and Python who are interested in deduplication.
Trusted Smart Statistics – Web Intelligence Network
Background
The Web Intelligence Network (WIN) project is creating an environment where an array of non-traditional data sources can be accessed by members of the European Statistical System (ESS) and beyond. This project works hand in hand with the Web Intelligence Hub (WIH), a platform where data is accessed.
The Web Intelligence Hub (WIH) is designed to provide National Statistical Institutes (NSIs) exposure to new data sets, enabling data sharing within a trusted, secure environment. New data sets collated from modern technologies can help produce statistics more cost-effectively by harnessing the power of the digital era.
By harnessing technology, the WIH will enable collaboration that supports the modernisation of official statistics. It will create a collaborative community to:
- Learn together
- Share knowledge, data, and methods
- Build a partnership between NSIs
- Enable cooperation and sharing of statistical solutions for NSIs