Knowledge for Knowledge’s Sake

Documentation

Introduction

On this page you can find all the documentation for the project Knowledge for Knowledge’s Sake. The aim of the project is to investigate how socio-cultural and economic inequalities influence individual access to knowledge and shape avenues for personal growth in STEM and Humanities fields.

All the scripts used to process our data are available on the project’s GitHub page.

Scenario

To carry out our research, we collected data from different sources and re-used it to create our own dataset. We aimed to re-use datasets that are free of cognitive bias, prejudice and discrimination, and that are fair, reliable, legally valid, relevant, consistent and accurate. Given the variety of institutional sources we used, this was not an easy task. The findings of our investigation explore how socio-cultural and economic inequalities influence individual access to knowledge and shape opportunities for personal growth in STEM and Humanities fields, offering insights that we hope will contribute to public discourse.

Datasets

Original Datasets

The datasets used to investigate how socio-cultural and economic inequalities influence individual access to knowledge include the following data from the respective sources:

Name | Source | URI | Metadata | Privacy | License | Format
OECD PISA 2000-2022 Reports | OECD | | Provided | OECD Privacy Guidelines | OECD, CC BY 4.0 | .sav, .sps+.dat, .csv
World Bank Development Indicators | WBG | WDI_EXCEL.xlsx | Provided | Data Privacy | CC BY 4.0 | .csv, .xlsx
New entrants by education level, programme orientation, sex and field of education | Eurostat | estat_educ_uoe_ent02_filtered_en.csv | Provided | Regulation (EU) 2018/1725 | CC BY 4.0 | .tsv, .csv
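
The Eurostat source is distributed as .tsv, a layout that packs several dimension codes into the first column and can attach flag letters to values. The following is a minimal sketch of how such a file could be flattened into records, not the script used in the project. The function name `parse_eurostat_tsv`, the record keys and the sample rows are assumptions made for illustration; the sample values are placeholders, not real figures.

```python
import csv
import io


def parse_eurostat_tsv(text):
    """Flatten Eurostat-style TSV text into a list of records (illustrative)."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    # The first header cell looks like "freq,sex,geo\TIME_PERIOD":
    # dimension names before the backslash, the column dimension after it.
    dim_names = header[0].split("\\")[0].split(",")
    periods = [p.strip() for p in header[1:]]
    records = []
    for row in reader:
        if not row:
            continue
        dims = dict(zip(dim_names, row[0].split(",")))
        for period, cell in zip(periods, row[1:]):
            parts = cell.split()
            # ":" marks a missing value; a trailing letter is a flag.
            if not parts or parts[0] == ":":
                continue
            records.append({
                **dims,
                "period": period,
                "value": float(parts[0]),
                "flag": parts[1] if len(parts) > 1 else "",
            })
    return records


# Placeholder rows that only mimic the layout; the numbers are made up.
SAMPLE = (
    "freq,sex,geo\\TIME_PERIOD\t2020 \t2021 \n"
    "A,F,IT\t1200 \t: \n"
    "A,M,IT\t1500 b\t1600 \n"
)
records = parse_eurostat_tsv(SAMPLE)
```

Keeping the flag next to the value, rather than discarding it, lets later steps decide how to treat observations that Eurostat marks as provisional or as a break in the series.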

Mashed-up Dataset

To effectively manage the integration of multiple datasets with varying licensing conditions, we strictly adhered to the EU Open Data Guidelines, ensuring legal, ethical, and technical compliance. Following these guidelines, we aimed to make our research data Findable, Accessible, Interoperable, and Reusable (FAIR) through a structured approach to selection, curation, and publication.

Each dataset was meticulously evaluated for provenance, licensing, quality, and accuracy, ensuring freedom from cognitive biases and full legal validity. All of the datasets used were already anonymized, so we did not have to apply privacy-preserving techniques to mitigate re-identification risks. To enhance interoperability, we structured the datasets according to Linked Open Data (LOD) principles, described them with DCAT-AP metadata and RDF assertions, and released them under an open license.
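
As an illustration of what a DCAT-style description can look like, the sketch below builds a simplified JSON-LD record for one dataset using only the Python standard library. It is not the project's actual metadata: the function name `describe_dataset` and the `example.org` URIs are hypothetical, and the record covers only a handful of the properties that a complete DCAT-AP description would carry.

```python
import json

# Namespaces of the W3C DCAT, Dublin Core Terms and FOAF vocabularies.
CONTEXT = {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
    "foaf": "http://xmlns.com/foaf/0.1/",
}


def describe_dataset(uri, title, publisher, license_uri, download_url, media_type_uri):
    """Return a simplified DCAT description as a JSON-LD dictionary."""
    return {
        "@context": CONTEXT,
        "@id": uri,
        "@type": "dcat:Dataset",
        "dct:title": title,
        "dct:publisher": {"@type": "foaf:Agent", "foaf:name": publisher},
        "dcat:distribution": {
            "@type": "dcat:Distribution",
            "dct:license": {"@id": license_uri},
            "dcat:downloadURL": {"@id": download_url},
            "dcat:mediaType": {"@id": media_type_uri},
        },
    }


# Hypothetical identifiers under example.org, for illustration only.
doc = describe_dataset(
    uri="https://example.org/dataset/mashup",
    title="Mashed-up dataset",
    publisher="Knowledge for Knowledge's Sake",
    license_uri="https://creativecommons.org/licenses/by/4.0/",
    download_url="https://example.org/data/output.xml",
    media_type_uri="https://www.iana.org/assignments/media-types/application/xml",
)
jsonld_text = json.dumps(doc, indent=2)
```

JSON-LD is used here only because it needs no third-party library; the same statements could equally be written in Turtle or RDF/XML.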

Ethical considerations were a key focus, with proactive measures taken to eliminate discrimination, cognitive bias, and unintended prejudices in the dataset selection and processing stages. Additionally, to maximize transparency and accessibility, results were visualized in a human-readable format and published on a one-page website.

The result of the mash-up is available in the output.xml file.
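
To give an idea of how records from several sources can be combined, here is a minimal sketch that merges observations sharing the same country and year and serializes the result as XML. It assumes the sources have already been reduced to flat records with `country` and `year` keys. The element names (`mashup`, `observation`, `indicator`), the function names and the placeholder values are all hypothetical and do not describe the actual structure of output.xml.

```python
import xml.etree.ElementTree as ET


def mash_up(*sources):
    """Merge records that share the same (country, year) key."""
    merged = {}
    for records in sources:
        for rec in records:
            key = (rec["country"], rec["year"])
            indicators = {k: v for k, v in rec.items() if k not in ("country", "year")}
            merged.setdefault(key, {}).update(indicators)
    return merged


def to_xml(merged):
    """Serialize the merged observations with hypothetical element names."""
    root = ET.Element("mashup")
    for (country, year), indicators in sorted(merged.items()):
        obs = ET.SubElement(root, "observation", country=country, year=str(year))
        for name, value in sorted(indicators.items()):
            ET.SubElement(obs, "indicator", name=name).text = str(value)
    return ET.tostring(root, encoding="unicode")


# Placeholder records: country codes and values are made up.
source_a = [{"country": "XX", "year": 2018, "score": 1.0}]
source_b = [
    {"country": "XX", "year": 2018, "gdp": 2.0},
    {"country": "YY", "year": 2018, "gdp": 3.0},
]
merged = mash_up(source_a, source_b)
xml_text = to_xml(merged)
```

A key present in only one source, such as the `YY` record above, still appears in the output with the indicators that are available, which keeps gaps in coverage visible instead of silently dropping them.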

Quality Analysis

Publicly available datasets on the topic of education-to-work transition are aggregated in a way that makes it difficult to link socioeconomic status, field of study (ISCED-13), occupation level (ISCO-08) and occupational activity (NACE), except in rare cases that have poor geopolitical and temporal coverage.

A more robust analysis would require access to Eurostat microdata from the EU Labour Force Survey (EU-LFS) and the EU Statistics on Income and Living Conditions (EU-SILC).

In fact, unlike the PISA questionnaires, the public use files (PUFs) of these two key statistical sources are randomized and have no informational value. However, the use of non-publicly available sources is outside the scope of our project.

The data we collected for the purposes of our research derive from different sources and are therefore subject to different types of license, when specified. The datasets we used were released either under the Creative Commons license CC BY 4.0 or under the OECD terms and conditions. These licenses allow users to share and adapt the material, as long as they give appropriate credit, provide a link to the license, and indicate if changes were made. Moreover, users cannot add additional restrictions.

Ethical Analysis

How sustainable and bias-free are our data providers?

Technical Analysis

Sustainability of the project

The source datasets used for this project are provided by Eurostat, OECD and The World Bank Group, which maintain them in their respective databases. While the URIs in this project may eventually become obsolete, the data remains accessible through these institutions.

Knowledge for Knowledge’s Sake is the final project developed for the Open Access and Digital Ethics course (a.y. 2024/2025) within the Digital Humanities and Digital Knowledge Master’s Degree (University of Bologna). As such, it is not actively maintained and will not be updated in the future. However, the scripts used to process the data are available under the CC BY 4.0 license and can be rerun with updated versions of the datasets.

Visualisations

The visualizations for the source datasets are available on the secondary-education, university and employment pages. The map visualization for the mash-up can be found on the map page.

RDF Metadata

We used the RDF Diagram Framework from the Institute for Applied Informatics (InfAI) to encode the metadata about all our data, including the original datasets and our mashed-up dataset. The code is available in our GitHub project.