Documentation

Introduction

On this page you can find all the documentation for the project Knowledge for Knowledge’s Sake. The aim of the project is to investigate how socio-cultural and economic inequalities influence individual access to knowledge and shape avenues for personal growth in STEM and Humanities fields.

All the scripts used to process our data are available on the GitHub page of the project.

Scenario

To carry out our research, we collected data from different sources and re-used it to create our own dataset. We aimed to re-use datasets that are free of cognitive bias, prejudice and discrimination, and that are fair, reliable, legally valid, relevant, consistent and accurate. Given the variety of institutional sources we used, this was not an easy task. The findings of our investigation explore how socio-cultural and economic inequalities influence individual access to knowledge and shape opportunities for personal growth in STEM and Humanities fields, offering insights that we hope will contribute to public discourse.

Datasets

Original Datasets

The datasets used to investigate how socio-cultural and economic inequalities influence individual access to knowledge include the following data from the respective sources:

Name	Source	URI	Metadata	Privacy	License	Format
OECD PISA 2000-2022 Reports	OECD		Provided	OECD Privacy Guidelines	OECD, CC BY 4.0	.sav .sps+.dat .csv
World Bank Development Indicators	WBG	WDI_EXCEL.xlsx	Provided	Data Privacy	CC BY 4.0	.csv .xlsx
New entrants by education level, programme orientation, sex and field of education	Eurostat	estat_educ_uoe_ent02_filtered_en.csv	Provided	Regulation (EU) 2018/1725	CC BY 4.0	.tsv .csv

Mashed-up Dataset

To effectively manage the integration of multiple datasets with varying licensing conditions, we strictly adhered to the EU Open Data Guidelines, ensuring legal, ethical, and technical compliance. Following these guidelines, we aimed to make our research data Findable, Accessible, Interoperable, and Reusable (FAIR) through a structured approach to selection, curation, and publication.

Each dataset was evaluated for provenance, licensing, quality, and accuracy, to check for cognitive biases and to confirm its legal validity. All of the datasets used were already anonymized, so we did not have to apply privacy-preserving techniques to mitigate re-identification risks. To enhance interoperability, we structured the datasets according to Linked Open Data (LOD) principles, described them with DCAT-AP metadata and RDF assertions, and released them under an open license.
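As an illustration of what a DCAT-style description can look like, the following minimal sketch builds an RDF/XML record for one dataset with Python's standard library. It is not the project's actual metadata: the dataset URI, the download URL and the function name `describe_dataset` are hypothetical, and only a handful of DCAT and Dublin Core terms are used.

```python
import xml.etree.ElementTree as ET

# Namespace URIs of the vocabularies used in the sketch.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
}
for prefix, uri in NS.items():
    ET.register_namespace(prefix, uri)


def q(prefix, local):
    """Return the '{namespace}local' name that ElementTree expects."""
    return f"{{{NS[prefix]}}}{local}"


def describe_dataset(dataset_uri, title, download_url, licence_uri):
    """Build an rdf:RDF element with one dcat:Dataset and one distribution."""
    root = ET.Element(q("rdf", "RDF"))
    dataset = ET.SubElement(
        root, q("dcat", "Dataset"), {q("rdf", "about"): dataset_uri}
    )
    ET.SubElement(dataset, q("dct", "title")).text = title
    # The distribution carries the access URL and the licence.
    link = ET.SubElement(dataset, q("dcat", "distribution"))
    dist = ET.SubElement(link, q("dcat", "Distribution"))
    ET.SubElement(dist, q("dcat", "accessURL"), {q("rdf", "resource"): download_url})
    ET.SubElement(dist, q("dct", "license"), {q("rdf", "resource"): licence_uri})
    return root


# Hypothetical URIs, used only for illustration.
record = describe_dataset(
    "https://example.org/dataset/mashup",
    "Mashed-up dataset (illustrative record)",
    "https://example.org/dataset/mashup/output.xml",
    "https://creativecommons.org/licenses/by/4.0/",
)
rdf_xml = ET.tostring(record, encoding="unicode")
print(rdf_xml)
```

A full DCAT-AP record would include further properties (for example publisher, description and format); they are omitted here to keep the sketch short.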

Ethical considerations were a key focus, with proactive measures taken to eliminate discrimination, cognitive bias, and unintended prejudices in the dataset selection and processing stages. Additionally, to maximize transparency and accessibility, results were visualized in a human-readable format and published on a one-page website.

The result of the mash-up is available in the output.xml file.
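To give an idea of what combining two sources involves, the sketch below joins Eurostat-style rows with World Bank-style rows on country and year. It is an assumption-laden illustration, not the project's script: the field names, the sample values and the function `mash_up` are invented, and the only domain fact relied on is that Eurostat identifies countries with two-letter codes while the World Development Indicators use three-letter codes, so a mapping is needed.

```python
# Invented sample rows; real values must come from the source files.
eurostat_rows = [
    {"geo": "IT", "year": 2020, "new_entrants": 1000},
    {"geo": "FR", "year": 2020, "new_entrants": 2000},
]
wdi_rows = [
    {"country_code": "ITA", "year": 2020, "indicator_value": 1.5},
    {"country_code": "DEU", "year": 2020, "indicator_value": 2.5},
]

# Partial, illustrative mapping from two-letter to three-letter codes.
ISO2_TO_ISO3 = {"IT": "ITA", "FR": "FRA", "DE": "DEU"}


def mash_up(eurostat_rows, wdi_rows, code_map):
    """Inner-join the two sources on (three-letter country code, year)."""
    wdi_index = {(r["country_code"], r["year"]): r for r in wdi_rows}
    merged = []
    for row in eurostat_rows:
        iso3 = code_map.get(row["geo"])
        match = wdi_index.get((iso3, row["year"]))
        if match is None:
            continue  # keep only country-years present in both sources
        merged.append(
            {
                "country_code": iso3,
                "year": row["year"],
                "new_entrants": row["new_entrants"],
                "indicator_value": match["indicator_value"],
            }
        )
    return merged


merged = mash_up(eurostat_rows, wdi_rows, ISO2_TO_ISO3)
print(merged)
```

An inner join is used here so that only country-years covered by both sources survive; a different join strategy would be needed if missing values are to be kept.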

Quality Analysis

Publicly available datasets on the topic of education-to-work transition are aggregated in a way that makes it difficult to link socioeconomic status, field of study (ISCED-13), occupation level (ISCO-08) and occupational activity (NACE). The rare cases where this is possible have poor geopolitical and temporal coverage.

A more robust analysis would require access to Eurostat microdata from the EU Labour Force Survey (EU-LFS) and the EU Statistics on Income and Living Conditions (EU-SILC).

In fact, unlike the PISA questionnaires, the public use files (PUFs) of these two key statistical sources are randomized and have no informational value. However, the use of non-publicly available sources is outside the scope of our project.

OECD: The OECD ensures the quality of its PISA data through rigorous methodologies and robust sampling techniques. The data is collected, validated, and analyzed following the OECD Privacy Guidelines. Content published from 1 July 2024 is released under the CC BY 4.0 license, while content published before this date follows the OECD terms and conditions. The credibility of PISA data is reinforced by its widespread use in education policy and research.
WBG: The quality and integrity of the data produced by the World Bank Group is guaranteed by its Development Data Group, which adopts professional standards in the collection, compilation and dissemination of data. The World Bank also helps developing countries improve the efficiency of their national statistical systems, which matters because most of the data come from the member countries’ systems. According to the Data Access And Licensing page, the dataset is provided under a CC BY 4.0 license, ensuring open access to reliable economic and development data.
Eurostat: Eurostat maintains high-quality control over this dataset, ensuring compliance with Regulation (EU) 2018/1725 to safeguard data privacy and integrity. The data is collected from national statistical agencies and undergoes Eurostat’s rigorous validation procedures. According to the Copyright notice and free re-use of data, the datasets provided by Eurostat are licensed under CC BY 4.0, which enables accessibility while maintaining transparency and methodological rigor.

Legal Analysis

The data we collected for the purposes of our research derive from different sources and are therefore subject to different licenses, where specified. The datasets we used were released either under the Creative Commons license CC BY 4.0 or under the OECD terms and conditions. These licenses allow users to share and adapt the material, as long as they give appropriate credit, provide a link to the license, and indicate if changes were made. Moreover, users cannot add additional restrictions.

OECD:
- Privacy: OECD Privacy Guidelines.
- License: According to the terms and conditions, content published by OECD before 1 July 2024 is subject to the OECD terms and conditions, while content published from 1 July 2024 is licensed under CC BY 4.0.
- Purpose: The PISA database contains the full set of responses from individual students, school principals and parents. These files will be of use to statisticians and professional researchers who would like to undertake their own analysis of the PISA data.
WBG:
- Privacy: Data Privacy
- License: CC BY 4.0
- Purpose: The World Development Indicators (WDI) is the primary World Bank collection of development indicators, compiled from officially-recognized international sources. It presents the most current and accurate global development data available, and includes national, regional and global estimates.
Eurostat:
- Privacy: Eurostat follows Regulation (EU) 2018/1725
- License: CC BY 4.0
- Purpose: The dataset educ_uoe_ent02 holds the number of new entrants by education level, programme orientation, sex and field of education
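
The date-based OECD licensing rule described above can be expressed as a small check. This is only a sketch of the rule as stated on this page; the function name `oecd_license` and the constant `CC_BY_START` are hypothetical, and third-party restrictions on individual datasets are not modelled.

```python
from datetime import date

# First day on which OECD content is released under CC BY 4.0,
# according to the rule stated on this page.
CC_BY_START = date(2024, 7, 1)


def oecd_license(published: date) -> str:
    """Return the license that applies to OECD content published on a given day."""
    if published >= CC_BY_START:
        return "CC BY 4.0"
    return "OECD terms and conditions"


print(oecd_license(date(2023, 12, 5)))
print(oecd_license(date(2024, 7, 1)))
```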

Ethical Analysis

How sustainable and bias-free are our data providers?

OECD: The OECD Privacy Guidelines govern the ethical handling of data, ensuring privacy protection, transparency, and responsible data use. While the OECD provides open access to data, its Terms and Conditions specify that some datasets may have third-party ownership or additional restrictions. Users must verify metadata for such limitations. The organization maintains high standards in data collection and dissemination, ensuring accuracy and neutrality in its reporting.
WBG: The World Bank Group's ethical standards are overseen by the Ethics and Business Conduct Department (EBC), and its data is released under the Creative Commons license CC BY 4.0. According to their Terms of Use for Datasets, datasets are made publicly accessible with proper attribution, though some third-party data may have additional restrictions. The organization promotes transparency, but disclaims any warranty regarding the accuracy or utility of the data. Users are responsible for verifying data ownership and must not misrepresent the World Bank’s endorsement. Disputes over data use are handled through a defined mediation and arbitration process.
Eurostat: Eurostat follows rigorous ethical and professional standards, ensuring that its statistical processes align with the European Statistics Code of Practice. The organization prioritizes transparency, impartiality, and accuracy, carefully assessing the potential biases in national data sources. Additionally, Eurostat adheres to Regulation (EU) 2018/1725, ensuring data privacy and sustainability while maintaining high-quality, bias-free statistical reporting.

Technical Analysis

OECD:
- Format: .sav
- Metadata: OECD provides metadata, which can be found using the following link.
- URI: The datasets were downloaded with EdSurvey.
- Provenance: PISA data and methodology
WBG:
- Format: csv and xlsx.
- Metadata: WBG provides a JSON file containing all the information about the dataset, which can be downloaded using this link. Metadata on Data Source and Last Updated Date is also included directly in the xlsx file.
- URI: WDI_EXCEL_2024_12_16.zip, WDI_CSV_2024_12_16.zip
- Provenance: World Development Indicators
Eurostat:
- Format: csv and xlsx
- Metadata: estat_educ_uoe_ent02
- URI: estat_educ_uoe_ent02
- Provenance: estat_educ_uoe_ent02
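
Files in these formats often store one column per year, which usually has to be reshaped before the sources can be compared. The sketch below converts such a wide table into one record per country, indicator and year, using only the standard library. The column names `Country Code` and `Indicator Code`, the indicator code `EXAMPLE.IND` and all values are assumptions made for illustration and should be checked against the actual files.

```python
import csv
import io

# Invented sample in a wide layout: one column per year.
SAMPLE = """Country Code,Indicator Code,2019,2020
ITA,EXAMPLE.IND,1.5,
FRA,EXAMPLE.IND,2.0,2.5
"""


def wide_to_long(handle):
    """Turn year columns into one record per (country, indicator, year)."""
    records = []
    for row in csv.DictReader(handle):
        for column, value in row.items():
            # Year columns are the ones whose header is purely numeric;
            # empty cells mean that no observation is available.
            if column.isdigit() and value not in ("", None):
                records.append(
                    {
                        "country_code": row["Country Code"],
                        "indicator_code": row["Indicator Code"],
                        "year": int(column),
                        "value": float(value),
                    }
                )
    return records


long_rows = wide_to_long(io.StringIO(SAMPLE))
for record in long_rows:
    print(record)
```

To read a real file instead of the sample, the same function can be passed a handle returned by `open(path, newline="", encoding="utf-8")`.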

Sustainability of the project

The source datasets used for this project are provided by Eurostat, OECD and the World Bank Group, which maintain them in their respective databases. While the URIs in this project may eventually become obsolete, the data remains accessible through these institutions.

Knowledge for Knowledge’s Sake is the final project developed for the Open Access and Digital Ethics course (a.y. 2024/2025) within the Digital Humanities and Digital Knowledge Master’s Degree (University of Bologna). As such, it is not actively maintained and will not be updated in the future. However, the scripts used to process the data are available under the CC BY 4.0 license and can be rerun with updated versions of the datasets.

Visualisations

The visualizations for the source datasets are available on the secondary-education, university and employment pages. Additionally, the map visualization for the mashup can be found on the map page.

RDF Metadata

We used the RDF Diagram Framework from the Institute for Applied Informatics (InfAI) to encode the metadata about all our data, including the original datasets and our mashed-up dataset. Click here to see the code on our GitHub project.