MuCH-O

Extracting relations between songs and cultural heritage objects and agents with the Music to Cultural Heritage Ontology.

No es mucho, pero es MuCH-O ("It's not much, but it's MuCH-O")

The project

This project started with specific competency questions for crafting an ontology able to map connections between a song's lyrics and interpretations and other cultural heritage objects and agents. Embark on this journey to see how the mucho-gustore knowledge graph unveils different levels of relations.

Data have been extracted from Spotify, Genius, and MusicBrainz. Sentence-BERT has been employed for candidate ranking, and entities have been retrieved and linked to Wikidata and DBpedia. Finally, the recent Google Bard model has been used for relation extraction from Wikipedia pages, and WordNet synsets for tracing relations back to the standard vocabulary used in the ontology development.

Knowledge graph population and the subsequent answering of the competency questions reveal the rich interconnections of music, providing users with new entities to discover and to receive as suggestions on the basis of cultural significance.

Competency questions

The project has been shaped by the competency questions stated before starting the knowledge extraction process and the ontology development. In this way, the project started with a clear vision of the data and of the modelling structure needed to answer the questions efficiently and pursue the project's objectives.

Knowledge extraction has been performed with the intent of proposing a case study using a single song and its lyrics' annotations as a starting point: "Resistance" by Muse. The questions have been formulated starting from these ideas.

  • Who are the artists that authored the song?
  • What genres are related to those artists?
  • Which entity/ies is/are referenced by the song?
  • Which annotations have been identified as referencing the above entity, and what do they contain?
  • Which are the lyrics' fragments that have been annotated with the above annotation?
  • Which entities have been influenced in some way by the entity referenced by the song?
  • Which cultural heritage objects are related to the entity referenced by the song, and by what kind of relation?
  • From where was each of these relations extracted? What does the source text say about them?

Knowledge Extraction

The knowledge extraction process has been carried out using Python 3 and different libraries for data extraction, sentence similarity, synset similarity, candidate entity ranking, knowledge graph population, etc.

The whole process has been performed in a Jupyter Notebook and can therefore be fully inspected. This section provides an overview of the process, including the notebook snippets, for a better in-depth analysis.

00. Imports

For what concerns the imported libraries and packages, requests has been used for performing general API requests to Spotify and Wikidata, musicbrainzngs for retrieving song data from MusicBrainz, and lyricsgenius for connecting to the Genius APIs. Moreover, qwikidata and SPARQLWrapper have been used for connecting directly to the Wikidata and DBpedia SPARQL endpoints.

nltk and spaCy have been imported as the main libraries for NLP tasks, including the ones dealing with WordNet synsets. Finally, BeautifulSoup4 has been imported for scraping Wikipedia pages, Bard-API for sending prompts to Google Bard for relation extraction, and sentence_transformers for computing sentence similarity scores.
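A minimal consolidated import cell might look as follows; the exact module list in the notebook may differ.

```python
import requests                    # Spotify and Wikidata REST APIs
import musicbrainzngs              # song data from MusicBrainz
import lyricsgenius                # Genius APIs (lyrics, annotations)
from qwikidata.entity import WikidataItem
from SPARQLWrapper import SPARQLWrapper, JSON   # Wikidata/DBpedia SPARQL endpoints

import nltk                        # nltk.download("wordnet") may be needed once
from nltk.corpus import wordnet as wn
import spacy                       # NER over the annotations

from bs4 import BeautifulSoup      # scraping Wikipedia pages
from bardapi import Bard           # prompting Google Bard
from sentence_transformers import SentenceTransformer, util  # sentence similarity
```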

01. Getting data

The starting point is the song ID, which can be retrieved either by right-clicking on the track in the Spotify app and selecting "Share" > "Copy Song Link", or directly from the URL of the song's page. With this ID it is possible to connect to the Spotify APIs and retrieve enough data about the song, the artists performing it, etc. From Spotify, data such as the song's external IDs, artists, and genres have been extracted.
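A sketch of this step is shown below; the access token and the song ID are placeholders, and the field selection is an assumption about what the notebook keeps.

```python
import requests

TOKEN = "<spotify-access-token>"        # obtained, e.g., via the client-credentials flow
SONG_ID = "<id-from-the-share-link>"    # placeholder for the track ID
headers = {"Authorization": f"Bearer {TOKEN}"}

track = requests.get(f"https://api.spotify.com/v1/tracks/{SONG_ID}", headers=headers).json()
song = {
    "title": track["name"],
    "external_ids": track["external_ids"],              # e.g. the ISRC code
    "artist_ids": [a["id"] for a in track["artists"]],
}

# Genres are attached to the artist objects, not to the track itself.
for artist_id in song["artist_ids"]:
    artist = requests.get(f"https://api.spotify.com/v1/artists/{artist_id}", headers=headers).json()
    song.setdefault("genres", []).extend(artist["genres"])
```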

In the notebook, information that has not been used in the project has also been extracted, to give an example of the amount of potentially downloadable data. For instance, lyrics, comments, the description, and annotations have been retrieved, while only the lyrics and the annotations have been used for extracting extra information from text, given their reliability.

From MusicBrainz, additional information about the artists and the recording has been stored in the dictionary containing the song's information.
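Continuing the sketch above, the ISRC obtained from Spotify can be used to reach the corresponding MusicBrainz recording; the stored fields are assumptions.

```python
import musicbrainzngs

musicbrainzngs.set_useragent("mucho-gustore", "0.1")   # MusicBrainz requires a user agent

result = musicbrainzngs.get_recordings_by_isrc(song["external_ids"]["isrc"],
                                               includes=["artists"])
recording = result["isrc"]["recording-list"][0]
song["musicbrainz"] = {
    "recording_id": recording["id"],
    # artist-credit interleaves artist dicts with join phrases, hence the filter
    "artists": [c["artist"]["name"] for c in recording.get("artist-credit", [])
                if isinstance(c, dict)],
}
```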

02. Information extraction from song annotation

The information extraction part dealt with Named Entity Recognition (NER). In particular, spaCy has been used for extracting named entities from the song annotations. Among the extracted entities, the focus was only on those recognized as "work of art" and "person".

Once the entities were stored in a dictionary, another iteration over the annotations enabled storing the number of occurrences of each named entity string in the texts. The number of annotations in which each reference appeared has also been retrieved and stored.
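A minimal sketch of this step follows; spaCy labels the two entity types of interest as WORK_OF_ART and PERSON, while the counting strategy shown is an assumption.

```python
import spacy
from collections import Counter

nlp = spacy.load("en_core_web_sm")        # assumed English pipeline with NER

annotations = ["..."]                     # annotation texts retrieved from Genius

entities = {}                             # surface form -> spaCy label
occurrences = Counter()                   # occurrences of each string across all texts
annotation_hits = Counter()               # number of annotations mentioning each string

for text in annotations:
    for ent in nlp(text).ents:
        if ent.label_ in ("WORK_OF_ART", "PERSON"):
            entities[ent.text] = ent.label_

for text in annotations:
    for surface in entities:
        n = text.count(surface)
        occurrences[surface] += n
        if n:
            annotation_hits[surface] += 1
```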

03. Wikidata candidate entities retrieval

In this phase, the main goal is to understand which existing entities are cited among those extracted as "works of art" in the previous step. Candidate entities are searched among Wikidata entities by means of their labels. A connection to the Wikidata API was used to get possible results by matching the string of the Named Entity recognised in the annotations against the Wikidata entities' labels.

The results, obtained as a list of dictionaries representing the entities, were filtered according to a simple principle: since the interest is in works of art, retrieved entities that are not of type "creative work", or of any of its subclasses applied recursively, were removed from the list. The final candidates are therefore stored in a list of creative works ordered according to Wikidata relevance.
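A sketch of the retrieval and filtering, assuming "creative work" corresponds to Wikidata's Q17537576 and using a P31/P279* path for the recursive subclass check:

```python
import requests

def search_candidates(label):
    """Wikidata entities whose label matches the recognised string."""
    params = {"action": "wbsearchentities", "search": label,
              "language": "en", "format": "json", "limit": 20}
    return requests.get("https://www.wikidata.org/w/api.php", params=params).json()["search"]

def is_creative_work(qid):
    """ASK whether the entity is an instance of (a subclass of) creative work."""
    query = f"ASK {{ wd:{qid} wdt:P31/wdt:P279* wd:Q17537576 }}"
    r = requests.get("https://query.wikidata.org/sparql",
                     params={"query": query, "format": "json"})
    return r.json()["boolean"]

# Candidates keep the API's relevance order after filtering.
candidates = [c for c in search_candidates("1984") if is_creative_work(c["id"])]
```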

04. Candidates ranking and disambiguation with Wikidata and Sentence-BERT

Here, the objectives were: 1) understanding whether the Named Entities retrieved are actually works of art mentioned in the song's annotations or just errors made by the NER model, and 2) in the first scenario, ranking the candidate entities and selecting the correct one according to probability.

With these aims, a custom score was computed for each candidate entity and for each work of art involved. The score is the sum of the following sub-scores:

  • Relevance given by the order of the results returned by the Wikidata API
  • Number of Named Entities recognized as "person" in the annotations that are also objects of Wikidata triples having the candidate entity as subject
  • Number of Named Entities recognized as "person" in the annotations and occurring in the Wikidata description of the candidate entity
  • A dependency score assigning a value of -1 to each candidate entity that is a derivative work of another candidate entity according to Wikidata
  • A score computed using Sentence-BERT to assess the similarity between the sentences in which the Named Entity appeared in the annotations and the Wikidata entity description

If the final score lies below a certain threshold, the candidate is removed as not relevant. Among the remaining candidates, only the one with the highest score is selected.
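A sketch of the Sentence-BERT sub-score and of a possible aggregation follows; the model name, the weighting, and the aggregation itself are assumptions, and only the list of sub-scores above comes from the project.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")     # assumed model choice

def similarity_subscore(context_sentences, description):
    """Similarity between the sentences mentioning the Named Entity
    and the candidate's Wikidata description."""
    ctx = model.encode(context_sentences, convert_to_tensor=True)
    desc = model.encode(description, convert_to_tensor=True)
    return float(util.cos_sim(ctx, desc).max())

def total_score(rank, person_triple_hits, person_desc_hits,
                is_derivative, context_sentences, description):
    return (1 / (rank + 1)                  # Wikidata relevance
            + person_triple_hits            # persons as objects of the candidate's triples
            + person_desc_hits              # persons in the candidate's description
            + (-1 if is_derivative else 0)  # dependency penalty
            + similarity_subscore(context_sentences, description))
```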

05. Searching for links to other entities on Wikipedia and DBpedia

Once the entity with the highest score has been selected, its Wikipedia URL is taken from Wikidata. The URL serves as an "ID" for retrieving the DBpedia entity representing the page. From there, the existing links to other Wikipedia pages were extracted. Among these, only the ones related to creative works or artists were kept.
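A sketch of the DBpedia side, assuming the links are reachable through dbo:wikiPageWikiLink and that "creative works or artists" maps to the dbo:Work and dbo:Artist classes:

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# The DBpedia resource URI mirrors the Wikipedia page title,
# e.g. .../wiki/Nineteen_Eighty-Four -> dbr:Nineteen_Eighty-Four
page_title = wikipedia_url.rsplit("/", 1)[-1]

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setReturnFormat(JSON)
sparql.setQuery(f"""
    SELECT DISTINCT ?link ?type WHERE {{
        <http://dbpedia.org/resource/{page_title}> dbo:wikiPageWikiLink ?link .
        ?link rdf:type ?type .
        FILTER (?type IN (dbo:Work, dbo:Artist))
    }}
""")
links = sparql.query().convert()["results"]["bindings"]
```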

With the BeautifulSoup4 library, the Wikipedia page was scraped, considering only the subsections related to cultural influence or adaptations of the selected candidate. From these, all the sentences containing the aforementioned links were stored.

Each link was also stored in a dictionary along with other values such as its internal ID, the DBpedia entity it refers to, its DBpedia class, etc. The internal ID was used to facilitate the relation extraction process by replacing the original HTML anchor tag.
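A sketch of the scraping step; the section headings and the internal-ID scheme below are assumptions.

```python
import requests
from bs4 import BeautifulSoup

TARGET_SECTIONS = {"Cultural influence", "In popular culture", "Adaptations"}  # assumed

soup = BeautifulSoup(requests.get(wikipedia_url).text, "html.parser")

link_registry = {}        # internal ID -> link metadata (DBpedia entity, class, ...)
sentences = []

for heading in soup.find_all(["h2", "h3"]):
    if heading.get_text(strip=True).replace("[edit]", "") not in TARGET_SECTIONS:
        continue
    for sibling in heading.find_next_siblings():
        if sibling.name in ("h2", "h3"):             # stop at the next section
            break
        replaced = False
        for anchor in sibling.find_all("a", href=True):
            if anchor["href"].startswith("/wiki/"):
                internal_id = f"E{len(link_registry)}"
                link_registry[internal_id] = {"href": anchor["href"]}
                anchor.replace_with(f"[{internal_id}]")  # eases relation extraction later
                replaced = True
        if replaced:
            sentences.append(sibling.get_text(" ", strip=True))
```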

06. Relation extraction from plain text using Google Bard and WordNet

The Google Bard LLM was used for extracting the relationships between the selected candidate entity and the other entities retrieved via DBpedia and present in the selected Wikipedia pieces of text. In particular, the prompt asked the model to extract the relations and return the result as a pandas dataframe representing the connections.

Since the model returns relations of any kind between the involved entities, it was necessary to trace each relation back to a standard construct (e.g., "based on", "inspired by", "adapted from", etc.). Half of the work was done using regular expressions, while the rest involved WordNet synsets: for each relation extracted, the verb was isolated and every synset of that verb was compared with the synsets of each of the standard constructs' verbs to compute their similarity. The construct with the highest similarity was assigned to the extracted relation if the score was above a certain threshold; otherwise, the relation was classified as one of "general influence".
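A sketch of the synset-based mapping; the construct-to-verb table and the threshold are assumptions.

```python
from nltk.corpus import wordnet as wn    # requires nltk.download("wordnet") once

STANDARD_CONSTRUCTS = {                  # assumed mapping of constructs to verbs
    "based on": "base",
    "inspired by": "inspire",
    "adapted from": "adapt",
    "cites": "cite",
}
THRESHOLD = 0.3                          # assumed similarity threshold

def map_to_construct(extracted_verb):
    """Trace an extracted relation verb back to a standard construct."""
    best, best_score = "general influence", 0.0
    for construct, verb in STANDARD_CONSTRUCTS.items():
        for s1 in wn.synsets(extracted_verb, pos=wn.VERB):
            for s2 in wn.synsets(verb, pos=wn.VERB):
                score = s1.path_similarity(s2) or 0.0
                if score > best_score:
                    best, best_score = construct, score
    return best if best_score >= THRESHOLD else "general influence"
```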

Ontology and Knowledge Graph population

The MuCH-O ontology has been developed starting from the competency questions and considering the availability of the data retrieved as explained in the previous section. The starting point is music, and songs in particular, while special attention was dedicated to the relations between entities. For these reasons, the music-meta ontology and the PROV ontology were selected as starting points for describing information objects and music entities, and influence relations, respectively.

Specific classes of these ontologies were used and extended for the project's purposes. For what concerns relations, the PROV ontology allows for connecting different entities by means of simple object properties, but also through so-called "qualified influences". These are more complex constructs that were reused for specifying additional information about the retrieved influence. Naturally, these relations have been extended to include the phenomena relevant for the project.
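In RDFLib terms, the qualified pattern looks roughly as follows; the project namespace and the sourceText property are placeholders, and only the PROV terms are real.

```python
from rdflib import Graph, Namespace, Literal

PROV = Namespace("http://www.w3.org/ns/prov#")
MUCHO = Namespace("https://example.org/mucho/")   # hypothetical project namespace

g = Graph()
novel, album = MUCHO["1984"], MUCHO["Diamond_Dogs"]
influence = MUCHO["influence_1"]

g.add((album, PROV.wasInfluencedBy, novel))           # simple object property
g.add((album, PROV.qualifiedInfluence, influence))    # qualified form carrying extra info
g.add((influence, PROV.influencer, novel))
g.add((influence, MUCHO.sourceText, Literal("...sentence the relation came from...")))
```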

Population of the mucho-gustore Knowledge Graph

The mucho-gustore Knowledge Graph has been created and populated using RDFLib, reusing information from the dictionaries storing data related to the song, the annotations, and the related entities. The URIs representing classes and properties have been defined in a Python file external to the involved notebook, named URIs.py. All the variables contained in this file and referencing the URIs of classes and properties have been imported at the beginning of the population process, along with the aforementioned Python library.
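A sketch of the population step; the names imported from URIs.py below (Song, hasTitle) are placeholders for whatever the file actually defines.

```python
from rdflib import Graph, Namespace, Literal, RDF
from URIs import *                       # class/property URIRefs (names assumed below)

EX = Namespace("https://example.org/mucho/")   # hypothetical base for individuals

g = Graph()
song_uri = EX["Resistance"]
g.add((song_uri, RDF.type, Song))                    # `Song` assumed from URIs.py
g.add((song_uri, hasTitle, Literal("Resistance")))   # `hasTitle` assumed as well
g.serialize(destination="mucho-gustore.ttl", format="turtle")
```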

Results

As for the results, we test the ontology against the competency questions by translating them into SPARQL queries and running them against the mucho-gustore knowledge graph.

A specific notebook has been used for performing the queries and analysing the results. The RDFLib library alone was enough to serve this purpose.
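As an illustration, CQ 01 could be rendered roughly as below; the prefix and property names are hypothetical, since they depend on the ontology's actual vocabulary.

```python
from rdflib import Graph

g = Graph().parse("mucho-gustore.ttl", format="turtle")   # the populated KG

query = """
    PREFIX mm: <https://w3id.org/polifonia/ontology/music-meta/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?artist ?label WHERE {
        ?song a mm:Song ;
              mm:hasArtist ?artist .     # hypothetical property
        ?artist rdfs:label ?label .
    }
"""
for row in g.query(query):
    print(row.artist, row.label)
```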

01. Who are the artists that authored the song?

Results

Artist: https://www.wikidata.org/wiki/Q22151

Label: Muse

02. What genres are related to those artists?

Results

Artist: Muse

Genre: modern rock; permanent wave; rock.

03. Which entity/ies is/are referenced by the song?

Results

Entity: https://www.wikidata.org/wiki/Q208460

Type: LiteraryEntity

Title: 1984

04. Which annotations have been identified as referencing the above entity, and what do they contain?

Results *

Annotation: annotation_8

Text: If a ceiling, roof or other structure caves in, it breaks and falls into the space below. This is just figuratively implying their worries that their hiding place may be exposed and they may get arrested for romantic love, which is a crime in the world of “1984”.

* Just one result has been included in the example for reasons of space.

05. Which are the lyrics fragments that have been annotated with the above annotation (therefore referencing in some way the involved entity)?

Results *

Lyrics fragment: fragment_8

Text: Or will the walls start caving in?

Annotation: annotation_8

* Just one result has been included in the example for reasons of space.

06. Which entities have been influenced in some way by the entity referenced by the song?

Results *

Entity: http://dbpedia.org/resource/2_+_2_=_5_(song)

Type: Song

Label: 2 + 2 = 5

Entity: http://dbpedia.org/resource/The_Jam

Type: Group

Label: The Jam

* Only some of the results have been included in the example for reasons of space.

07. Which cultural heritage objects are related to the entity referenced by the song (and therefore to the song through a two-step connection), and by what kind of relation?

Results *

Entity: http://dbpedia.org/resource/The_God_Complex

Type: AudiovisualEntity

Label: The God Complex

Relation: cites

Entity: http://dbpedia.org/resource/Diamond_Dogs

Type: MusicAlbum

Label: Diamond Dogs

Relation: wasInfluencedBy

* Only some of the results have been included in the example for reasons of space.

08. From where was each of these relations extracted? What does the source text say about them?

Results *

Entity: http://dbpedia.org/resource/2_+_2_=_5_(song)

Label: 2 + 2 = 5 (song)

Relation type: EntityInfluence

Source: https://en.wikipedia.org/wiki/Nineteen_Eighty-Four

Source text: Radiohead's 2003 single "2 + 2 = 5", from their album Hail to the Thief, is Orwellian by title and content. Thom Yorke states, "I was listening to a lot of political programs on BBC Radio 4. I found myself writing down little nonsense phrases, those Orwellian euphemisms that [the British and American governments] are so fond of. They became the background of the record."

* Just one result has been included in the example for reasons of space.