Please download to get full document.

View again

of 8

Adaptation of string matching algorithms for identification of near-duplicate music documents

Adaptation of string matching algorithms for identification of near-duplicate music documents
0 views8 pages
All materials on our website are shared by users. If you have any questions about copyright issues, please report us to resolve them. We are always happy to assist you.
  Adaptation of String Matching Algorithms forIdentification of Near-Duplicate Music Documents Matthias Robine, Pierre Hanna, Pascal Ferraro, Julien Allali To cite this version: Matthias Robine, Pierre Hanna, Pascal Ferraro, Julien Allali. Adaptation of String MatchingAlgorithms for Identification of Near-Duplicate Music Documents. Workshop on PlagiarismAnalysis, Authorship Identification, and Near-Duplicate Detection (PAN07), Jul 2007, Ams-terdam, Netherlands. pp.37–43, 2007.  < hal-00306567 > HAL Id: hal-00306567 Submitted on 28 Jul 2008 HAL  is a multi-disciplinary open accessarchive for the deposit and dissemination of sci-entific research documents, whether they are pub-lished or not. The documents may come fromteaching and research institutions in France orabroad, or from public or private research centers.L’archive ouverte pluridisciplinaire  HAL , estdestin´ee au d´epˆot et `a la diffusion de documentsscientifiques de niveau recherche, publi´es ou non,´emanant des ´etablissements d’enseignement et derecherche fran¸cais ou ´etrangers, des laboratoirespublics ou priv´es.  Adaptation of String Matching Algorithms for Identification of Near-Duplicate Music Documents Matthias Robine, Pierre Hanna, Pascal Ferraro and Julien Allali LaBRI - Universite de Bordeaux 1F-33405 Talence cedex, France ABSTRACT The number of copyright registrations for music documentsis increasing each year. Computer-based systems may helpto detect near-duplicate music documents and plagiarisms.The main part of the existing systems for the comparisonof symbolic music are based on string matching algorithmsand represent music as sequences of notes. Nevertheless,adaptation to the musical context raises specific problemsand a direct adaptation does not lead to an accurate de-tection algorithm: indeed, very different sequences can rep-resent very similar musical pieces. We are developing animproved system which mainly considers melody but takesalso into account elements of music theory in order to de-tect musically important differences between sequences. Inthis paper, we present the improvements proposed by oursystem in the context of the near-duplicate music documentdetection. Several experiments with famous music copyrightinfringement cases are proposed. In both monophonic andpolyphonic context, the system allows the detection of pla-giarisms. 1. INTRODUCTION The number of music documents available on the WorldWide Web is highly increasing. Each year, over 10000 newalbums of recorded music are released and over 100000 newmusical pieces are registered for copyright [19]. For exam-ple, the total number of musical pieces registered in Franceby the French professional association SACEM, protectingartist rights, reached 250000 pieces [7] in 2004. One of the role of this organization is to help justice to take de-cision about plagiarism complaints. Plagiarism is the actof copying or including another author idea without properacknowledgment. It is important to note that a plagiarismcan only be decided by justice. Some famous proceedingsabout plagiarism happen in the last few years: Madonnaand Salvatore Acquaviva in Belgium, Georges Harrison andThe Chiffons in UK,  Les feuilles mortes   and  La Maritza   inFrance, etc. In 2004, SACEM had only verified 18000 (out SIGIR ’07 Amsterdam. Workshop on Plagiarism Analysis,Authorship Identification, and Near-Duplicate Detection  of 250000) musical pieces in order to determine their srci-nality. A complete musical analysis is performed by expertsonly if a complaint is lodged. Considering the importantnumber of new music documents registered every year, itis difficult to check for possible plagiarism. For example,a SACEM member recently registered a piece that was theperfect copy of a Ravel’s piece. However, it is impossible tolisten and manually compare all the music document regis-tered.Some studies in the context of the Music Information Re-trieval research area deal with computer-based techniquesthat may help listeners to retrieve near-duplicate music doc-uments and may help justice to determine plagiarisms. Theseinvestigations mainly concern the open problem of the es-timation of the music similarity. The notion of similarityis very difficult to define precisely and the music similarityremains one of the most complex problem in the field of themusic information retrieval. This notion may strongly de-pend on the musical culture, on personal opinion, on mood,etc.From a computational point of view, evaluating the similar-ities consists of computing a similarity measure between apair of musical segments. Several algorithms have been pro-posed for achieving such a task between audio signals. Butthe main of these approaches are based on timbre similar-ity, mainly evaluated with statistics on low-level audio fea-tures. For example, Music Browser (Sony CSL, Paris) com-putes a similarity measure according to Gaussian models of cepstrum coefficients [13]. However, since this informationabout timbre is not relevant for the copyright protection of music documents, SACEM considers musical elements suchas melody, harmony or rhythm. Therefore, computer-basedsystems should be able to study these musical elements.Then, two problematics are raising: the extraction of musi-cal elements from audio signals in order to define symbolicdata, and comparing these data.In this paper, we present new techniques based on edit align-ment algorithms. In Section 2, we present some of the exist-ing string matching algorithms that have been adapted tothe musical context. Then in Section 3, we describe someimprovements dedicated to music documents. In Section 4,we introduce different options for estimating music similar-ity. We present finally in Section 5 some perspectives andremaining problems in the context of the detection of near-duplicate music documents or plagiarisms.  2. MEASURING SIMILARITY BETWEENSEQUENCES Musical pieces can be described as sequences of elements(notes) [12]. Measuring similarity between sequences is awell-known problem in computer science which has appli-cations in many fields such as text processing, data com-pression, bio-informatics [9, 15]. In this section, we treatthe string matching algorithms that can be adapted to themusical context. 2.1 Musical Sequences Several techniques for evaluating symbolic music similaritieshave been introduced during the last few years. Geometricalgorithms consider geometric representations of melodiesand compute the distance between objects.Some of thesesystems [20] are closely linked to the well-known piano-roll representation. Other ones represent notes by weightedpoints [17].We propose here to investigate adaptations of string match-ing algorithms, since experiments show their accuracy andtheir flexibility in the musical context [8]. Such adapta-tion requires a representation of musical pieces as sequence.In the case of monophonic music (no more than one noteis sounded at any given time), a musical piece can be as-sociated to a sequence of integers, representing pitches of successive notes. 2.2 String Matching Algorithms In [11], Levenshtein defines the notion of edit distance be-tween two strings. This distance is defined as the minimumcost of all possible sequences of elementary operations (editoperations) that transform one string into the other. Thisdistance can be computed in quadratic time  O ( | S  1 | · | S  2 | )and linear space using a dynamic programming algorithm[21]. A dual problem of edit distance is to compute align-ment of two strings. The alignment of two strings consistsin computing a mapping between the symbols of the strings.Symbols not involved in the mapping are designed as gap.The main difference between alignment and edit distance isthat alignment computes a score of similarity: the highestis this score the highest is the similarity.In many applications, two strings may not be highly simi-lar in their entirety but may contain regions that are highlysimilar. In this case, the problem is to find and extract apair of regions, one from each of the two given strings, thatexhibits high similarity. This is called  local alignment   or local similarity problem   [16]. The computation of a localsimilarity allows us to detect local conserved areas betweenboth sequences. Experiments show that considering localalignment improves the quality of symbolic melodic similar-ity systems [8]. 3. ALGORITHMICIMPROVEMENTSFORMUSIC DOCUMENTS Experiments during the the first Music Information RetrievalEvaluation eXchange (MIREX 2005) [6] clearly show thatthe accuracy of direct application of the existing string match-ing algorithms is limited. That is the reason why severalimprovements have been recently proposed which are pre-sented in this section. 3.1 Representations of Music Musical pieces are associated to sequences of notes. Therepresentation of notes is therefore an important problem.Symbolic music analysis systems generally consider the in-formation about pitch and duration [12] which are assumedto be the two main characteristics of musical notes. Sev-eral alphabets of characters and set of numbers have thusbeen proposed to represent these parameters [18]. The vo-cabulary chosen highly depends on the application. Forapplications like near-duplicate music document detection,some music retrieval properties are expected. For instance,since a musical piece can be transposed and played fasteror slower without degrading the melody, such systems haveto be transposition invariant and tempo invariant. In themonophonic context, only a few representations enables sys-tems to be transposition and tempo invariant: representingpitches by the difference between successive pitches ( inter-val  ) or in the case of tonal music, by the difference betweenthe pitch and the key of the musical piece for example.Experiments have been performed in [8] which confirm thatthe  interval   parameter leads to the most precise symbolicmelodic similarity system. Moreover, other experimentsshowthat taking into account the duration of notes significantlyimproves such systems. 3.2 Edit Operations specific to Music Substitution is the main edit operation and mainly deter-mines the accuracy of the music similarity algorithm. Forsome applications, the substitution score is assumed as con-stant. However, in the musical context, this assumptionmust be discussed [18]. It is obvious that substituting onepitch with another one has not always the same influenceon the general melody. For example, substituting a  C   notewith a  G  note (fifth) slightly modifies a melody in compar-ison with substituting a  C   note with a  D  note. As intro-duced by [12] the substitution score may be correlated tothe consonance interval. It has to be determined accordingto consonance: the fifth (7 semitones) and the third majoror minor (3 or 4 semitones) are the most consonant inter-vals in western music. Experiments show that this choicesignificantly improves algorithms [8].Other improvements have been experimentally shown. Forexample, considering the note duration for the calculationof the insertion/deletion scores improves the quality of thesimilarity systems. Indeed, the insertion of a half note maydisturb more significantly a melody than the insertion of asixteenth note. 3.3 Weighting by Taking into Account MusicTheory We think that a preliminary music analysis may highlightthe properties that help listeners to perceptually discrimi-nate two musical patterns. This analysis may therefore leadto the modification of edit operations specific to music. Forexample, the notes located on the stronger beats in a barcan be considered as more important than the other ones  x x x xov+ + +~ ~ ~+ + Figure 1: Analysis of a musical piece allows to iden-tify the different functions of the notes and theirplacement inside the bar. Above the notes, “x” tagsthe importance of the note regarding the tonalitylimited to the tonic and the dominant tones (respec-tively G and D for a G Major tonality here). “v” isused to identify the passing note and “o” for a noteon the weak part of the beat (which is not a passingnote). Under the staff, “ + ” stands for the strongbeats and “˜” for the weak ones. and can be weighted more than the notes placed on weakbeats.In [14], we proposed to use some notions of music theory toimprove the edit-based systems. A few musical elements areanalyzed and taken into account during the calculation of the edit score. Tonality:  One of the most important characteristics of thetraditional western music is the tonality. The tonic is thepitch upon which all the other pitches of a piece are hier-archically centered. The scale associated to a tonality be-gins by the tonic. In western tonal music, the tonic andthe dominant are very important. They are often used andtheir succession composes for example the perfect cadencethat commonly ends a musical piece. In the G major or inthe G minor key, tonic is the note G and dominant is thenote D, like in the example of the Fig. 1. Therefore, thealignment algorithm proposed takes into account the tonicand the dominant: if the difference in semi-tones (modulo12) between each note of the melody and the tonic equals 0(the tonic note) or 7 (dominant), the note is assumed to beimportant and is therefore marked. The musical sequencealignment favours matches between these marked notes. Passing Notes:  The algorithm proposed in [14] detectsthe passing notes in a musical piece. A passing note is as-sumed as a note between two others in a constant movement(ascending or descending) which is diatonic or chromatic.There is one occurrence of a passing note in Fig. 1. The editscores are computed according to the information about thepassing notes so that the insertion or the deletion of passingnotes is less penalized by the similarity system. Strong and Weak Beats:  The bar is a segment of timein a musical piece defined as a given number of beats of agiven duration. In function of their position in the bar, thebeats can be strong or weak with parts that are also strongor weak. We have proposed to mark the notes placed onthe beats. A weight is associated to each of these notes,depending of the strength of the beat. In 4/4 time, thestrong beats are the first (a weight 4 is given), and the third(weight 2) of the bars. Other beats are weighted with 1,and the other notes, which are not on the beats, are notweighted. An example of the different strengths is illustratedin Fig. 1. Our algorithm takes into account these weightednotes by favouring matches between notes on strong beats,and by not penalizing insertion or deletion of notes on theweak part of the beat. 3.4 Adaptation to Polyphony To take into account the polyphonic nature of musical se-quences, we propose to use a quotiented sequence represen-tation. Formally, a quotiented sequence is a sequence graphwith an equivalence relation defined on the set of vertices,such that the resulting quotient graph is also a sequence. Aquotiented sequence can be considered as a self-similar struc-ture represented by sequences on two different scales. A quo-tiented sequence can also be modelled by a tree of depth 2where the leaves represent the support sequence and the in-terior nodes represent the quotient sequence. In the contextof polyphonic music, notes that occur at the same time aregrouped to form a quotiented sequence  Q = ( S,W,π ) where S   is a suite of notes,  W   the suite of chords and  π  the appli-cation that maps a set of notes to each chord. Each vertexof the quotiented sequence is labelled by the pitch and theduration of each note. [10] has proposed two distances be-tween quotiented sequences based on the computation of anoptimal suite of edit operations that preserves equivalencerelations on sequence vertices.Furthermore, as previously explained, since a near-duplicatemusical piece can be transposed (one or several times) with-out degrading the melody, algorithms for detecting near-duplicate music have to be transposition invariant. Thus, [2]proposes an srcinal dynamic programming algorithm thatallows edit based algorithms to take into account successivelocal transpositions and to deal with transposed polyphonicmusic. 3.5 System for Detecting Near-Duplicate Mu-sic Documents According to the improvements presented in this section, wedeveloped an edit-distance based algorithm for estimatingsimilarity between symbolic melodic fragments. It allows usto consider a musical piece (or a fragment) and compare it toa symbolic music database. The system presented computesan edit score by comparing the musical piece tested and allthe pieces of the database. The more important the score is,the more similar the pieces compared are. This system havealready been evaluated in the last few years. It obtains thevery accurate results with MIREX 2005 training database[8]. It also participated to the MIREX 2006 contest andobtained the best results in the monophonic context. Dif-ferences with other edit-distance based algorithms show thatthe optimizations proposed, specific to the musical context,permit to significantly improve such algorithms. 4. MUSIC SIMILARITY In this section, we propose to illustrate with examples thedifferent ways for automatically evaluating the musical sim-ilarity between musical pieces. We consider some famousexamples of plagiarisms in order to show that a computer-based method is able to automatically detect near-duplicate
We Need Your Support
Thank you for visiting our website and your interest in our free products and services. We are nonprofit website to share and download documents. To the running of this website, we need your help to support us.

Thanks to everyone for your continued support.

No, Thanks

We need your sign to support Project to invent "SMART AND CONTROLLABLE REFLECTIVE BALLOONS" to cover the Sun and Save Our Earth.

More details...

Sign Now!

We are very appreciated for your Prompt Action!