Seventh String Software

A reflection about automatic transcription of music

I am assuming here that we are talking about taking a recording of some music and producing written music in standard notation. The purpose of this would be either to enable a musician to play the piece by reading the music, or perhaps just to understand the music better, for educational purposes.

From time to time I get messages from people telling me that they have solved, or are about to solve, the problem of automatic transcription. When I ask them for more information it generally turns out that they have written a program to do what I call "note detection" on simple material. That means taking a musical recording and detecting what notes are being played at what time. This is not very difficult as long as it is something simple like solo piano, and Transcribe! already does this - see the Piano Roll view. What they generally don't understand is that getting from note detection to a written transcription in standard notation is absolutely not a trivial task, and in fact is much harder than basic note detection. Of course note detection also becomes very difficult when the musical material is more complex and has more instruments playing.

This note is addressed to those people, and to anyone else who thinks that computers should be capable of automatically transcribing music from recordings.

Many people, especially if they have not had much experience of transcribing music, imagine that this is a process of identifying the notes being played and translating them into notation. This is not entirely false but it is a dangerously simplistic view. Written music is a series of instructions to the musician, which the musician will interpret as they see fit. Therefore the true purpose of transcription is to try to decide what series of notational instructions would cause a musician to play the music that we are hearing, and write down that series of instructions. The process of transcription from sound to notation is not really a translation; it is more like reverse engineering.

To illustrate this with an analogy, imagine someone is sitting in a chair and they have a slip of paper with some instructions. The instructions say:

  1. Stand up, turn a full circle, and sit down again.
  2. Scratch your head with your right hand.
  3. Clap your hands three times.
  4. Smile.

The person carries out the instructions and the computer's job is to look at the video and try to figure out what instructions they were following. Without going into details, I trust you can see that this is not easy. If the computer generates instructions saying "move this leg in this direction, raise that arm..." then these instructions will be almost useless. We can imagine a person trying to carry them out, and then finally saying "Ah! You mean that you want me to turn around". For a successful transcription, the computer needs to recognise the person's intention (to turn around) and write that down.

Suppose a piano player plays a run, but some of the notes overlap because the second note starts before the first has been released. Should we notate these overlaps? If we do then the result will be a mess, hard to read. Probably we should just notate a run with no overlaps. We should notate the intention, not the execution. A good musician does not play rhythms in the way that a strict definition of the note values would suggest - if they did, it would sound mechanical, unmusical. To notate a played rhythm we must figure out the intention, the meaning of the rhythm. This is what quantisation is about, of course, and it's not easy to get it right. In general a knowledge of the musical style and of the particular instrument is necessary in order to produce a useful transcription. For instance, if a guitar is strumming, then a complete transcription of every note played is probably not appropriate. Instead, chord symbols with an indication of the strumming rhythm would be more useful. If we did present a guitarist with a complete transcription of every note then it would be very complex (because the notes in a strummed chord are not played simultaneously), and we can imagine the musician struggling with it for a while before saying "Ah! You mean that you want me to strum a G major chord".
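To make the overlapping-run example concrete, here is a deliberately naive sketch of grid quantisation in Python. Everything in it is an assumption made for illustration: the function name `quantise_run`, the input format of (onset, offset, MIDI pitch) tuples in seconds, and the idea that the beat length is already known. It snaps onsets to the nearest grid step and trims each note so that it ends no later than the next one begins.

```python
def quantise_run(notes, beat_seconds, steps_per_beat=4):
    """Snap a monophonic run to a grid and remove overlaps.

    notes: list of (onset_s, offset_s, midi_pitch), sorted by onset.
    Returns a list of (start_step, length_steps, midi_pitch), where one
    step is beat_seconds / steps_per_beat. Hypothetical helper, not a
    real transcription algorithm.
    """
    step = beat_seconds / steps_per_beat
    # Snap every onset to the nearest grid step.
    starts = [round(onset / step) for onset, _, _ in notes]
    result = []
    for i, (_, offset, pitch) in enumerate(notes):
        start = starts[i]
        # Snap the release too, but keep every note at least one step long.
        end = max(round(offset / step), start + 1)
        # Notate the intention: a note in a run ends where the next begins.
        if i + 1 < len(notes) and starts[i + 1] > start:
            end = min(end, starts[i + 1])
        result.append((start, end - start, pitch))
    return result


# Three notes played slightly late or early, the first held too long.
played = [(0.02, 0.40, 60), (0.24, 0.55, 62), (0.51, 0.77, 64)]
print(quantise_run(played, beat_seconds=0.5, steps_per_beat=2))
```

A fixed grid like this only works when the grid resolution and the beat positions happen to suit the material. It has no notion of swing, rubato or strummed chords, which is exactly the kind of musical context the paragraph above says is needed.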

All rhythms are notated in relation to the beats of a measure (bar) so before you can even begin to think about how you will notate the rhythms you must decide where the beats and the measures are. For some material this may be easy, for instance if there is a bass drum hitting the first beat of every measure, but as a general matter it is not necessarily easy at all for the computer. Even musicians can find this difficult if the musical style is unfamiliar to them. And if you get it wrong then your transcription will not make much sense.
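The easy case mentioned above, a bass drum on the first beat of every measure, can be sketched in a few lines of Python. This is an illustration under strong assumptions: the drum onsets have already been detected, there is exactly one per measure, and the number of beats per measure is supplied by a human. The function names are hypothetical.

```python
from statistics import median


def estimate_measure_seconds(downbeat_onsets):
    """Estimate measure length from onset times (in seconds) of a drum
    assumed to hit once per measure, on the first beat."""
    gaps = [b - a for a, b in zip(downbeat_onsets, downbeat_onsets[1:])]
    # The median is less disturbed by one badly timed hit than the mean.
    return median(gaps)


def tempo_bpm(measure_seconds, beats_per_measure):
    """Convert a measure length to beats per minute."""
    return 60.0 * beats_per_measure / measure_seconds


onsets = [0.0, 2.0, 4.02, 5.99, 8.0]
measure = estimate_measure_seconds(onsets)
print(measure, tempo_bpm(measure, beats_per_measure=4))
```

Note what the sketch cannot do: it cannot tell whether those measures contain four beats, three, or two groups of three. That has to be passed in as `beats_per_measure`, and deciding it is part of the hard problem.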

The system of standard musical notation has existed for hundreds of years, and has evolved so that all the things which are normally played in western music have conventional representations on the page. For example, to write a chord consisting of B, Eb and F# is almost certainly wrong, even if the notes are correct. It should be B, D# and F#, or possibly Cb, Eb and Gb. Similar issues apply to rhythms, key signatures, time signatures, and indeed every aspect of written music. There are always a small number of ways of notating something readably and infinitely many ways of notating it badly, which will cause musicians to recoil in horror when you ask them to play it.
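The chord-spelling example can be illustrated with a small Python sketch. It relies on one convention only, that a triad is written on every other letter name above its root (B-D-F, C-E-G and so on), and it chooses the accidental that makes each letter match the sounding pitch. The function name and the pitch-class input format (0 = C, 11 = B) are assumptions made for this illustration.

```python
LETTERS = "CDEFGAB"
NATURAL_PC = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}


def spell_triad(root_letter, pitch_classes):
    """Spell a triad as stacked thirds by letter name.

    root_letter: the letter chosen for the root, e.g. "B" or "C".
    pitch_classes: sounding pitch classes of root, third and fifth.
    """
    first = LETTERS.index(root_letter)
    names = []
    for k, pc in enumerate(pitch_classes):
        # Root, third and fifth fall on every other letter name.
        letter = LETTERS[(first + 2 * k) % 7]
        # Signed distance from the natural letter to the sounding pitch,
        # folded into the range -6..5 so that flats come out negative.
        diff = (pc - NATURAL_PC[letter] + 6) % 12 - 6
        accidental = "#" * diff if diff > 0 else "b" * (-diff)
        names.append(letter + accidental)
    return names


# The same three sounding pitches, spelled from two different roots.
print(spell_triad("B", [11, 3, 6]))  # B major
print(spell_triad("C", [11, 3, 6]))  # the same chord as Cb major
```

Both results are readable, while the mixed spelling B, Eb, F# never comes out of this rule. Choosing between the two still depends on the key and the surrounding harmony, which the sketch knows nothing about.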

And now here's a sobering thought: all the issues I have discussed so far apply even if the computer starts with a perfectly accurate list of which notes were played at what time. But when the computer is starting from an audio recording, with anything other than the very simplest material, the list of notes it detects will have many inaccuracies. This will multiply the problems many times over. I'm not saying it's impossible - after all, a skilled musician can do it - but I am saying that it might well be AI-complete. Also see the footnote at the bottom of this page.

Then you must ask what the output of your program will be used for. The chances are that it will, on anything but the simplest material, contain large numbers of mistakes, or notations which might be technically correct but unreadable. So the first thing the user will need to do is correct them. This means that the user needs to be capable of transcribing music themselves, so your program does not replace the skills of a human transcriber; the best it can do is save them a bit of time. If your program manages to produce something that's close enough to be used as a starting point then it might be useful. There are already various programs which attempt to do this, see here, but I have yet to hear of anyone who finds them useful.

Of course if you are aiming to output MIDI then the problems are rather fewer. MIDI does not distinguish between D# and Eb, and the question of how to notate rhythms does not arise. Your main problems are identifying the notes in the audio, and locating the beat and deciding how many beats to the measure (bar). This is also not easy, though it can be possible on some material. If you are producing MIDI output though, you have to ask whether it will be useful, and if so what for? Note that if you don't correctly identify where the beat is and how many beats there are in each measure, then you will get a MIDI file which might play OK but which will be difficult to make sense of when you load it into a MIDI editor. The beat detection problem disappears if we are talking about a performance which was played to a MIDI click track, and in this case useful results can be achieved on some material. But then we are no longer talking about the general question of transcribing music from an existing audio recording.
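The point that MIDI does not distinguish between D# and Eb is easy to see in a short Python sketch. A MIDI note is just a number from 0 to 127, so two spellings of the same pitch collapse to the same value. The helper name is hypothetical, and the octave numbering assumes the common convention that middle C is C4 = note 60 (some software labels it C3 instead).

```python
NATURAL_PC = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}


def midi_number(name, octave):
    """Convert a spelled note such as "D#" or "Eb" to a MIDI note number.

    Assumes the convention that middle C is C4 = 60.
    """
    # Letter names are upper case, so counting lower-case "b" finds flats only.
    pitch_class = NATURAL_PC[name[0]] + name.count("#") - name.count("b")
    return 12 * (octave + 1) + pitch_class


print(midi_number("D#", 4), midi_number("Eb", 4))
```

Going in this direction loses the spelling, which is why MIDI output sidesteps the notation problem: recovering D# versus Eb from the number alone requires the kind of contextual judgement discussed earlier.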

I've mentioned the importance of knowing what the transcription is to be used for, and here I will say a little more about this. If the transcription is being produced out of academic interest, or for educational purposes, then the ideal would be to produce a full score with a stave for each instrument used in the original recording, which of course can be very difficult. For this approach you will need a thorough knowledge of what the various instruments are capable of and how they are played - for instance, suppose you are transcribing a solo guitar performance. In that case we know that what we are hearing is somehow playable on a solo guitar, and this is very useful information in guiding our transcription - if we find ourselves writing something that cannot be played then either our transcription is wrong, or the player has used some double tracking on the recording, or the player is using an ingenious playing technique that we are unaware of.

In my experience professional musicians generally transcribe music for entirely practical purposes, typically because they want to play the piece themselves. This means that the transcription needs to be tailored for the combination of instruments it will be played on, regardless of the instruments used on the original recording. There is no point in producing a detailed transcription of the drum part if you intend to play the piece on solo piano. In some ways this can make the job easier as you can ignore many details - for instance, suppose that there is a C major chord being played on a keyboard. It may not be easy to decide exactly how the chord is voiced, or whether the guitar is also present in the mix - but quite possibly you don't need to know these things for the purpose of your transcription. On the other hand, you will find yourself facing many judgement decisions about which details of the original are important and must be retained, and which can be discarded. Does the piece have a distinctive bass part that should be transcribed note for note, or is it sufficient to give the bass player chord symbols? Is there a distinctive guitar lick which needs to be present or people won't recognise the song? My point here is that you must choose an approach depending on the purpose of your transcription, and whatever approach you choose, you will face great difficulties.

Incidentally, writing chord symbols can be quite an art too, depending on the complexity of the material. For example, if additional notes appear due to the presence of a moving line, should we include those in our chord notations, or should we simplify? How should we indicate rhythmic accents?

Footnote, June 2023: The new generation of AIs are making amazing progress in creating natural language text, and they handle grammar better than many native speakers. The problem of notating music in a readable form, as opposed to notations which might be correct in some technical sense but unreadable for a human, could be seen as a grammatical problem. I wonder how long it will be before we have an AI that can listen to a musical recording and write out whatever we want... "Write out the guitar part", "Write out a lead sheet", "Write out a full score"?

