I made an AI listen to 200 Ed Sheeran songs and told it to write Valentine’s Day messages
Nothing says “Happy Valentine’s Day, dear!” like wasting 4 hours training an AI instead of taking 15 minutes to hand-write a thoughtful Valentine’s Day message. So let’s go for it!
Note: If you just want to try the message generator yourself, scroll to the bottom.
Some people just have a way with words (not me, of course). Many people would agree that Ed Sheeran sure does! So with the recent advances in AI and natural language processing, let’s see if we can pick his brain to come up with greeting-card-worthy messages.
To do that, we will take an existing language model (GPT-2) and retrain it in Google Colab based on Ed’s song lyrics.
Step 1, get data
Data collection is usually most of the work when you can stand on the shoulders of deep learning giants for the model itself. Luckily, we don’t have to google Ed Sheeran’s 200 most popular songs and copy/paste the lyrics into a .txt file. Instead, we can use the API created by the people at Genius.com, my favorite site for song lyrics. You’ll need to create your own developer account and tokens here. Once you have those:
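!pip install git+https://github.com/johnwmillr/LyricsGenius.git

import os
import lyricsgenius

# Log in with your personal Genius API token
genius = lyricsgenius.Genius('enter your own token here',
                             remove_section_headers=True,
                             excluded_terms=["(Remix)", "(Live)", "(Acoustic)"],
                             timeout=15, retries=3)

# Retrieve the 200 most popular songs
artist = genius.search_artist("Ed Sheeran", max_songs=200, sort="popularity")

# Append every song's lyrics to a single text file
file_name = "ed_sheeran_lyrics.txt"
with open(file_name, 'a') as f:
    for song in artist.songs:
        data = song.lyrics
        # End each song with a newline so lyrics of consecutive songs don't run together
        f.write(data + "\n")
# No f.close() needed: the with block closes the file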
We log in to the API using our personal token and retrieve the 200 most popular songs. We exclude things like acoustic versions, because we don’t want too many duplicate lyrics polluting the data. We also exclude the section headers, lest your crush get suspicious when reading words like [Intro] or [Chorus] in your beautifully personalized Valentine’s Day poem.
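If your lyrics come from a source that doesn't strip headers for you, or if near-identical versions slip past the excluded terms, a small clean-up pass can do the same job. The sketch below is only an illustration: `strip_section_headers` and `dedupe_lyrics` are hypothetical helper names, not part of LyricsGenius, and the header pattern assumes a header is a line containing nothing but a bracketed tag.

```python
import re

# Assumption: a section header is a line consisting only of a bracketed tag,
# e.g. "[Intro]" or "[Verse 1]". Brackets inside a lyric line are left alone.
SECTION_HEADER = re.compile(r"^[ \t]*\[[^\]\n]*\][ \t]*$\n?", flags=re.MULTILINE)


def strip_section_headers(lyrics: str) -> str:
    """Remove header-only lines such as [Intro] or [Chorus]."""
    return SECTION_HEADER.sub("", lyrics).strip()


def dedupe_lyrics(songs: list[str]) -> list[str]:
    """Keep the first copy of each song, ignoring case and whitespace differences."""
    seen = set()
    unique = []
    for lyrics in songs:
        key = " ".join(lyrics.lower().split())
        if key and key not in seen:
            seen.add(key)
            unique.append(lyrics)
    return unique
```

This only catches exact repeats after normalization. A live version with ad-libbed lines would still get through, which is why excluding those titles up front remains the first line of defense.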
Step 2, load and retrain GPT-2
In our case, we will use GPT-2 as the pre-trained model. The reason for choosing it is that more recent models like GPT-J are too large to retrain on most accessible compute resources. Out of pure spite, I won’t mention that other recent not-so-open AI model. Anyway, for GPT-2 there is a very user-friendly Python package called gpt-2-simple. By the way, the technical term for retraining a pre-trained model on your own data for a custom use case is finetuning.
!pip install -q gpt-2-simple

import gpt_2_simple as gpt2
from datetime import datetime
from google.colab import files
import os
import requests

# Download the smallest pre-trained model to save time
gpt2.download_gpt2(model_name="124M")

# Mount Google Drive and load the data
gpt2.mount_gdrive()
file_name = "ed_sheeran_lyrics.txt"
gpt2.copy_file_from_gdrive(file_name)

# Finetune the model with the new data
sess = gpt2.start_tf_sess()
gpt2.finetune(sess,
              dataset=file_name,
              model_name='124M',
              steps=1000,
              restore_from='fresh',
              run_name='run_ed_1',
              print_every=10,
              sample_every=200,
              save_every=500
              )

# Save the finetuned checkpoint to Google Drive
gpt2.copy_checkpoint_to_gdrive(run_name='run_ed_1')

# In a later session, restore the checkpoint from Google Drive instead of retraining
gpt2.copy_checkpoint_from_gdrive(run_name='run_ed_1')
Step 3, give it some Valentine’s Day prompts to complete
Now that we have our own neural network that’s trained to be a ̶h̶o̶p̶e̶l̶e̶s̶s̶ hopeful romantic, let’s give it a text prompt like “To my Valentine” and make it work for us:
prefix = input("Your prompt (e.g. To my Valentine,): ")
gpt2.generate(sess,
              length=50,
              temperature=0.7,
              prefix=prefix,
              nsamples=5,
              batch_size=5,
              run_name='run_ed_1'
              )
Some of us like/need to take advantage of the law of large numbers when it comes to dating, so the script generates 5 messages at a time. The results are…well not what you might’ve hoped, but exactly what you expected. Here’s an excerpt. Some are…concerning, to say the least.
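Because the generation length is fixed at 50 tokens, messages tend to stop mid-sentence. One possible fix, sketched here as an assumption rather than something the original notebook does, is to trim each sample back to its last complete sentence, or failing that, its last complete line. `trim_message` is a hypothetical helper; to feed it, you could pass `return_as_list=True` to `gpt2.generate` so the samples come back as Python strings instead of being printed.

```python
def trim_message(text: str) -> str:
    """Cut a generated sample back to its last complete sentence or line."""
    text = text.strip()
    # Prefer the last sentence-ending punctuation mark, if there is one.
    cut = max(text.rfind(mark) for mark in ".!?")
    if cut != -1:
        return text[: cut + 1]
    # Lyrics often have no punctuation, so fall back to dropping the last partial line.
    last_break = text.rfind("\n")
    if last_break != -1:
        return text[:last_break].rstrip()
    # Nothing to trim on: return the sample unchanged.
    return text


# Usage with made-up samples (not real model output):
samples = ["To my Valentine, I love you. And I will alw"]
print([trim_message(s) for s in samples])
```

Trimming makes the messages look tidier, but it does nothing for the content. The concerning ones stay concerning.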
Lessons learned
It’s been a nice exercise in creating a basic finetuning pipeline for pre-trained language models, so do go ahead and try it yourself for your own use cases. Real world use of most of the above texts could lead to a drastic update in relationship status from “in a relationship” to “it’s complicated”. There are a few straightforward ways to improve these results:
- improve input data quality. Garbage in is still garbage out. Ed’s lyrics are of course Genius (pun intended). But we did not put any serious effort into making sure that all lyrics put into the model are what we would find desirable output. From the output, it seems that the input lyrics included quite a few songs about heartbreaks as well.
- add more data. Currently, with 200 songs, there are only 10,000 or so lines of text. So we’re starving a very data-hungry language model. Keep in mind that newer models are trained on pretty much the entirety of the English web.
- use a bigger model. We used the smallest version of the GPT-2 model because bigger ones would either take too long to train or would literally not run on our available hardware. This goes to show how power hungry language models are.
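The first two points can be checked before any training starts. Here is a minimal sketch, assuming the lyrics are available as a list of strings with one entry per song. The keyword list and the `max_hits` threshold are hypothetical and hand-picked for illustration, and matching is on exact words only, so "crying" would not count as "cry".

```python
import re

# Hypothetical, hand-picked words that hint at a heartbreak song.
UNWANTED = ("goodbye", "cry", "broken", "lonely")


def is_valentine_friendly(lyrics: str, unwanted=UNWANTED, max_hits: int = 2) -> bool:
    """Return True if the song contains at most max_hits unwanted words."""
    words = re.findall(r"[a-z']+", lyrics.lower())
    hits = sum(1 for word in words if word in unwanted)
    return hits <= max_hits


def count_lines(songs: list[str]) -> int:
    """Count non-empty lyric lines across all songs."""
    return sum(1 for lyrics in songs for line in lyrics.splitlines() if line.strip())


# Usage with made-up lyrics:
songs = ["I love you\nYou are my sunshine", "goodbye goodbye\nI cry, so lonely"]
kept = [s for s in songs if is_valentine_friendly(s)]
print(len(kept), "songs kept,", count_lines(kept), "lines of training text")
```

Note the trade-off: filtering makes the data cleaner but also smaller, so it works against the "add more data" point unless you pull in more songs to compensate.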
Try it yourself
Here’s the code for the prediction colab; give it a try if you want to generate your own messages. You can try a bunch of different input prompts and see what rolls out. If you find hilarious results, share the joy by posting them, and feel free to tag me on Twitter or LinkedIn.
And of course, Happy Valentine’s Day if you made it all the way here ;)
Sources
LyricsGenius python API wrapper
Lyrics downloader colab code
gpt-2-simple Python library (includes a link to the original Colab by Max Woolf that I slightly adapted)