Let’s start off by loading the data from the TidyTuesday GitHub repository.

Reading in data

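library(tidyverse)
library(tidymodels)
library(textrecipes)  # text steps for recipes, e.g. step_tokenize() and step_tfidf()
library(gt)
theme_set(
  theme_light() +
    theme(panel.grid.minor = element_blank())
)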

beyonce_lyrics <-
  readr::read_csv(
    'https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-29/beyonce_lyrics.csv'
  ) %>%
  janitor::clean_names()

Rows: 22616 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): line, song_name, artist_name
dbl (3): song_id, artist_id, song_line

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

taylor_swift_lyrics <-
  readr::read_csv(
    'https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-29/taylor_swift_lyrics.csv'
  )  %>%
  janitor::clean_names()

Rows: 132 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): Artist, Album, Title, Lyrics

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
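The column specification above lists the Taylor Swift columns as Artist, Album, Title and Lyrics, while the tibble printed further down shows them in lower case. That is the effect of janitor::clean_names(). Here is a minimal sketch on a made-up inline CSV (the data and the name toy_csv are assumptions for illustration, not part of the TidyTuesday files), which also uses show_col_types = FALSE as the message suggests:

```r
library(readr)
library(janitor)

# Hypothetical inline CSV; I() tells read_csv() to treat the string as literal data
toy_csv <- I("Artist,Song Name\nTaylor Swift,Tim McGraw\nTaylor Swift,Picture to Burn\n")

# show_col_types = FALSE silences the column specification message
toy <- read_csv(toy_csv, show_col_types = FALSE)

# clean_names() converts the column names to snake_case
toy <- clean_names(toy)

names(toy)
```

The same show_col_types = FALSE argument could be added to the two read_csv() calls above if the messages are not wanted.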

Data pre-processing and feature engineering

The beyonce_lyrics and taylor_swift_lyrics datasets are the two we need for building our machine learning classifier. Let’s have a closer look at both.

beyonce_lyrics

# A tibble: 22,616 × 6
   line                        song_id song_name artist_id artist_name song_line
   <chr>                         <dbl> <chr>         <dbl> <chr>           <dbl>
 1 If I ain't got nothing, I …   50396 1+1             498 Beyoncé             1
 2 If I ain't got something, …   50396 1+1             498 Beyoncé             2
 3 'Cause I got it with you      50396 1+1             498 Beyoncé             3
 4 I don't know much about al…   50396 1+1             498 Beyoncé             4
 5 And it's me and you           50396 1+1             498 Beyoncé             5
 6 That's all we'll have when…   50396 1+1             498 Beyoncé             6
 7 'Cause baby, we ain't got …   50396 1+1             498 Beyoncé             7
 8 Darling, you got enough fo…   50396 1+1             498 Beyoncé             8
 9 So come on, baby, make lov…   50396 1+1             498 Beyoncé             9
10 When my days look low         50396 1+1             498 Beyoncé            10
# ℹ 22,606 more rows

glimpse(beyonce_lyrics)

Rows: 22,616
Columns: 6
$ line        <chr> "If I ain't got nothing, I got you", "If I ain't got somet…
$ song_id     <dbl> 50396, 50396, 50396, 50396, 50396, 50396, 50396, 50396, 50…
$ song_name   <chr> "1+1", "1+1", "1+1", "1+1", "1+1", "1+1", "1+1", "1+1", "1…
$ artist_id   <dbl> 498, 498, 498, 498, 498, 498, 498, 498, 498, 498, 498, 498…
$ artist_name <chr> "Beyoncé", "Beyoncé", "Beyoncé", "Beyoncé", "Beyoncé", "Be…
$ song_line   <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…

taylor_swift_lyrics

# A tibble: 132 × 4
   artist       album        title                      lyrics                  
   <chr>        <chr>        <chr>                      <chr>                   
 1 Taylor Swift Taylor Swift Tim McGraw                 "He said the way my blu…
 2 Taylor Swift Taylor Swift Picture to Burn            "State the obvious, I d…
 3 Taylor Swift Taylor Swift Teardrops on my Guitar     "Drew looks at me,\nI f…
 4 Taylor Swift Taylor Swift A Place in This World      "I don't know what I wa…
 5 Taylor Swift Taylor Swift Cold As You                "You have a way of comi…
 6 Taylor Swift Taylor Swift The Outside                "I didn't know what I w…
 7 Taylor Swift Taylor Swift Tied Together With A Smile "Seems the only one who…
 8 Taylor Swift Taylor Swift Stay Beautiful             "Cory's eyes are like a…
 9 Taylor Swift Taylor Swift Should’ve Said No          "It's strange to think …
10 Taylor Swift Taylor Swift Mary’s Song                "She said\n\"I was seve…
# ℹ 122 more rows

glimpse(taylor_swift_lyrics)

Rows: 132
Columns: 4
$ artist <chr> "Taylor Swift", "Taylor Swift", "Taylor Swift", "Taylor Swift",…
$ album  <chr> "Taylor Swift", "Taylor Swift", "Taylor Swift", "Taylor Swift",…
$ title  <chr> "Tim McGraw", "Picture to Burn", "Teardrops on my Guitar", "A P…
$ lyrics <chr> "He said the way my blue eyes shinx\nPut those Georgia stars to…

The beyonce_lyrics data is structured differently from the taylor_swift_lyrics data. Taylor Swift’s lyrics are stored as one row per song, while Beyoncé’s lyrics are stored as one row per line of a song. This is a problem, and we’ll have to rectify it before building our classifier.

To rectify this, I will collapse Beyoncé’s lyrics to one row per song, which gives them the same structure as Taylor Swift’s lyrics.

beyonce_lyrics <- beyonce_lyrics %>%
  group_by(
    artist_name, song_name
  ) %>%
  summarise(
    lyrics = paste0(line, collapse = ' ')
  ) %>% 
  ungroup() %>%
  select(
    'artist' = artist_name,
    'title' = song_name,
    lyrics
  )

`summarise()` has grouped output by 'artist_name'. You can override using the
`.groups` argument.
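The message above is informational only: summarise() drops the last grouping level and keeps the rest, which is why the pipeline calls ungroup() afterwards. As a small sketch of an alternative, on hypothetical toy data (toy_lines and its contents are made up for illustration), passing .groups = "drop" returns an ungrouped result directly:

```r
library(dplyr)

# Hypothetical toy data in the same one-row-per-line layout as beyonce_lyrics
toy_lines <- tibble(
  artist_name = "Artist A",
  song_name   = c("Song A", "Song A", "Song B"),
  line        = c("first line", "second line", "only line")
)

toy_songs <- toy_lines %>%
  group_by(artist_name, song_name) %>%
  # .groups = "drop" returns an ungrouped tibble, so no message and no ungroup()
  summarise(lyrics = paste0(line, collapse = " "), .groups = "drop")

toy_songs
```

Lines are pasted together in the order in which they appear within each song, so this approach assumes the rows are already sorted by line number.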

beyonce_lyrics

# A tibble: 391 × 3
   artist  title                                                lyrics          
   <chr>   <chr>                                                <chr>           
 1 Beyoncé "\"Self-Titled\" Part 1 . The Visual Album"          "I see music. I…
 2 Beyoncé "\"Self-Titled\" Part 2 . Imperfection"              "There's a mome…
 3 Beyoncé "***Flawless (Ft. Chimamanda Ngozi Adichie)"         "Your challenge…
 4 Beyoncé "***Flawless (Remix) (Ft. Nicki Minaj)"              "Dum-da-de-da D…
 5 Beyoncé "1+1"                                                "If I ain't got…
 6 Beyoncé "2017 Grammy's Best Urban Contemporary Album Speech" "Thank you so m…
 7 Beyoncé "6 Inch (Ft. The Weeknd)"                            "Six inch heels…
 8 Beyoncé "7/11"                                               "Shoulders side…
 9 Beyoncé "7/11 (Homecoming Live)"                             "Goddamn, godda…
10 Beyoncé "A Woman Like Me"                                    "Do you think Y…
# ℹ 381 more rows

This looks much better, and will allow us to merge the data with the Taylor Swift data. The outcome y that we are trying to predict is the ‘artist’ column, and the features x are derived from the song lyrics. To get the lyrics into a usable state for our model, we will have to perform some preprocessing and feature engineering.

taylor_swift_lyrics <- taylor_swift_lyrics %>% 
  select(
    artist, title, lyrics
  )

data <- bind_rows(
  taylor_swift_lyrics, beyonce_lyrics
)

data

# A tibble: 523 × 3
   artist       title                      lyrics                               
   <chr>        <chr>                      <chr>                                
 1 Taylor Swift Tim McGraw                 "He said the way my blue eyes shinx\…
 2 Taylor Swift Picture to Burn            "State the obvious, I didn't get my …
 3 Taylor Swift Teardrops on my Guitar     "Drew looks at me,\nI fake a smile s…
 4 Taylor Swift A Place in This World      "I don't know what I want, so don't …
 5 Taylor Swift Cold As You                "You have a way of coming easily to …
 6 Taylor Swift The Outside                "I didn't know what I would find\nWh…
 7 Taylor Swift Tied Together With A Smile "Seems the only one who doesn't see …
 8 Taylor Swift Stay Beautiful             "Cory's eyes are like a jungle\nHe s…
 9 Taylor Swift Should’ve Said No          "It's strange to think the songs we …
10 Taylor Swift Mary’s Song                "She said\n\"I was seven, and you we…
# ℹ 513 more rows
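Before modelling, it is worth noting how unevenly the two artists are represented. The following base R sketch only restates the row counts printed in the tibble headers above (391 Beyoncé songs and 132 Taylor Swift songs); the variable names are made up for illustration.

```r
# Row counts taken from the tibble headers printed above
songs <- c("Beyoncé" = 391, "Taylor Swift" = 132)

total <- sum(songs)      # 523 songs after bind_rows()
share <- songs / total   # proportion of songs per artist

round(share, 3)
```

Roughly three out of four songs are by Beyoncé, so a classifier that always predicts Beyoncé would already be right about 75% of the time. That is a useful yardstick for the accuracy figures reported later.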

After merging the data together, we will split our data into a separate training and testing dataset.

Model building

set.seed(1)
splits <- initial_split(data, strata = artist)

train <- training(splits)
test <- testing(splits)

The data has now been split, stratified by artist: 75% of the data (the default proportion of initial_split()) will be used to train our classifier, and the remaining 25% is held out as a test set to estimate the overall performance of the model.
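To illustrate what the strata argument does, here is a minimal sketch on hypothetical toy data (the data frame toy and its 300/100 class counts are assumptions for illustration). With stratification, the class shares in the training and testing portions should stay close to those of the full data.

```r
library(rsample)

# Hypothetical toy data with a roughly 3:1 class imbalance
toy <- data.frame(
  artist = rep(c("A", "B"), times = c(300, 100)),
  id     = seq_len(400)
)

set.seed(1)
toy_split <- initial_split(toy, prop = 3/4, strata = artist)

# Class shares in the training and testing portions
prop.table(table(training(toy_split)$artist))
prop.table(table(testing(toy_split)$artist))
```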

Next, we will create a ‘recipe’ that performs feature engineering on our training data. It consists of several steps: tokenizing the lyrics, removing stop words, excluding words that appear fewer than 20 times (and keeping at most the 500 most frequent tokens), computing term frequency-inverse document frequency (TF-IDF) weights, and finally normalizing the resulting predictors.

rec <- recipe(artist ~ lyrics, data = train) %>%
  step_tokenize(lyrics) %>%
  step_stopwords(lyrics) %>%
  step_tokenfilter(lyrics, min_times = 20, max_tokens = 500) %>%
  step_tfidf(lyrics) %>%
  step_normalize(all_predictors())

rec

── Recipe ──────────────────────────────────────────────────────────────────────

── Inputs

Number of variables by role

outcome:   1
predictor: 1

── Operations

• Tokenization for: lyrics

• Stop word removal for: lyrics

• Text filtering for: lyrics

• Term frequency-inverse document frequency with: lyrics

• Centering and scaling for: all_predictors()
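To make the TF-IDF step more concrete, here is a minimal base R sketch on three made-up documents. It uses the textbook definition (term frequency multiplied by the log of the inverse document frequency). This is an assumption for illustration: step_tfidf() has its own defaults, which may include smoothing of the IDF term, so its numbers can differ from this sketch.

```r
# Three hypothetical "songs"
docs <- c(a = "love love you", b = "love me", c = "drive fast")

# Tokenize on whitespace and build the vocabulary
tokens <- strsplit(tolower(docs), "\\s+")
vocab  <- sort(unique(unlist(tokens)))

# Term counts: one row per document, one column per term
counts <- t(vapply(
  tokens,
  function(tok) tabulate(match(tok, vocab), nbins = length(vocab)),
  integer(length(vocab))
))
colnames(counts) <- vocab

tf    <- counts / rowSums(counts)                # share of each term within a document
idf   <- log(nrow(counts) / colSums(counts > 0)) # rarer terms get a larger weight
tfidf <- sweep(tf, 2, idf, `*`)

round(tfidf, 3)
```

A word such as "love", which appears in two of the three documents, gets a lower weight than a word that appears in only one.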

Now that we have a ‘recipe’ for pre-processing our data into a usable state to feed into our model, we will create a specification of the classifier. In this instance we will be building a support vector machine (SVM) classifier with a radial basis function (RBF) kernel, using the kernlab package as the engine.

svm_spec <- svm_rbf(cost = tune(), rbf_sigma = tune()) %>%
  set_engine('kernlab') %>%
  set_mode('classification')

svm_spec

Radial Basis Function Support Vector Machine Model Specification (classification)

Main Arguments:
  cost = tune()
  rbf_sigma = tune()

Computational engine: kernlab
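The two tuning parameters play different roles: cost is the penalty for violating the margin, while rbf_sigma controls how quickly the similarity between two songs decays with their distance. The sketch below assumes the usual RBF parameterisation, exp(-sigma * squared distance); the function name rbf_kernel and the two feature vectors are hypothetical.

```r
# RBF kernel in the parameterisation assumed here: exp(-sigma * squared distance)
rbf_kernel <- function(x, y, sigma) exp(-sigma * sum((x - y)^2))

# Two hypothetical feature vectors (e.g. normalised TF-IDF values)
x <- c(1.2, -0.5, 0.3)
y <- c(-0.8, 0.4, 1.1)

# Similarity of x and y for a very small, a small and a larger sigma
sigmas <- c(1e-9, 1e-4, 1e-1)
k <- sapply(sigmas, function(s) rbf_kernel(x, y, s))
k
```

With a very small sigma the kernel is close to 1 for every pair of points, so all songs look almost equally similar to the model. Larger values make the similarity drop off faster with distance.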

Model parameter tuning

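The model parameters cost and rbf_sigma will be tuned via a grid search over 10 candidate parameter combinations, each evaluated with 10-fold cross-validation (the default of vfold_cv()) stratified by artist.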

svm_wf <- workflow() %>%
  add_model(svm_spec) %>%
  add_recipe(rec)

svm_tune_folds <- vfold_cv(train, strata = artist)

set.seed(1)
svm_tune_res <- tune_grid(
  svm_wf,
  resamples = svm_tune_folds,
  grid = 10
)

tune_metrics <- svm_tune_res %>% collect_metrics()

tune_metrics %>%
  filter(., .metric == 'accuracy') %>%
  ggplot(.,
         aes(y = rbf_sigma, x = cost, color = mean)) +
  geom_point() +
  scale_color_viridis_c()

svm_tune_res %>% show_best(metric = 'accuracy')

# A tibble: 5 × 8
    cost rbf_sigma .metric  .estimator  mean     n std_err .config              
   <dbl>     <dbl> <chr>    <chr>      <dbl> <int>   <dbl> <chr>                
1 3.15    2.37e- 4 accuracy binary     0.804    10 0.0150  Preprocessor1_Model04
2 0.439   8.95e- 2 accuracy binary     0.747    10 0.00199 Preprocessor1_Model01
3 0.0342  4.45e- 3 accuracy binary     0.747    10 0.00199 Preprocessor1_Model02
4 0.0177  4.34e- 8 accuracy binary     0.747    10 0.00199 Preprocessor1_Model03
5 0.0856  8.84e-10 accuracy binary     0.747    10 0.00199 Preprocessor1_Model05

best_accuracy <- svm_tune_res %>% select_best(., metric = 'accuracy')

best_accuracy

# A tibble: 1 × 3
   cost rbf_sigma .config              
  <dbl>     <dbl> <chr>                
1  3.15  0.000237 Preprocessor1_Model04
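Passing grid = 10 asks tune_grid() to generate ten candidate combinations automatically, and in the results above several of those candidates ended up with identical accuracy. One alternative, shown here only as a sketch and not as what this post uses, is to build an explicit grid with the dials package and pass it to tune_grid(). The object name svm_grid and the choice of 4 levels are assumptions.

```r
library(dials)

# Regular grid: 4 levels per parameter, spread over each parameter's default range
svm_grid <- grid_regular(cost(), rbf_sigma(), levels = 4)

svm_grid
```

The resulting tibble has one row per combination (16 here) and could be supplied as grid = svm_grid in place of grid = 10, at the price of a longer tuning run.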

Final Model

The optimal tuning parameters for accuracy appear to be approximately 3.15 for cost and 0.000237 for rbf_sigma. We will use these parameters for our final model. We will fit our final model on the full training data, and assess the performance on the test data.

svm_final_wf <- finalize_workflow(
  svm_wf,
  best_accuracy
)

final_res <- svm_final_wf %>%
  last_fit(splits)

final_metrics <- final_res %>% collect_metrics()

final_metrics

# A tibble: 2 × 4
  .metric  .estimator .estimate .config             
  <chr>    <chr>          <dbl> <chr>               
1 accuracy binary         0.802 Preprocessor1_Model1
2 roc_auc  binary         0.892 Preprocessor1_Model1

Our final model, using the parameters tuned for accuracy, achieved an accuracy of 80.2% and an ROC AUC of 0.89 on the test data.

Let’s have a closer look at the performance of the model via a confusion matrix.

final_preds <- final_res %>%
  collect_predictions()


final_preds %>%
  conf_mat(
    ., artist, .pred_class
  )

              Truth
Prediction     Beyoncé Taylor Swift
  Beyoncé           96           24
  Taylor Swift       2            9

final_preds %>%
  conf_mat(
    ., artist, .pred_class
  ) %>%
  summary()

# A tibble: 13 × 3
   .metric              .estimator .estimate
   <chr>                <chr>          <dbl>
 1 accuracy             binary         0.802
 2 kap                  binary         0.324
 3 sens                 binary         0.980
 4 spec                 binary         0.273
 5 ppv                  binary         0.8  
 6 npv                  binary         0.818
 7 mcc                  binary         0.395
 8 j_index              binary         0.252
 9 bal_accuracy         binary         0.626
10 detection_prevalence binary         0.916
11 precision            binary         0.8  
12 recall               binary         0.980
13 f_meas               binary         0.881
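Several of the metrics above can be reproduced by hand from the four cells of the confusion matrix. The base R sketch below copies the counts from the matrix printed earlier and treats Beyoncé as the positive class, which matches the sens and spec values in the summary; the variable names are made up for illustration.

```r
# Counts copied from the confusion matrix above (rows = prediction, columns = truth)
cm <- matrix(
  c(96, 2, 24, 9),
  nrow = 2,
  dimnames = list(
    Prediction = c("Beyoncé", "Taylor Swift"),
    Truth      = c("Beyoncé", "Taylor Swift")
  )
)

accuracy <- sum(diag(cm)) / sum(cm)   # (96 + 9) / 131
sens     <- cm[1, 1] / sum(cm[, 1])   # Beyoncé songs correctly identified: 96 / 98
spec     <- cm[2, 2] / sum(cm[, 2])   # Taylor Swift songs correctly identified: 9 / 33
ppv      <- cm[1, 1] / sum(cm[1, ])   # 96 / 120
npv      <- cm[2, 2] / sum(cm[2, ])   # 9 / 11
bal_acc  <- (sens + spec) / 2

round(c(accuracy = accuracy, sens = sens, spec = spec,
        ppv = ppv, npv = npv, bal_accuracy = bal_acc), 3)
```

The gap between sensitivity and specificity is why the balanced accuracy (0.626) is so much lower than the plain accuracy (0.802).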

Closing

The model classifies the artist from the song lyrics with an overall accuracy of 80.2%, a modest improvement over the 74.8% that would be obtained by always predicting Beyoncé (98 of the 131 test songs are hers). The model also does a far better job at classifying Beyoncé’s lyrics (96 of 98 songs correct) than Taylor Swift’s (9 of 33 songs correct).

Let’s have a closer look at the songs that were misclassified by our model.

test %>%
  select(., -artist) %>%
  bind_cols(final_preds) %>%
  select(
    artist, title,
    .pred_Beyoncé, `.pred_Taylor Swift`,
    .pred_class
  ) %>%
  filter(
    artist != .pred_class
  ) %>%
  mutate(
    across(c(.pred_Beyoncé, `.pred_Taylor Swift`), ~ paste0(format(round(.x * 100, 1), nsmall = 1, trim = TRUE), '%'))
  ) %>%
  group_by(., artist) %>%
  gt() %>%
  cols_label(
    artist = 'Artist',
    title = 'Title',
    .pred_Beyoncé = '% Beyonce',
    `.pred_Taylor Swift` = '% Taylor',
    .pred_class = 'Prediction'
  )

Title                          % Beyonce  % Taylor  Prediction
Taylor Swift
Should’ve Said No              50.4%      49.6%     Beyoncé
I’m Only Me When I’m With You  11.4%      88.6%     Beyoncé
A Perfectly Good Heart         34.1%      65.9%     Beyoncé
Untouchable                    92.7%      7.3%      Beyoncé
Hey Stephen                    50.5%      49.5%     Beyoncé
Breathe                        8.9%       91.1%     Beyoncé
Tell Me Why                    45.5%      54.5%     Beyoncé
Better Than Revenge            8.4%       91.6%     Beyoncé
Innocent                       31.6%      68.4%     Beyoncé
Ours                           78.3%      21.7%     Beyoncé
If This Was a Movie            11.1%      88.9%     Beyoncé
22                             27.2%      72.8%     Beyoncé
Sad Beautiful Tragic           58.4%      41.6%     Beyoncé
Come Back Be Here              31.9%      68.1%     Beyoncé
Welcome to New York            16.3%      83.7%     Beyoncé
Out of the Woods               26.7%      73.3%     Beyoncé
How You Get The Girl           59.6%      40.4%     Beyoncé
I Know Places                  47.7%      52.3%     Beyoncé
Ready For It                   85.1%      14.9%     Beyoncé
False God                      39.8%      60.2%     Beyoncé
ME                             76.7%      23.3%     Beyoncé
illicit affairs                97.3%      2.7%      Beyoncé
betty                          14.7%      85.3%     Beyoncé
peace                          89.0%      11.0%     Beyoncé
Beyoncé
Creole                         0.2%       99.8%     Taylor Swift
Diamonds                       3.1%       96.9%     Taylor Swift

Session info

sessionInfo()

R version 4.2.2 (2022-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22621)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] kernlab_0.9-32     stopwords_2.3      gt_0.9.0           future_1.32.0     
 [5] textrecipes_1.0.2  lubridate_1.9.2    forcats_1.0.0      stringr_1.5.0     
 [9] readr_2.1.4        tidyverse_2.0.0    yardstick_1.1.0    workflowsets_1.0.1
[13] workflows_1.1.3    tune_1.1.1         tidyr_1.3.0        tibble_3.2.1      
[17] rsample_1.1.1      recipes_1.0.5      purrr_1.0.1        parsnip_1.1.0     
[21] modeldata_1.1.0    infer_1.0.4        ggplot2_3.4.2      dplyr_1.1.1       
[25] dials_1.2.0        scales_1.2.1       broom_1.0.4        tidymodels_1.0.0  
[29] tidytuesdayR_1.0.2

loaded via a namespace (and not attached):
 [1] colorspace_2.1-0    ellipsis_0.3.2      class_7.3-20       
 [4] snakecase_0.11.0    fs_1.6.1            rstudioapi_0.14    
 [7] farver_2.1.1        listenv_0.9.0       furrr_0.3.1        
[10] SnowballC_0.7.0     bit64_4.0.5         prodlim_2023.03.31 
[13] fansi_1.0.4         xml2_1.3.3          codetools_0.2-18   
[16] splines_4.2.2       knitr_1.41          jsonlite_1.8.4     
[19] compiler_4.2.2      httr_1.4.5          backports_1.4.1    
[22] Matrix_1.5-1        fastmap_1.1.0       cli_3.6.1          
[25] htmltools_0.5.4     tools_4.2.2         gtable_0.3.3       
[28] glue_1.6.2          Rcpp_1.0.10         cellranger_1.1.0   
[31] DiceDesign_1.9      vctrs_0.6.1         iterators_1.0.14   
[34] timeDate_4022.108   gower_1.0.1         xfun_0.38          
[37] globals_0.16.2      rvest_1.0.3         timechange_0.2.0   
[40] lifecycle_1.0.3     renv_0.16.0         MASS_7.3-58.1      
[43] ipred_0.9-14        vroom_1.6.1         hms_1.1.3          
[46] parallel_4.2.2      yaml_2.3.6          curl_5.0.0         
[49] sass_0.4.5          rpart_4.1.19        stringi_1.7.8      
[52] tokenizers_0.3.0    foreach_1.5.2       lhs_1.1.6          
[55] hardhat_1.3.0       lava_1.7.2.1        rlang_1.1.0        
[58] pkgconfig_2.0.3     evaluate_0.19       lattice_0.20-45    
[61] labeling_0.4.2      htmlwidgets_1.6.2   bit_4.0.5          
[64] tidyselect_1.2.0    parallelly_1.35.0   magrittr_2.0.3     
[67] R6_2.5.1            generics_0.1.3      pillar_1.9.0       
[70] withr_2.5.0         survival_3.4-0      nnet_7.3-18        
[73] future.apply_1.10.0 janitor_2.2.0       crayon_1.5.2       
[76] utf8_1.2.3          tzdb_0.3.0          rmarkdown_2.19     
[79] usethis_2.1.6       grid_4.2.2          readxl_1.4.2       
[82] data.table_1.14.8   digest_0.6.31       GPfit_1.0-8        
[85] munsell_0.5.0       viridisLite_0.4.1

Reuse

CC BY 4.0

Citation

BibTeX citation:

@online{luu2020,
  author = {Luu, Michael},
  title = {Natural {Language} {Processing} {(NLP)} and Developing a
    Machine Learning Classifier on {Beyonce} and {Taylor} {Swift} Lyrics
    {\#TidyTuesday}},
  date = {2020-10-02},
  langid = {en}
}

For attribution, please cite this work as:

Luu, Michael. 2020. “Natural Language Processing (NLP) and Developing a Machine Learning Classifier on Beyonce and Taylor Swift Lyrics #TidyTuesday.” October 2, 2020.