Creating new variables from word list and assigning 1 or 0 if the word appears in the string of a separate variable in R

Question

I have a free-text variable in which a person enters notes for each observation. Some of the words in any given observation may be key words that need to be tracked.

If I have a list of the key words, is there a streamlined way to create variables from that list and then search the existing observations to flag whether or not each word appears? An extra complication is the human element: words can't be counted on to appear in a particular order, delimiters such as a space may be omitted, and letters may be upper or lower case. A word like "flights" might also be missing the "s". Because the keywords may change, is there also a way to store the words in a value that can be updated, so the code can be rerun to update the variables?

In the df below, the key words I'm looking for are abc, xyz, and flights.

df <- read.table(text =
                   "ID Notes
ID-0001   'ABC project xyz'
ID-0002   'XYZ'
ID-0003   'ABCschedule flightsok test'
ID-0004   'flight, abc'
ID-0005   'normal notes no key'", header = TRUE)

The desired output would look like this:

desired.output <- read.table(text =
                               "ID Notes abc xyz flights
ID-0001   'ABC project xyz'  1  1  0
ID-0002   'XYZ' 0  1  0
ID-0003   'ABCschedule flightsok test'  1  0  1
ID-0004   'flight, abc' 1  0  1
ID-0005   'normal notes no key'  0  0  0", header = TRUE)

I found this similar question, but it wasn't quite what I was looking for because it creates a variable from every word in an observation: "R: Splitting a string to different variables and assign 1 if string contains this word".

Thank you for the help!

akrun · Accepted Answer · 2021-10-19T15:59:13.890

We may use grepl for this, with a word boundary (\\b) and ignore.case = TRUE:

transform(df, abc = +(grepl('\\babc', Notes, ignore.case = TRUE)), 
     xyz = +(grepl('\\bxyz\\b', Notes, ignore.case = TRUE)), 
     flights = +(grepl('\\bflights?', Notes, ignore.case = TRUE)))
       ID                      Notes abc xyz flights    
1 ID-0001            ABC project xyz   1   1       0
2 ID-0002                        XYZ   0   1       0
3 ID-0003 ABCschedule flightsok test   1   0       1
4 ID-0004                flight, abc   1   0       1
5 ID-0005        normal notes no key   0   0       0
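
To see how the word boundary changes what gets matched, here is a small sketch on made-up strings (the vector x is illustrative and not taken from the question's data). It compares a leading boundary, boundaries on both sides, and no boundary at all.

```r
# Hypothetical sample strings, chosen only to illustrate the patterns
x <- c("ABCschedule", "scheduleABC", "abc", "flight, abc")

# Leading boundary only: the keyword must start a word
start_only <- grepl("\\babc", x, ignore.case = TRUE)
start_only
# [1]  TRUE FALSE  TRUE  TRUE

# Boundaries on both sides: the keyword must stand alone as a word
whole_word <- grepl("\\babc\\b", x, ignore.case = TRUE)
whole_word
# [1] FALSE FALSE  TRUE  TRUE

# No boundary: the keyword may appear anywhere in the string
anywhere <- grepl("abc", x, ignore.case = TRUE)
anywhere
# [1] TRUE TRUE TRUE TRUE
```

So '\\bxyz\\b' above would not flag something like "XYZproject", while '\\babc' does flag "ABCschedule". Which variant fits depends on how the notes are typically written.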

Or just loop over the words of interest and use grepl, again ignoring case and using 'flights?' so that "flight" is also matched

df[c('abc', 'xyz', 'flights')] <- +(sapply(c('abc', 'xyz', 'flights?'), function(x) grepl(x, df$Notes, ignore.case = TRUE)))
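
Since the keywords may change, one option is to keep them in a single named vector and derive the columns from it. The sketch below assumes partial, case-insensitive matching is what is wanted. The names flag_keywords and keywords are hypothetical, made up for illustration, and each value is treated as a regular expression.

```r
# Sample data from the question
df <- data.frame(
  ID = sprintf("ID-%04d", 1:5),
  Notes = c("ABC project xyz", "XYZ", "ABCschedule flightsok test",
            "flight, abc", "normal notes no key"),
  stringsAsFactors = FALSE
)

# Names become the new column names; values are the regex patterns.
# 'flights?' makes the trailing "s" optional, so "flight" also matches.
keywords <- c(abc = "abc", xyz = "xyz", flights = "flights?")

# Hypothetical helper (not from any package): adds one 0/1 column per keyword
flag_keywords <- function(data, text_col, patterns) {
  data[names(patterns)] <- lapply(patterns, function(p) {
    as.integer(grepl(p, data[[text_col]], ignore.case = TRUE))
  })
  data
}

result <- flag_keywords(df, "Notes", keywords)
result
```

To track a new word, add an entry to keywords and rerun flag_keywords. If a keyword should only match at the start of a word, prefix its pattern with \\b.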

edited Oct 19 '21 at 15:59

answered Oct 18 '21 at 20:55

Is your solution case sensitive? I noticed that in line 3 the ABC at the beginning wasn't counted (there are times when human error causes words to have no space) and in line 4, flight has no "s" and is not marked. – ks54 Oct 18 '21 at 21:58
@ks54 I would add word boundary (`\\b`) and ignore.case = TRUE as in my update – akrun Oct 19 '21 at 01:42
Thank you for the response, but there is still an issue in the results you’re showing, and when I copy and paste your code I get slightly different outcomes. In line 2, “XYZ” is uppercase and does not show up in the results. Line 3 has the same problem, with “ABC” not showing in the results. When I run it, the difference is in line 3, which also omits “flights” from the results. Any ideas? – ks54 Oct 19 '21 at 14:27
@ks54 you can check now with the update. I think you want to match partial strings as well. – akrun Oct 19 '21 at 15:59
