
I'm using the R tm package for text analysis on a Facebook group, and I find that the removeWords function isn't working for me. I tried to combine the French stopwords with my own, but they still appear in the output. So I created a file named "french.txt" with my own list and read it with the following commands:

nom_fichier <- "Analyse textuelle/french.txt"
my_stop_words <- readLines(nom_fichier, encoding="UTF-8")
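
For reference, the combination I attempted was along these lines (a sketch; stopwords("french") is the built-in list that ships with tm):

# Sketch: tm's built-in French list combined with my own custom words
combined_stop_words <- c(stopwords("french"), my_stop_words)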

Here is the data for text mining:

text <- readLines(groupe_fb_ief, encoding="UTF-8")
docs <- Corpus(VectorSource(text))
inspect(docs) 

Here are the tm_map commands:

docs <- tm_map(docs, tolower)
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removeWords, my_stop_words)
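
One possible culprit, offered as a guess rather than something from the post: recent versions of tm expect base functions like tolower to be wrapped in content_transformer(), and removeWords matches case-sensitively, so a lowercased corpus needs a lowercased stop word list. A sketch:

docs <- tm_map(docs, content_transformer(tolower))  # wrap base functions for tm
my_stop_words <- tolower(my_stop_words)             # removeWords is case sensitive
docs <- tm_map(docs, removeWords, my_stop_words)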

Applying all of that, it still isn't working and I don't understand why. I even tried to change the order of the commands, with no result.

Do you have any idea? Is it possible to change the French stopwords within R? Where is this list located?
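
As far as I can tell (still unverified), tm ships its stop word lists as plain text files inside the installed package, and the French list can be loaded and extended as an ordinary character vector instead of editing the file:

library(tm)
system.file("stopwords", package = "tm")  # directory holding the bundled lists
fr <- stopwords("french")                 # the French list as a character vector
combined <- c(fr, my_stop_words)          # extend the vector rather than the file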

Thanks!!

  • What does "still not working" mean exactly? Are you getting a proper corpus but it still contains all the words? It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. – MrFlick Oct 28 '19 at 20:25
  • Hi, thank you for your answer. It means that when I generate the wordcloud, the words contained in my_stop_words still appear, whereas they should have been removed by the removeWords function. – Pauline PROBOEUF Oct 29 '19 at 15:00

1 Answer


Rather than use removeWords, I typically use an anti_join() from dplyr to remove all stop words. That means switching from the tm corpus to a tidy (one-word-per-row) data frame first:

library(dplyr)
library(tidytext)

# Tokenize the raw text into a tidy data frame, one word per row
tidy_docs <- tibble(text = text) %>%
  unnest_tokens(output = word, input = text, token = "words")

# Put the custom stop words in a data frame with a matching column name
stop_words_df <- tibble(word = my_stop_words)

# anti_join() drops every row whose word appears in the stop word list
tidy_docs <- anti_join(tidy_docs, stop_words_df, by = "word")

That is, provided the column that contains your tokens is called "word". Hope this helps.
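
Since the end goal here is a wordcloud, one possible follow-up (a sketch; wordcloud() comes from the wordcloud package, which is not shown in the question) is to count the remaining words and plot the frequencies:

library(wordcloud)

# Count the remaining words and feed the frequencies to wordcloud()
word_counts <- count(tidy_docs, word, sort = TRUE)
wordcloud(words = word_counts$word, freq = word_counts$n, max.words = 100)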

Doug
  • Hi Doug! I think what isn't working is that my corpus is really long (it's actually a single very long line in a .txt file), so maybe it's the readLines call that isn't working and not the removeWords function. – Pauline PROBOEUF Oct 29 '19 at 15:30
  • I think you might want to "unnest" your tokens in a tidy format; I edited my answer above to show this. – Doug Oct 29 '19 at 17:32
  • Great! Sorry to be a nerd, but I'm trying to bolster my SO account. Would you mind marking my answer as correct? – Doug Oct 29 '19 at 21:47