
I'm using the R tm package for text analysis on a Facebook group, and I find that the removeWords function isn't working for me. I tried to combine the French stopwords with my own, but they still appear in the output. So I created a file named "french.txt" with my own list and read it with the following commands:

nom_fichier <- "Analyse textuelle/french.txt"
my_stop_words <- readLines(nom_fichier, encoding="UTF-8")
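
For reference, the combination I attempted was along these lines (a sketch; stopwords("french") is the built-in list that ships with tm):

# Sketch: tm's built-in French list combined with my own custom words
combined_stop_words <- c(stopwords("french"), my_stop_words)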

Here is the data for text mining:

text <- readLines(groupe_fb_ief, encoding="UTF-8")
docs <- Corpus(VectorSource(text))
inspect(docs) 

Here are the tm_map commands:

docs <- tm_map(docs, tolower)
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removeWords, my_stop_words)
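
One possible culprit, offered as a guess rather than something from the post: recent versions of tm expect base functions like tolower to be wrapped in content_transformer(), and removeWords matches case-sensitively, so a lowercased corpus needs a lowercased stop word list. A sketch:

docs <- tm_map(docs, content_transformer(tolower))  # wrap base functions for tm
my_stop_words <- tolower(my_stop_words)             # removeWords is case sensitive
docs <- tm_map(docs, removeWords, my_stop_words)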

Applying all of that, it still isn't working and I don't understand why. I even tried to change the order of the commands, with no result.

Do you have any idea? Is it possible to change the French stopwords within R? Where is this list located?
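
As far as I can tell (still unverified), tm ships its stop word lists as plain text files inside the installed package, and the French list can be loaded and extended as an ordinary character vector instead of editing the file:

library(tm)
system.file("stopwords", package = "tm")  # directory holding the bundled lists
fr <- stopwords("french")                 # the French list as a character vector
combined <- c(fr, my_stop_words)          # extend the vector rather than the file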

Thanks!!

  • What does "still not working" mean exactly? Are you getting a proper corpus but it still contains all the words? It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. – MrFlick Oct 28 '19 at 20:25
  • Hi, thank you for your answer. It means that when I generate the wordcloud, the words contained in my_stop_words still appear, whereas they should have been removed by the removeWords function. – Pauline PROBOEUF Oct 29 '19 at 15:00

1 Answer


Rather than use removeWords, I typically use an anti_join() from dplyr to remove all stop words. That means switching from the tm corpus to a tidy (one-word-per-row) data frame first:

library(dplyr)
library(tidytext)

# Tokenize the raw text into a tidy data frame, one word per row
tidy_docs <- tibble(text = text) %>%
  unnest_tokens(output = word, input = text, token = "words")

# Put the custom stop words in a data frame with a matching column name
stop_words_df <- tibble(word = my_stop_words)

# anti_join() drops every row whose word appears in the stop word list
tidy_docs <- anti_join(tidy_docs, stop_words_df, by = "word")

That is, provided the column that contains your tokens is called "word". Hope this helps.
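
Since the end goal here is a wordcloud, one possible follow-up (a sketch; wordcloud() comes from the wordcloud package, which is not shown in the question) is to count the remaining words and plot the frequencies:

library(wordcloud)

# Count the remaining words and feed the frequencies to wordcloud()
word_counts <- count(tidy_docs, word, sort = TRUE)
wordcloud(words = word_counts$word, freq = word_counts$n, max.words = 100)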

Doug
  • Hi Doug! I think what isn't working is that my corpus is really long (it's actually a single very long line in a .txt file), so maybe it's the readLines call that isn't working and not the removeWords function. – Pauline PROBOEUF Oct 29 '19 at 15:30
  • I think you might want to "unnest" your tokens in a tidy format; I edited my answer above to show this. – Doug Oct 29 '19 at 17:32
  • Great! Sorry to be a nerd, but I'm trying to bolster my SO account. Would you mind marking my answer as correct? – Doug Oct 29 '19 at 21:47