
I am trying to compute the probability that a document belongs to each topic found by the LDA model. I have succeeded in producing the LDA model, but now I am stuck. My code is as follows:

## Libraries to download
import json

import pandas
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from gensim import corpora, models
import gensim

## Tokenizing
tokenizer = RegexpTokenizer(r'\w+')

# create English stop word set (set membership tests are faster than a list)
en_stop = set(stopwords.words('english'))

# Create p_stemmer of class PorterStemmer
p_stemmer = PorterStemmer()

appended_data = []
for i in range(2005, 2016):
    # SDM files are only read for years after 2013
    if i > 2013:
        df0 = pandas.DataFrame([json.loads(l) for l in open('SDM_%d.json' % i)])
        appended_data.append(df0)
    # the remaining four sources are read for every year
    df1 = pandas.DataFrame([json.loads(l) for l in open('Scot_%d.json' % i)])
    df2 = pandas.DataFrame([json.loads(l) for l in open('APJ_%d.json' % i)])
    df3 = pandas.DataFrame([json.loads(l) for l in open('TH500_%d.json' % i)])
    df4 = pandas.DataFrame([json.loads(l) for l in open('DRSM_%d.json' % i)])
    appended_data.extend([df1, df2, df3, df4])

appended_data = pandas.concat(appended_data)
doc_set = appended_data.body

# list for tokenized documents in loop
texts = []

# loop through document list
for doc in doc_set:

    # clean and tokenize document string
    raw = doc.lower()
    tokens = tokenizer.tokenize(raw)

    # remove stop words from tokens
    stopped_tokens = [token for token in tokens if token not in en_stop]

    # add tokens to list
    texts.append(stopped_tokens)

# turn our tokenized documents into an id <-> term dictionary
dictionary = corpora.Dictionary(texts)

# convert tokenized documents into a document-term matrix
corpus = [dictionary.doc2bow(text) for text in texts]

# generate LDA model
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=15, id2word=dictionary, passes=50)
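
As a quick sanity check of the fitted model (not part of the pipeline above), something like the following should list the top words of each topic:

# inspect the top words per topic to confirm the model trained as expected
for topic in ldamodel.print_topics(num_topics=15, num_words=5):
    print(topic)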

I am trying to follow the method described here, but I find it confusing. For example, when I try the following code:

from itertools import chain

# Assigning the topics to the documents in the corpus
lda_corpus = ldamodel[corpus]

# Find the threshold; let's set the threshold to be 1/#clusters.
# To prove that the threshold is sane, we average the sum of all probabilities:
scores = list(chain(*[[score for topic_id, score in topic]
                      for topic in lda_corpus]))

threshold = sum(scores)/len(scores)
print(threshold)

cluster1 = [doc for topics, doc in zip(lda_corpus, doc_set) if topics[0][1] > threshold]

print(cluster1)

It seems to work, since it retrieves the articles that belong to topic 1. Nevertheless, can someone explain the intuition behind this, and whether there are alternatives? For example, what is the intuition behind the threshold level here? Thanks.

Economist_Ayahuasca
    "I find it confusing" doesn't really describe the problem; it leaves us at the point of writing a tutorial on one stage of document classification, which is beyond the scope of StackOverflow. If you can get detailed about where you're lost, perhaps someone can help get you through that block. – Prune May 26 '16 at 17:57
  • 1
    Absolutely right, I have updated the question to make it more specific. Cheers – Economist_Ayahuasca May 26 '16 at 18:17
  • 2
    Good job; close vote retracted. Thanks! – Prune May 26 '16 at 18:29

1 Answer

As I hope you've read elsewhere, the threshold is an application-specific setting, depending on how broad you want your classification to be. The 1/k (for k clusters) rationale is empirical: it works as a starting point (i.e. it yields recognizably useful results) for most classification tasks.
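To make that concrete, here is a minimal sketch against the 15-topic gensim model in your question; the choice of topic_id and the use of get_document_topics are mine, not from your code:

# a uniform spread over k topics gives each topic a weight of 1/k, so a
# topic that beats 1/k in a document is doing better than chance there
num_topics = 15
threshold = 1.0 / num_topics  # 1/15, roughly 0.067

topic_id = 0  # whichever topic/cluster you are collecting
cluster = [doc for bow, doc in zip(corpus, doc_set)
           if dict(ldamodel.get_document_topics(bow)).get(topic_id, 0.0) > threshold]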

The gut-level rationale is simple enough: if a document is matched to a topic strongly enough to outshine the chance of a random cluster placement, it's likely a positive identification. Of course, you have to tune "likely" once you get your first results.

Most notably, you want to watch for one or two "noisy" clusters: those in which the topics are only loosely related, so the cluster's standard deviation is larger than most of the others. Some applications compute Z-scores for the topics and apply a Z-based threshold to each topic; others use a generic threshold for all the topics in a given cluster.
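
Here is a sketch of the per-topic Z-score variant. The dense document-by-topic matrix and the cutoff of one standard deviation are illustrative assumptions, not a fixed recipe:

import numpy as np

# build a dense document-by-topic matrix of probabilities
num_topics = ldamodel.num_topics
doc_topic = np.zeros((len(corpus), num_topics))
for d, bow in enumerate(corpus):
    for topic_id, score in ldamodel.get_document_topics(bow, minimum_probability=0.0):
        doc_topic[d, topic_id] = score

# standardize each topic's column: how far above the topic's own mean
# weight does this document sit, in units of that topic's spread?
z = (doc_topic - doc_topic.mean(axis=0)) / (doc_topic.std(axis=0) + 1e-12)

# accept a (document, topic) pair only when the weight is well above the
# topic's mean; noisy, high-variance topics automatically demand more
z_threshold = 1.0
assignments = [(d, t) for d in range(len(corpus))
               for t in range(num_topics) if z[d, t] > z_threshold]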

Your final solution depends on your required strength of match (lower thresholds), topic variation (topic-specific thresholds), required accuracy (what are the costs of false positive and false negative?) and desired training & scoring speeds.

Is this enough help to move you along?

Prune
  • Great @Prune, strongly appreciated! – Economist_Ayahuasca May 27 '16 at 07:41
  • Another question @Prune: do you happen to know the way to calculate the weight/percentage of each topic in the total corpus? Sorry for any inconvenience. – Economist_Ayahuasca May 27 '16 at 15:22
  • That depends on what you mean. I suspect that what you want is [number of papers in which the topic appears] / [total number of papers in the corpus]; see the sketch after these comments. – Prune May 27 '16 at 19:18
  • Yes, that's it, and the way you have described it is, I guess, the approach to finding it. I thought maybe there was a command that already calculated the percentage of each topic in the corpus. Cheers and thanks once again! – Economist_Ayahuasca May 27 '16 at 19:47
  • Ah ... sorry. I'm not as familiar with the Pandas facilities. My experience is with the Trusted Analytics Platform, where the tool kit (ATK) returns the weights as part of the LDA modelling of the corpus. I doubt that the overhead of installing TAP would be worth the trouble for your application. – Prune May 27 '16 at 22:37
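
For completeness, here is a rough sketch of the count Prune describes, reusing corpus, ldamodel and threshold from the question, and assuming that "appears" means the topic's weight in a document exceeds that threshold (that reading is an assumption):

from collections import Counter

# count, per topic, the documents in which the topic appears above the threshold
topic_doc_counts = Counter()
for bow in corpus:
    for topic_id, score in ldamodel.get_document_topics(bow):
        if score > threshold:
            topic_doc_counts[topic_id] += 1

# share of the corpus in which each topic appears
total_docs = len(corpus)
for topic_id, count in sorted(topic_doc_counts.items()):
    print('topic %d: %.1f%% of documents' % (topic_id, 100.0 * count / total_docs))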