I watched a scary movie last night and couldn’t get to sleep because I was so afraid
I went out with my friends to a bar and slept poorly because I drank too much
How are these sentences represented with bag of words?
binary, count, TF, TF-IDF (a short sketch follows this list)
What are the problems with these approaches?
What is the benefit of stemming, stop-word removal, etc. with BoW (and n-grams)?
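A minimal sketch (assuming scikit-learn is available; the library choice is mine, not stated in the notes) of the binary and count representations for the two example sentences, plus the effect of dropping stop words:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "I watched a scary movie last night and couldn't get to sleep because I was so afraid",
    "I went out with my friends to a bar and slept poorly because I drank too much",
]

# Count representation: one column per vocabulary word, values are raw counts.
count_vec = CountVectorizer()
print(count_vec.fit_transform(docs).toarray())
print(count_vec.get_feature_names_out())

# Binary representation: 1 if the word occurs in the document, 0 otherwise.
print(CountVectorizer(binary=True).fit_transform(docs).toarray())

# Removing English stop words shrinks the vocabulary and keeps more informative terms.
print(CountVectorizer(stop_words="english").fit_transform(docs).toarray())
```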
TF - number of times a word/term appears in a document divided by the total number of words in that document
IDF - log(number of documents in the corpus divided by the number of documents containing the word/term)
TF-IDF = TF * IDF (worked example below)
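A small worked example in Python applying these definitions directly to the two example sentences at the top of the section (note that libraries such as scikit-learn use smoothed and normalized variants, so their numbers differ slightly):

```python
import math

docs = [
    "I watched a scary movie last night and couldn't get to sleep because I was so afraid".lower().split(),
    "I went out with my friends to a bar and slept poorly because I drank too much".lower().split(),
]

def tf(term, doc):
    # term frequency: times the term appears in the doc / total words in the doc
    return doc.count(term) / len(doc)

def idf(term, corpus):
    # inverse document frequency: log(number of docs / number of docs containing the term)
    return math.log(len(corpus) / sum(1 for doc in corpus if term in doc))

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

print(tf_idf("scary", docs[0], docs))    # only in the first doc -> positive weight
print(tf_idf("because", docs[0], docs))  # in both docs -> idf = log(1) = 0
```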
How are n-grams different?
capture some context/word order
but still no relationship/meaning or similarity between terms
even higher dimensionality (see the sketch below)
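A short sketch (again assuming scikit-learn) of how adding bigrams captures some order while growing the vocabulary:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "I watched a scary movie last night",
    "I went out with my friends to a bar",
]

unigrams = CountVectorizer(ngram_range=(1, 1)).fit(docs)
bigrams = CountVectorizer(ngram_range=(1, 2)).fit(docs)

# The bigram vocabulary is noticeably larger: some order is captured,
# but dimensionality grows and terms still have no notion of similarity.
print(len(unigrams.vocabulary_), len(bigrams.vocabulary_))
```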
Tausczik, Y. R., & Pennebaker, J. W. (2010). The psychological meaning of words: LIWC and computerized text analysis methods. Journal of Language and Social Psychology, 29(1), 24–54.
Meaningful features? (based on domain expertise)
Includes sentiment
Lower dimensional
Very limited breadth
Can add domain-specific dictionaries (sketched below)
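A hand-rolled sketch of dictionary-based features in the spirit of LIWC; the categories and word lists here are hypothetical placeholders, not the actual (proprietary) LIWC dictionaries:

```python
# Hypothetical dictionaries standing in for LIWC-style categories.
dictionaries = {
    "negative_emotion": {"afraid", "scary", "poorly"},
    "social": {"friends", "bar"},
}

def dictionary_features(text, dictionaries):
    tokens = text.lower().split()
    # Proportion of tokens falling in each category: a low-dimensional,
    # interpretable feature vector built from domain expertise.
    return {cat: sum(t in words for t in tokens) / len(tokens)
            for cat, words in dictionaries.items()}

print(dictionary_features(
    "I watched a scary movie last night and couldn't get to sleep because I was so afraid",
    dictionaries,
))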
word2vec (Google) for embeddings: CBOW and skip-gram
Components of word2vec
fastText (Facebook)
GloVe (Stanford)
(gensim sketch below)
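A minimal word2vec sketch using gensim (library choice assumed; the tiny corpus and parameters are placeholders, real training needs far more text):

```python
from gensim.models import Word2Vec

sentences = [
    "i watched a scary movie last night".split(),
    "i went out with my friends to a bar".split(),
]

# sg=0 -> CBOW (predict the center word from its context);
# sg=1 -> skip-gram (predict the context from the center word).
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(cbow.wv["scary"].shape)             # 50-dimensional dense vector
print(skipgram.wv.most_similar("movie"))  # nearest neighbours in embedding space
```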
BERT is pre-trained on a large corpus of text using two main tasks:
Masked Language Model (MLM): Randomly masks some of the words in the input text and trains the model to predict the masked words from the surrounding context. This helps BERT learn bidirectional representations by considering both the left and the right context.
Next Sentence Prediction (NSP): Trains the model to predict whether a given pair of sentences is consecutive in the original text. This helps BERT understand the relationships between sentences.
Uses 12 hidden layers (transformer blocks) with 768 units each (BERT-base)
Uses 12 attention heads per layer (BERT uses self-attention to weigh the importance of different words in a sentence, so the model understands each word's context by considering its relationships with the other words); see the sketch below
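A sketch of the masked language modelling objective using the Hugging Face transformers library (assumed available) with bert-base-uncased, which matches the 12-layer / 768-hidden / 12-head configuration noted above:

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT predicts the masked token from both the left and the right context.
for prediction in fill_mask("I watched a scary movie and could not get to [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```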
Advantages