fastTextR - Word representations

fastTextR is an R interface to the fastText library. It can be used to word representation learning (Bojanowski et al., 2016) and supervised text classification (Joulin et al., 2016). Particularly the advantage of fastText to other software is that, it was designed for biggish data.

The following example is based on the examples provided in the fastText library, the example shows how to use fastTextR for word representation. For more informations about word representations can be found at the fastText homepage.

library("fastTextR")

Load pretrained model

The training of these models can be quite time consuming therefore pre-trained models are a good option.

model <- ft_load("cc.en.300.bin")

Printing word vectors

ft_word_vectors(model, c("asparagus", "pidgey", "yellow"))[,1:5]

##                   [,1]          [,2]         [,3]       [,4]         [,5]
## asparagus 0.0292057190 -0.0114405714 -0.003201437 0.03087331  0.127229080
## pidgey    0.0452978685  0.0090015158  0.067562237 0.11123407 -0.008441916
## yellow    0.0007776691 -0.0001886144  0.001824494 0.03869999  0.036413591

Printing sentence vectors

ft_sentence_vectors(model, c("Poets have been mysteriously silent on the subject of cheese", "Who did not let the gorilla into the ballet"))[,1:5]

???

Nearest neighbor queries

ft_nearest_neighbors(model, 'asparagus', k = 5L)

##   aspargus broccolini artichokes asparagus.  asparagas 
##  0.7316202  0.6995656  0.6930545  0.6915916  0.6911229

Word analogies

ft_analogies(model, c("berlin", "germany", "france"))

##        paris      france.      avignon  montpellier       paris. 
##    0.6831182    0.6408537    0.6288283    0.6138449    0.6059716 
##       rennes       london       Paris.       toulon montparnasse 
##    0.5884554    0.5832924    0.5743204    0.5727922    0.5715630

References

[1] P. Bojanowski, E. Grave, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword Information

@article{bojanowski2016enriching,
  title={Enriching Word Vectors with Subword Information},
  author={Bojanowski, Piotr and Grave, Edouard and Joulin, Armand and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1607.04606},
  year={2016}
}

[2] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text Classification

@article{joulin2016bag,
  title={Bag of Tricks for Efficient Text Classification},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1607.01759},
  year={2016}
}