Use a natural language processing library (instead of the custom method) for preprocessing
Example: https://www.nltk.org
Or https://docs.python.org/3/library/tokenize.html to tokenize (probably less suitable: this module tokenizes Python source code, not natural-language text)
Use the Porter algorithm for stemming (for the moment, "usine" and "usines" are counted as two different words) ==> keep only the word root, "usine" or "usin" depending on the algorithm
Keras tutorial that uses this library: https://www.kaggle.com/jeffd23/visualizing-word-vectors-with-t-sne
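
A minimal sketch of what the tokenize-then-stem step could look like with NLTK, assuming the library is installed. The `preprocess` helper and the sample sentence are hypothetical and not taken from the project. The Snowball French stemmer is shown next to Porter only as a possible alternative, since the Porter algorithm was designed for English:

```python
from nltk.stem import PorterStemmer, SnowballStemmer
from nltk.tokenize import RegexpTokenizer

# Split on runs of word characters; this tokenizer needs no downloaded NLTK data.
tokenizer = RegexpTokenizer(r"\w+")

porter = PorterStemmer()              # Porter algorithm (designed for English)
snowball = SnowballStemmer("french")  # assumed alternative for French text


def preprocess(text, stemmer):
    """Lowercase, tokenize, and stem a piece of text (hypothetical helper)."""
    return [stemmer.stem(token) for token in tokenizer.tokenize(text.lower())]


# Hypothetical sample sentence, not taken from the project's data.
sample = "Une usine, deux usines"
porter_stems = preprocess(sample, porter)
snowball_stems = preprocess(sample, snowball)

print(porter_stems)
print(snowball_stems)
```

With either stemmer, "usine" and "usines" should map to the same root, so they would no longer be counted as two different words. Which root comes out depends on the stemmer chosen.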
Edited by Romain Guillot