## Week 10: Machine Learning
Topics to Cover
- preprocessing for text machine learning:
  - removing stop words
  - lemmatizing
- classifying texts using scikit-learn
- author attribution
- cluster analysis
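
The two preprocessing steps listed above can be sketched roughly as follows. This is a minimal illustration, not a prescribed method: it uses scikit-learn's built-in English stop-word list, and `TOY_LEMMAS` is a hypothetical hand-written lookup table that stands in for a real lemmatizer (in class, a library such as NLTK or spaCy would likely fill that role).

```python
import re

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# Hypothetical toy lemma table, for illustration only.
# A real lemmatizer would replace this lookup.
TOY_LEMMAS = {"cats": "cat", "running": "run", "mice": "mouse"}


def preprocess(text):
    """Lowercase, tokenize, drop stop words, then map tokens to lemmas."""
    tokens = re.findall(r"[a-z']+", text.lower())
    content_words = [t for t in tokens if t not in ENGLISH_STOP_WORDS]
    return [TOY_LEMMAS.get(t, t) for t in content_words]


tokens = preprocess("The cats were running after the mice")
print(tokens)
```

Words missing from the toy table pass through unchanged, which makes it easy to see in class what a real lemmatizer adds.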
Provide a collection of several hundred texts grouped by genre:
- news articles
- blog posts
- literary prose
- poetry
- scientific articles
- spam emails
Have students choose two or three categories to work with.
Using scikit-learn, train an ML model on a small number of texts and measure its classification accuracy on the remaining texts.
Train the model on more texts and see whether accuracy improves.
Compare models.
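
A sketch of the small-versus-large training comparison, under stated assumptions: the sixteen sentences are an invented toy corpus standing in for the real collection, and the TF-IDF plus logistic regression pipeline is just one reasonable choice of model. `accuracy_for_train_size` is a hypothetical helper name.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Invented toy corpus; the class collection would replace these lists.
news = [
    "The city council approved the new budget on Tuesday.",
    "Police reported two arrests after the downtown protest.",
    "The central bank raised interest rates by a quarter point.",
    "Officials said the bridge will reopen to traffic next month.",
    "The governor signed the education bill into law on Friday.",
    "Stock markets closed lower after the earnings report.",
    "The mayor announced a plan to expand public transit.",
    "Voters will decide on the school funding measure in November.",
]
poetry = [
    "Soft the moon upon the silent, silver sea.",
    "My heart, a lantern swinging in the dark.",
    "O wind, carry my sorrow over the hills.",
    "The roses weep their petals into dusk.",
    "Beneath the willow, dreams of summer sleep.",
    "Thy gentle voice is music to my soul.",
    "Stars fall like tears across the winter night.",
    "And in the dawn my weary spirit sings.",
]
texts = news + poetry
labels = ["news"] * len(news) + ["poetry"] * len(poetry)


def accuracy_for_train_size(n_train, seed=0):
    """Train on n_train texts and score accuracy on all remaining texts."""
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, train_size=n_train, stratify=labels, random_state=seed
    )
    model = make_pipeline(
        TfidfVectorizer(stop_words="english"),
        LogisticRegression(max_iter=1000),
    )
    model.fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))


small = accuracy_for_train_size(4)
large = accuracy_for_train_size(12)
print(f"trained on 4 texts:  accuracy {small:.2f}")
print(f"trained on 12 texts: accuracy {large:.2f}")
```

On a corpus this small the scores are noisy and depend on the random split, so students should repeat the comparison with several `seed` values before drawing conclusions.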
Look at misclassified texts and discuss which features make them outliers.
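
One way to pull out the texts a model gets wrong is sketched below. The training and held-out sentences are invented examples, the bag-of-words naive Bayes pipeline is only an assumption about the model in use, and `misclassified` is a hypothetical helper.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def misclassified(texts, true_labels, predicted_labels):
    """Return (text, true label, predicted label) for each wrong prediction."""
    return [
        (text, true, pred)
        for text, true, pred in zip(texts, true_labels, predicted_labels)
        if true != pred
    ]


# Invented toy data; the class collection would replace these lists.
train_texts = [
    "Win a free prize now, click here to claim your cash",
    "Cheap pills, limited offer, buy now and save money",
    "You have been selected for a free cash reward",
    "We measured the reaction rate at three temperatures",
    "The results suggest a significant effect on cell growth",
    "Samples were analyzed using mass spectrometry",
]
train_labels = ["spam", "spam", "spam", "science", "science", "science"]

held_out_texts = [
    "Claim your free money now",
    "The effect was measured in all samples",
    "Free samples of our results, click to download the analysis",
]
held_out_labels = ["spam", "science", "spam"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)
predictions = model.predict(held_out_texts)

errors = misclassified(held_out_texts, held_out_labels, predictions)
for text, true, pred in errors:
    print(f"true={true} predicted={pred}: {text}")
```

The third held-out text deliberately mixes vocabulary from both categories, which makes it a useful discussion case whether or not the model happens to get it right.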
Break
Demonstrate cluster analysis.
Sentiment analysis: Evaluate/classify Twitter data.