Run Examine and Train

To run this program, the following argument list is required:

  1. YOUR_TWEET_INPUT - This is the file pattern for input tweets.
  • OUTPUT_MODEL_DIR - This is the directory to persist the model.
  • NUM_CLUSTERS - The number of clusters the algorithm should create.
  • NUM_ITERATIONS - The number of iterations the algorithm should run for.

Here is an example command to run ExamineAndTrain.scala:

  1. % ${YOUR_SPARK_HOME}/bin/spark-submit \
  2. --class "com.databricks.apps.twitter_classifier.ExamineAndTrain" \
  3. --master ${YOUR_SPARK_MASTER:-local[4]} \
  4. target/scala-2.10/spark-twitter-lang-classifier-assembly-1.0.jar \
  5. "${YOUR_TWEET_INPUT:-/tmp/tweets/tweets*/part-*}" \
  6. ${OUTPUT_MODEL_DIR:-/tmp/tweets/model} \
  7. ${NUM_CLUSTERS:-10} \
  8. ${NUM_ITERATIONS:-20}