DUE: Monday 3/03 11:59pm
HOW TO SUBMIT: Submit the required files for all problems (see WHAT TO SUBMIT under each problem below) through WebSubmit. On the WebSubmit interface, make sure you select compsci290
and the appropriate homework number. You can submit multiple times, but please resubmit files for all problems each time.
WHAT TO SUBMIT: Nothing is required for this part.
Start up your VM and issue the following commands, which will install the mrjob Python package and create a working directory for your assignment:
/opt/datacourse/sync.sh
cp -pr /opt/datacourse/assignments/hw08-template/ ~/hw08/
To get you started, we have provided a sample file tweets_sm.txt
containing a fairly small number of tweets and a simple program hashtag_count.py
that counts the total number of tweets and the total number of uses of hashtags.
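For reference, a minimal mrjob job along these lines might look like the following sketch (the provided hashtag_count.py may differ in its details, and the hashtag regex here is an assumption):

import re
from mrjob.job import MRJob

HASHTAG = re.compile(r'#\w+')  # assumed definition of a hashtag

class MRHashtagCount(MRJob):

    def mapper(self, _, line):
        # One input line is one tweet.
        yield 'tweets', 1
        for _tag in HASHTAG.findall(line):
            yield 'hashtag_uses', 1

    def reducer(self, key, counts):
        # Sum up the 1's emitted for each counter.
        yield key, sum(counts)

if __name__ == '__main__':
    MRHashtagCount.run()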
You can execute the MapReduce program locally on your VM, without using a cluster, as follows:
python hashtag_count.py tweets_sm.txt
As a rule of thumb, you should always test and debug your MapReduce program locally on smaller datasets, before you attempt it on a big cluster on Amazon---it will cost you money!
If you haven't done so, follow the instructions here to sign up for Amazon AWS. You do not need to create an instance at this time; you can ignore that section (and the sections that follow) in the instructions. You will simply use your own VM to program; later on, when you run your code, you can tell it to automatically make use of additional computing resources, remotely, on Amazon AWS.
Next, create a key pair: name it datacourse, and choose Download Key to save the file datacourse.pem on your computer. You then need to copy datacourse.pem into the assignment working directory on your VM and make the key file readable only by you:
chmod go= datacourse.pem
Now, edit the file mrjob.conf
in the assignment working directory on your VM. Replace the first two fields with your own access key id and secret access key. (They are not the key pair you obtained in the last step.) You can find them by going to https://console.aws.amazon.com/iam/home?#security_credential. Look under the section "Access Keys". Be careful to preserve the indentation and spacing in mrjob.conf; it's touchy about that!

Now, you are ready to run a program on Amazon EMR (Elastic MapReduce using Hadoop)! Just type the following command:
python hashtag_count.py -c mrjob.conf -r emr tweets_sm.txt
Tips:
If you get an invalid SSH key error, it might be a region mismatch; key pairs are bound to specific regions. You can check the region in the upper right of the AWS Management Console, and make sure it matches what's specified in mrjob.conf.
You will see lots of diagnostic info flying by. The program will actually take longer to run on Amazon than on your local machine! That's expected, because of the various overheads involved. MapReduce on a cluster is really meant for problems much, much larger than this one.
In mrjob.conf, you can optionally tweak the following (a sketch of the file's layout appears after this list):
- The size of the cluster (num_ec2_instances), i.e., the number of Amazon machines on which to run your program. For massive data you will need a larger cluster to finish in a reasonable amount of time, but don't go overboard: more machines imply more money will be charged to your account.
- The machine type (ec2_instance_type), which is currently set to m1.small. This basic machine should suffice for us; fancier machine types cost more.
- The number of concurrent tasks per machine (under bootstrap_actions). Generally speaking, machines with more cores can run more concurrent tasks. You can leave this out and just trust EMR's default setting.
- The numbers of map and reduce tasks for MapReduce (mapred.map.tasks and mapred.reduce.tasks under jobconf). You can usually leave them unspecified; EMR/Hadoop will pick reasonable defaults based on what it thinks of as a reasonable unit amount of work per task. Even if you specify them, Hadoop may choose to ignore them in some cases, so think of them as only optimization "hints".
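For orientation, mrjob.conf is a YAML file, and its layout is roughly like the sketch below. The exact keys and values in the provided file may differ; the access keys, region, and cluster size shown here are placeholders:

runners:
  emr:
    aws_access_key_id: YOUR_ACCESS_KEY_ID          # placeholder; fill in your own
    aws_secret_access_key: YOUR_SECRET_ACCESS_KEY  # placeholder; fill in your own
    aws_region: us-east-1                          # assumed; must match your key pair's region
    ec2_key_pair: datacourse
    ec2_key_pair_file: datacourse.pem
    ec2_instance_type: m1.small
    num_ec2_instances: 2                           # placeholder cluster size
    jobconf:
      mapred.map.tasks: 8      # optional hint
      mapred.reduce.tasks: 4   # optional hint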
In this exercise, you will write a MapReduce program to find the 50 most popular hashtags from a file containing approximately 3.5 million tweets (a couple of hours in the twitterverse).
We strongly recommend that you first debug locally before running on the full file on Amazon. You can make a copy of the example code:
cp hashtag_count.py topk.py
and edit topk.py
to suit your needs. When running on Amazon, you can use the following files:
Small test file: s3://cs290-spring2014/twitter/tweets_sm.txt
Big file: s3://cs290-spring2014/twitter/tweets.txt
You can give these file URLs to your Python program directly, e.g.:
python topk.py -c mrjob.conf -r emr s3://cs290-spring2014/twitter/tweets.txt
Hint 1: Your approach should work on much bigger datasets than what we are using here. Keep in mind the tips from class about potential problems when computing on data in parallel.
Hint 2: You can create standard Python data structures within MRJob functions or class instances. Keep in mind that these will be stored in memory for each mapper or reducer, so any such structure should be kept small (see the sketch below).
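As a generic illustration of that point (a hypothetical job, not a solution to this problem), a reducer can keep a bounded heap rather than materializing all of its input values:

import heapq
from mrjob.job import MRJob

class MRBoundedReducerExample(MRJob):  # hypothetical job, for illustration only

    def reducer(self, key, values):
        # heapq.nlargest holds at most 10 items in memory at a time,
        # even if millions of values stream through this reducer.
        for v in heapq.nlargest(10, values):
            yield key, v

if __name__ == '__main__':
    MRBoundedReducerExample.run()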
WHAT TO SUBMIT: Submit a plain-text file named topk.txt
with your results. You need to submit the results for the big file (the small file is there in case you need to debug your program). Also submit your code (topk.py), including comments explaining the code.