DUE: Monday 3/03 11:59pm
HOW TO SUBMIT: Submit the required files for all problems (see WHAT TO SUBMIT under each problem below) through WebSubmit. On the WebSubmit interface, make sure you select compsci290
and the appropriate homework number. You can submit multiple times, but please resubmit files for all problems each time.
WHAT TO SUBMIT: Nothing is required for this part.
Start up your VM and issue the following commands, which will install the mrjob Python package and create a working directory for your assignment:
/opt/datacourse/sync.sh
cp -pr /opt/datacourse/assignments/hw08-template/ ~/hw08/
To get you started, we have provided a sample file tweets_sm.txt
containing a fairly small number of tweets and a simple program hashtag_count.py
that counts the total number of tweets and the total number of uses of hashtags.
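For reference, a minimal mrjob job along these lines might look like the following sketch (the provided hashtag_count.py may differ in its details, and the hashtag regex here is an assumption):

import re
from mrjob.job import MRJob

HASHTAG = re.compile(r'#\w+')  # assumed definition of a hashtag

class MRHashtagCount(MRJob):

    def mapper(self, _, line):
        # One input line is one tweet.
        yield 'tweets', 1
        for _tag in HASHTAG.findall(line):
            yield 'hashtag_uses', 1

    def reducer(self, key, counts):
        # Sum up the 1's emitted for each counter.
        yield key, sum(counts)

if __name__ == '__main__':
    MRHashtagCount.run()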
You can execute the MapReduce program locally on your VM, without using a cluster, as follows:
python hashtag_count.py tweets_sm.txt
As a rule of thumb, you should always test and debug your MapReduce program locally on smaller datasets, before you attempt it on a big cluster on Amazon---it will cost you money!
If you haven't done so, follow the instructions here to sign up for Amazon AWS. You do not need to create an instance at this time; you can ignore that section (and the sections that follow) in the instructions. You will simply use your own VM to program; later on, when you run your code, you can tell it to automatically make use of additional computing resources, remotely, on Amazon AWS.
Next, create a key pair: name it datacourse, and choose Download Key to save the file datacourse.pem on your computer. You then need to copy datacourse.pem into the assignment working directory on your VM and make the key file readable only by you:
chmod go= datacourse.pem
Now, edit the file mrjob.conf
in the assignment working directory on your VM. Replace the first two fields with your own access key id and secret access key. (They are not the key pair you obtained in the last step.) You can find them by going to https://console.aws.amazon.com/iam/home?#security_credential. Look under the section "Access Keys". Be careful to preserve the indentation and spacing in mrjob.conf; it's touchy about that!

Now, you are ready to run a program on Amazon EMR (Elastic MapReduce using Hadoop)! Just type the following command:
python hashtag_count.py -c mrjob.conf -r emr tweets_sm.txt
Tips:
If you get an invalid SSH key error, it might be a region mismatch; key pairs are bound to specific regions. You can check the region in the upper right of the AWS Management Console, and make sure it matches what's specified in mrjob.conf.
You will see lots of diagnostic info flying by. The program will actually take longer to run on Amazon than on your local machine! That's expected, because of the various overheads involved. MapReduce on a cluster is really meant for problems much, much larger than this one.
In mrjob.conf, you can optionally tweak the following (a sketch of the file's layout appears after this list):
- The size of the cluster (num_ec2_instances), i.e., the number of Amazon machines on which to run your program. For massive data you will need a larger cluster to finish in a reasonable amount of time, but don't go overboard: more machines imply more money will be charged to your account.
- The machine type (ec2_instance_type), which is currently set to m1.small. This basic machine should suffice for us; fancier machine types cost more.
- The number of concurrent tasks per machine (under bootstrap_actions). Generally speaking, machines with more cores can run more concurrent tasks. You can leave this out and just trust EMR's default setting.
- The numbers of map and reduce tasks for MapReduce (mapred.map.tasks and mapred.reduce.tasks under jobconf). You can usually leave them unspecified; EMR/Hadoop will pick reasonable defaults based on what it thinks of as a reasonable unit amount of work per task. Even if you specify them, Hadoop may choose to ignore them in some cases, so think of them as only optimization "hints".
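For orientation, mrjob.conf is a YAML file, and its layout is roughly like the sketch below. The exact keys and values in the provided file may differ; the access keys, region, and cluster size shown here are placeholders:

runners:
  emr:
    aws_access_key_id: YOUR_ACCESS_KEY_ID          # placeholder; fill in your own
    aws_secret_access_key: YOUR_SECRET_ACCESS_KEY  # placeholder; fill in your own
    aws_region: us-east-1                          # assumed; must match your key pair's region
    ec2_key_pair: datacourse
    ec2_key_pair_file: datacourse.pem
    ec2_instance_type: m1.small
    num_ec2_instances: 2                           # placeholder cluster size
    jobconf:
      mapred.map.tasks: 8      # optional hint
      mapred.reduce.tasks: 4   # optional hint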
In this exercise, you will write a MapReduce program to find the 50 most popular hashtags from a file containing approximately 3.5 million tweets (a couple of hours in the twitterverse).
We strongly recommend that you first debug locally before running on the full file on Amazon. You can make a copy of the example code:
cp hashtag_count.py topk.py
and edit topk.py
to suit your needs. When running on Amazon, you can use the following files:
Small test file: s3://cs290-spring2014/twitter/tweets_sm.txt
Big file: s3://cs290-spring2014/twitter/tweets.txt
You can give these file URLs to your Python program directly, e.g.:
python topk.py -c mrjob.conf -r emr s3://cs290-spring2014/twitter/tweets.txt
Hint 1: Your approach should work on much bigger datasets than what we are using here. Keep in mind the tips from class about potential problems when computing on data in parallel.
Hint 2: You can create standard Python data structures within MRJob functions or class instances. Keep in mind that these will be stored in memory for each mapper or reducer, so any such structure should be kept small (see the sketch below).
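As a generic illustration of that point (a hypothetical job, not a solution to this problem), a reducer can keep a bounded heap rather than materializing all of its input values:

import heapq
from mrjob.job import MRJob

class MRBoundedReducerExample(MRJob):  # hypothetical job, for illustration only

    def reducer(self, key, values):
        # heapq.nlargest holds at most 10 items in memory at a time,
        # even if millions of values stream through this reducer.
        for v in heapq.nlargest(10, values):
            yield key, v

if __name__ == '__main__':
    MRBoundedReducerExample.run()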
WHAT TO SUBMIT: Submit a plain-text file named topk.txt
with your results. You need to submit the results for the big file (the small file is there in case you need to debug your program). Also submit your code (topk.py), including comments explaining the code.