Yahoo! 4000 node cluster, sorted 1TB random integers in 62 seconds using MapReduce
public class MapClass extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(LongWritable key, Text value,
OutputCollector<Text, IntWritable> output,
Reporter reporter) throws IOException {
String line = value.toString();
StringTokenizer itr = new StringTokenizer(line);
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
output.collect(word, one);
}
}
}
public class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterator<IntWritable> values,
OutputCollector<Text, IntWritable> output,
Reporter reporter) throws IOException {
int sum = 0;
while (values.hasNext()) {
sum += values.next().get();
}
output.collect(key, new IntWritable(sum));
}
}
public class WordCount {
public static void main(String[] args) throws Exception {
JobConf conf = new JobConf(WordCount.class);
conf.setJobName("wordcount");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(Map.class);
conf.setCombinerClass(Reduce.class);
conf.setReducerClass(Reduce.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
JobClient.runJob(conf);
}
}
Developed and used as primary platform by Facebook
Developed and used as primary platform by Yahoo!
lines = LOAD '../data/words.txt' USING TextLoader() AS (sentence:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(sentence)) AS word;
groupedWords = GROUP words BY word;
counts = FOREACH groupedWords GENERATE group, COUNT(words);
STORE counts INTO 'output/wordcounts' USING PigStorage();
Where words.txt is:
Deer Bear River
Car Car River
Deer Car Bear
Generates:
Car 3
Bear 2
Deer 2
River 2
create table textlines(text string);
load data local inpath 'C:\work\ClearPoint\Data20\data\words.txt' overwrite into table textlines;
create table words(word string);
insert overwrite table words select explode(split(text, '[ \t]+')) word from textlines;
select word, count(*) from words group by word;
Most of the following examples use a sample of the MovieLens, movie recommendation data http://www.grouplens.org/node/73. The sample contains 100,000 ratings for 1682 movies by 943 users.
Ratings in tab separated list of user id, item id, rating, timestamp
196 242 3 881250949
186 302 3 891717742
22 377 1 878887116
Movie data, | separated list of movie id, movie title, release date, video release date,IMDb URL, unknown, Action, Adventure, Animation, Children's, Comedy, Crime, Documentary, Drama, Fantasy, Film-Noir,Horror, Musical, Mystery, Romance, Sci-Fi, Thriller, War, Western
1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0
2|GoldenEye (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?GoldenEye%20(1995)|0|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
3|Four Rooms (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Four%20Rooms%20(1995)|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
User demographic data, | separated list of user id, age, gender, occupation, zip code
1|24|M|technician|85711
2|53|F|other|94043
3|23|M|writer|32067
users = LOAD '../data/u.user' USING PigStorage('|') AS (userId,age,gender,occupation,zipCode);
allUsers = GROUP users ALL;
stats = FOREACH allUsers GENERATE COUNT(users), AVG(users.age), SUM(users.age);
byGender = GROUP users BY gender;
genderStats = FOREACH byGender GENERATE group, COUNT(users), AVG(users.age);
programmers = FILTER users BY occupation == 'programmer';
progsByAge = GROUP programmers BY age;
progCountsByAge = FOREACH progsByAge GENERATE group AS age, COUNT(programmers) as NumProgs;
progsCountsByAgeSorted = ORDER progCountsByAge BY NumProgs DESC;
STORE stats INTO 'output/stats' USINg PigStorage('\t');
STORE genderStats INTO 'output/genderStats' USING PigStorage('\t');
STORE progsCountsByAgeSorted INTO 'output/progsCountsByAgeSorted' USING PigStorage('\t');
943 34.05196182396607 32111.0
F 273 33.81318681318681
M 670 34.149253731343286
27 7
30 5
25 5
28 4
21 3
35 3
23 3
29 3
38 3
41 2
24 2
46 2
49 2
26 2
31 2
32 2
33 2
37 2
40 2
20 1
34 1
36 1
42 1
44 1
50 1
52 1
53 1
60 1
63 1
votes = LOAD '../data/u.data' USING PigStorage('\t') AS (userId,itemId,rating,timestamp);
movies = LOAD '../data/u.item' USING PigStorage('|') AS (movieId,movieTitle,releaseDate,videoReleaseDate,imdbURL,unknown,
Action,Adventure,Animation,Childrens,Comedy,Crime,Documentary,Drama,Fantasy,FilmNoir,Horror,Musical,Mystery,Romance,
SciFi,Thriller,War,Western);
movieVotesGroup = GROUP votes BY itemId;
movieVotes = FOREACH movieVotesGroup GENERATE FLATTEN(group) AS movieId,
AVG(votes.rating) AS avgRating, COUNT(votes) AS numVotes;
moviesWithVotes = JOIN movieVotes BY movieId,movies BY movieId;
moviesWithVotesSorted = ORDER moviesWithVotes BY movieVotes::avgRating DESC;
top25 = LIMIT moviesWithVotesSorted 25;
STORE top25 INTO 'output/topMovies' USING PigStorage('\t');
1653 5.0 1 1653 Entertaining Angels: The Dorothy Day Story (1996) 27-Sep-1996 http://us.imdb.com/M/title-exact?Entertaining%20Angels:%20The%20Dorothy%20Day%20Story%20(1996) 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
1293 5.0 3 1293 Star Kid (1997) 16-Jan-1998 http://us.imdb.com/M/title-exact?imdb-title-120478 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0
1467 5.0 2 1467 Saint of Fort Washington, The (1993) 01-Jan-1993 http://us.imdb.com/M/title-exact?Saint%20of%20Fort%20Washington,%20The%20(1993) 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
814 5.0 1 814 Great Day in Harlem, A (1994) 01-Jan-1994 http://us.imdb.com/M/title-exact?Great%20Day%20in%20Harlem,%20A%20(1994) 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
1500 5.0 2 1500 Santa with Muscles (1996) 08-Nov-1996 http://us.imdb.com/M/title-exact?Santa%20with%20Muscles%20(1996) 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
1201 5.0 1 1201 Marlene Dietrich: Shadow and Light (1996) 02-Apr-1996 http://us.imdb.com/M/title-exact?Marlene%20Dietrich:%20Shadow%20and%20Light%20(1996)%20(TV) 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
1122 5.0 1 1122 They Made Me a Criminal (1939) 01-Jan-1939 http://us.imdb.com/M/title-exact?They%20Made%20Me%20a%20Criminal%20(1939) 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0
1189 5.0 3 1189 Prefontaine (1997) 24-Jan-1997 http://us.imdb.com/M/title-exact?Prefontaine%20(1997) 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
1599 5.0 1 1599 Someone Else's America (1995) 10-May-1996 http://us.imdb.com/M/title-exact?Someone%20Else's%20America%20(1995) 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
1536 5.0 1 1536 Aiqing wansui (1994) 22-Jul-1996 http://us.imdb.com/M/title-exact?Aiqing%20Wansui%20(1994) 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
1449 4.625 8 1449 Pather Panchali (1955) 22-Mar-1996 http://us.imdb.com/M/title-exact?Pather%20Panchali%20(1955) 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
1594 4.5 2 1594 Everest (1998) 10-Mar-1998 http://us.imdb.com/Title?Everest+(1998) 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
119 4.5 4 119 Maya Lin: A Strong Clear Vision (1994) 01-Jan-1994 http://us.imdb.com/M/title-exact?Maya%20Lin:%20A%20Strong%20Clear%20Vision%20(1994) 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
1398 4.5 2 1398 Anna (1996) 13-Nov-1996 http://us.imdb.com/M/title-exact?Anna%20(1996) 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
1642 4.5 2 1642 Some Mother's Son (1996) 27-Dec-1996 http://us.imdb.com/M/title-exact?Some%20Mother's%20Son%20(1996) 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
408 4.491071428571429 112 408 Close Shave, A (1995) 28-Apr-1996 http://us.imdb.com/M/title-exact?Close%20Shave,%20A%20(1995) 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0
318 4.466442953020135 298 318 Schindler's List (1993) 01-Jan-1993 http://us.imdb.com/M/title-exact?Schindler's%20List%20(1993) 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0
169 4.466101694915254 118 169 Wrong Trousers, The (1993) 01-Jan-1993 http://us.imdb.com/M/title-exact?Wrong%20Trousers,%20The%20(1993) 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
483 4.45679012345679 243 483 Casablanca (1942) 01-Jan-1942 http://us.imdb.com/M/title-exact?Casablanca%20(1942) 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 0
114 4.447761194029851 67 114 Wallace & Gromit: The Best of Aardman Animation (1996) 05-Apr-1996 http://us.imdb.com/Title?Wallace+%26+Gromit%3A+The+Best+of+Aardman+Animation+(1996) 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
64 4.445229681978798 283 64 Shawshank Redemption, The (1994) 01-Jan-1994 http://us.imdb.com/M/title-exact?Shawshank%20Redemption,%20The%20(1994) 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
603 4.3875598086124405 209 603 Rear Window (1954) 01-Jan-1954 http://us.imdb.com/M/title-exact?Rear%20Window%20(1954) 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0
12 4.385767790262173 267 12 Usual Suspects, The (1995) 14-Aug-1995 http://us.imdb.com/M/title-exact?Usual%20Suspects,%20The%20(1995) 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0
50 4.3584905660377355 583 50 Star Wars (1977) 01-Jan-1977 http://us.imdb.com/M/title-exact?Star%20Wars%20(1977) 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 0
178 4.344 125 178 12 Angry Men (1957) 01-Jan-1957 http://us.imdb.com/M/title-exact?12%20Angry%20Men%20(1957) 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
> users <- read.table('C:/work/ClearPoint/Data20/data/u.user', header=FALSE, sep='|',
+ col.names=c('userid','age','gender','occupation','zipcode'));
> summary(users);
userid age gender occupation zipcode
Min. : 1.0 Min. : 7.00 F:273 student :196 55414 : 9
1st Qu.:236.5 1st Qu.:25.00 M:670 other :105 55105 : 6
Median :472.0 Median :31.00 educator : 95 10003 : 5
Mean :472.0 Mean :34.05 administrator: 79 20009 : 5
3rd Qu.:707.5 3rd Qu.:43.00 engineer : 67 55337 : 5
Max. :943.0 Max. :73.00 programmer : 66 27514 : 4
(Other) :335 (Other):909
> table(users$age);
7 10 11 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44
1 1 1 5 3 6 5 14 18 23 32 27 37 28 33 38 34 35 36 32 39 25 28 26 17 27 21 19 17 22 21 10 21 13 23
45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 68 69 70 73
15 12 14 20 19 20 20 6 12 4 11 6 9 3 3 9 3 2 3 2 3 1 2 2 3 1
> plot(users$gender);
> hist(users$age);
> library(lattice);
> xyplot(occupation~age,users);
> xyplot(occupation~age|gender,users);
> histogram(~age|gender, data=users, layout=c(1, 2), breaks=max(users$age));
> histogram(~age|gender, data=users, layout=c(1, 2), breaks=max(users$age));
> histogram(~age|occupation, data=users,col=heat.colors(10));
> histogram(~gender|occupation, data=users, col=heat.colors(2));
First prep data using Pig:
votes = LOAD '../data/u.data' USING PigStorage('\t') AS (userId,itemId,rating,timestamp);
movies = LOAD '../data/u.item' USING PigStorage('|') AS (movieId,movieTitle,releaseDate,videoReleaseDate,
imdbURL,unknown,Action,Adventure,Animation,Childrens,Comedy,Crime,Documentary,Drama,Fantasy,FilmNoir,
Horror,Musical,Mystery,Romance,SciFi,Thriller,War,Western);
movieVotesGroup = GROUP votes BY itemId;
movieVotes = FOREACH movieVotesGroup GENERATE FLATTEN(group) AS movieId,
AVG(votes.rating) AS avgRating, COUNT(votes) AS numVotes;
moviesWithVotes = JOIN movieVotes BY movieId,movies BY movieId;
movieRatings = FOREACH moviesWithVotes GENERATE movieVotes::movieId, movieVotes::avgRating,
movieVotes::numVotes, REGEX_EXTRACT(movies::releaseDate,'(.*)-(.*)-(.*)',3), movies::movieTitle;
STORE movieRatings INTO 'output/movieRatings' USING PigStorage('\t');
output/movieRatings:
1 3.8783185840707963 452 1995 Toy Story (1995)
2 3.2061068702290076 131 1995 GoldenEye (1995)
3 3.033333333333333 90 1995 Four Rooms (1995)
4 3.550239234449761 209 1995 Get Shorty (1995)
5 3.302325581395349 86 1995 Copycat (1995)
> ratings <- read.table('C:/work/ClearPoint/Data20/pig/output/movieRatings/part-r-00000', header=FALSE, sep='\t',
col.names=c('movieid','avgrating','numvotes','year','title'));
> summary(ratings);
movieid avgrating numvotes year title
Min. : 1.0 Min. :1.000 Min. : 1.00 Min. :1930 Butcher Boy, The (1998) : 2
1st Qu.: 326.0 1st Qu.:2.750 1st Qu.: 11.00 1st Qu.:1989 Desperate Measures (1998) : 2
Median : 697.0 Median :3.278 Median : 42.00 Median :1994 Money Talks (1997) : 2
Mean : 710.7 Mean :3.171 Mean : 73.93 Mean :1988 Substance of Fire, The (1996): 2
3rd Qu.: 984.0 3rd Qu.:3.741 3rd Qu.:101.00 3rd Qu.:1996 1-900 (1994) : 1
Max. :1682.0 Max. :5.000 Max. :583.00 Max. :1998 12 Angry Men (1957) : 1
NA's : 1 (Other) :803
Looks like we have some issues, source data has duplicate movies (with different movie Ids) and we are missing a year...
> smoothScatter(ratings$year,ratings$avgrating);
Doesn't look like it....
> cor(ratings$year,ratings$avgrating, use='complete.obs');
[1] -0.206226
Looks like there is some correlation
> cor(ratings$year,ratings$numvotes, use='complete.obs');
[1] -0.001756821
Only a very, very weak correlation between year and number of votes
> cor(ratings$avgrating,ratings$numvotes, use='complete.obs');
[1] 0.468041
Interesting there is quite a strong correlation between average rating and number of votes. Wonder why?
> ratings$decade <- trunc(ratings$year / 10) * 10;
> histogram(~avgrating|as.factor(decade),ratings,layout=c(1, 7));
REGISTER lib/piggybank.jar;
DEFINE LogLoader org.apache.pig.piggybank.storage.apachelog.CombinedLogLoader();
DEFINE HourExtractor org.apache.pig.piggybank.evaluation.util.apachelogparser.DateExtractor('dd/MMM/yyyy:HH:mm:ss Z','yyyy-MM-dd HH:00','Pacific/Auckland');
accesslog = LOAD '../data/access.log' USING LogLoader AS (remoteAddr, remoteLogname, user, time, method,
uri, proto, status, bytes, referer, userAgent);
byHour = GROUP accesslog BY HourExtractor(time);
hourCounts = FOREACH byHour GENERATE group as hourOfDay, COUNT(accesslog) as numberOfHits;
hourCountsOrdered = ORDER hourCounts by hourOfDay;
STORE hourCountsOrdered INTO 'output/hitsbyhour' using PigStorage('\t');
output/hitsbyhour
2011-08-15 20:00 47
2011-08-15 21:00 9
2011-08-15 22:00 28
2011-08-15 23:00 48
2011-08-16 00:00 22
2011-08-16 01:00 74
2011-08-16 02:00 34
2011-08-16 03:00 27
2011-08-16 04:00 38
2011-08-16 05:00 32
2011-08-16 06:00 20
2011-08-16 07:00 33