mmas.github.io/querying-hive.html at master · mmas/mmas.github.io · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
<!DOCTYPE html>
<html lang="en" itemscope itemtype="http://schema.org/Article">
  <head>
    <title>Querying in Hive</title>


    <meta charset="utf-8">
    <meta property="og:title" content="Querying in Hive">
    <meta property="og:site_name" content="Modesto Mas | Blog">
    <meta property="og:image" content="https://mmas.github.io/images/profile.jpg">
    <meta property="og:image:width" content="200">
    <meta property="og:image:height" content="200">
    <meta property="og:url" content="https://mmas.github.io/querying-hive">
    <meta property="og:locale" content="en_GB">
    <meta name="twitter:image" content="https://mmas.github.io/images/profile.jpg">
    <meta name="twitter:url" content="https://mmas.github.io/querying-hive">
    <meta name="twitter:card" content="summary">
    <meta name="twitter:domain" content="mmas.github.io">
    <meta name="twitter:title" content="Querying in Hive">
    <meta name="description" content="In the previous entry we created some Hive tables and put data into them, here we are going to see how to retrieve, aggregate and filter data from our...">
    <meta name="twitter:description" content="In the previous entry we created some Hive tables and put data into them, here we are going to see how to retrieve, aggregate and filter data from our...">
    <meta property="og:description" content="In the previous entry we created some Hive tables and put data into them, here we are going to see how to retrieve, aggregate and filter data from our...">
    <meta name="keywords" content="data-analysis,data-warehousing,hadoop,hive">
    <meta property="og:type" content="blog">
    <meta name="viewport" content="width=device-width, initial-scale=1">

<meta property="og:type" content="article">
<meta property="article:author" content="https://github.com/mmas">
<meta property="article:section" content="data-analysis">
<meta property="article:tag" content="data-analysis,data-warehousing,hadoop,hive">
<meta property="article:published_time" content="2015-10-21">
<meta property="article:modified_time" content="2015-10-21">

    <link rel="stylesheet" type="text/css" href="/css/main.css">
    <script type="text/x-mathjax-config">
    MathJax.Hub.Config({
        CommonHTML: {
            scale: 93,
            showMathMenu: false
        },
        tex2jax: {
            "inlineMath": [["$","$"], ["\\(","\\)"]]
        }
    });
    </script>
    <script type="text/javascript" async src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-MML-AM_CHTML"></script>
  </head>
  <body class="entry-detail">


    <header>
  <div>
    <img src="https://mmas.github.io/images/profile.jpg">
    <a class="brand" href="/">Modesto Mas</a>
    <span>Data/Python/DevOps Engineer</span>
    <nav>
      <ul>
        <li><a href="/tags">Tags</a></li>
        <li><a href="https://github.com/mmas/mmas.github.io/issues" target="_blank">Issues</a></li>
      </ul>
    </nav>
  </div>
</header>


    <section id="content" role="main">


<article>

<header>
  <h1><a href="/querying-hive">Querying in Hive</a></h1>
  <time datetime="2015-10-21">Oct 21, 2015</time>

  <a class="tag" href="/tags?tag=data-analysis">data-analysis</a>

  <a class="tag" href="/tags?tag=data-warehousing">data-warehousing</a>

  <a class="tag" href="/tags?tag=hadoop">hadoop</a>

  <a class="tag" href="/tags?tag=hive">hive</a>

</header>

  <aside id="article-nav"></aside>
  <section class="body">
    <p>In the previous entry we created some Hive tables and put data into them, here we are going to see how to retrieve, aggregate and filter data from our tables by different examples.</p>
<p>The data used comes from <a target="_blank" href="http://www.who.org/">World Health Organization</a> and represents many fields about demographics, socioeconomisc, risks... by country. This edited version can be found in <a target="_blank" href="http://www.exploredata.net/Downloads/WHO-Data-Set">http://www.exploredata.net/Downloads/WHO-Data-Set</a>. I've simplified the version and created a Hive table using the following mapper function and HiveQL commands:</p>
<pre><code class="python">#!/usr/bin/env python

import sys

for line in sys.stdin:
    if line.startswith('&quot;'):
        index = line.find('&quot;', 1)
        fields = [line[1:index]]+line[index+2:].split(',')
    else:
        fields = line.split(',')
    print '\t'.join([
        fields[1],  # Id.
        fields[0],  # Name.
        fields[2],  # Continent id.
        str(int(fields[8])*1000) if fields[8] else '',  # Population.
        str(float(fields[10])/100) if fields[10] else '',  # Urban population.
        str(float(fields[14])/100) if fields[14] else '',  # Population &gt; 60.
        str(float(fields[15])/100) if fields[15] else '',  # Population &lt; 15.
        fields[5],  # GNI per capita.
        fields[129]])  # Life expectancy.
</code></pre>

<pre><code class="hiveql">CREATE TABLE who_raw (value STRING)
TBLPROPERTIES('skip.header.line.count'='1');

CREATE TABLE who (
    country_id INT,
    country_name STRING,
    continent_id INT,
    population INT,
    population_urban FLOAT,
    population_60 FLOAT,
    population_15 FLOAT,
    gni FLOAT,
    life_expectancy INT);

LOAD DATA LOCAL INPATH '/opt/data/who/WHO.csv'
INTO TABLE who_raw;

FROM who_raw
INSERT OVERWRITE TABLE who
MAP value
USING '/opt/data/who/mapper.py'
AS country_id, country_name, continent_id, population, population_urban,
population_60, population_15, gni, life_expectancy;

DROP TABLE who_raw;
</code></pre>

<h2>Select</h2>
<p>The basic way to query data in Hive is using <code>SELECT</code>:</p>
<pre><code class="hiveql">SELECT * FROM who LIMIT 12;
</code></pre>

<pre><code class="stdout">1   Afghanistan 1   26088000    0.23    0.47    0.06    NULL    42
2   Albania 2   3172000 0.46    0.26    0.9 6000.0  71
3   Algeria 3   33351000    0.64    0.29    0.9 5940.0  71
4   Andorra 2   74000   0.93    0.14    0.9 NULL    82
5   Angola  3   16557000    0.54    0.46    0.29    3890.0  41
6   Antigua and Barbuda 4   84000   0.37    0.28    NULL    15130.0 73
7   Argentina   5   39134000    0.9 0.26    0.9 11670.0 75
8   Armenia 2   3010000 0.64    0.2 0.9 4950.0  69
9   Australia   6   20530000    0.88    0.19    0.9 33940.0 82
10  Austria 2   8327000 0.66    0.16    0.9 36040.0 80
11  Azerbaijan  2   8406000 0.52    0.24    0.9 5430.0  64
12  Bahamas 4   327000  0.91    0.27    NULL    NULL    74
</code></pre>

<p>Retrieve only country name and population:</p>
<pre><code class="hiveql">SELECT country_name, population FROM who LIMIT 12;
</code></pre>

<pre><code class="stdout">Afghanistan 26088000
Albania 3172000
Algeria 33351000
Andorra 74000
Angola  16557000
Antigua and Barbuda 84000
Argentina   39134000
Armenia 3010000
Australia   20530000
Austria 8327000
Azerbaijan  8406000
Bahamas 327000
</code></pre>

<p>Retrieve unique values:</p>
<pre><code class="hiveql">SELECT DISTINCT continent_id FROM who;
</code></pre>

<pre><code class="stdout">1
2
3
4
5
6
7
</code></pre>

<p>Filtering with <code>WHERE</code>:</p>
<pre><code class="hiveql">SELECT country_name FROM who WHERE continent_id = 4;
</code></pre>

<pre><code class="stdout">Antigua and Barbuda
Bahamas
Barbados
Bermuda
Canada
Mexico
United States of America
</code></pre>

<pre><code class="hiveql">SELECT country_name FROM who WHERE population &gt; 100000000;
</code></pre>

<pre><code class="stdout">Bangladesh
Brazil
China
India
Indonesia
Japan
Mexico
Nigeria
Pakistan
Russia
United States of America
</code></pre>

<h2>Select subqueries</h2>
<p>Nest queries to create more complex queries:</p>
<pre><code class="hiveql">WITH q AS (SELECT * FROM who WHERE continent_id = 1)
SELECT country_name, population
FROM q;
</code></pre>

<p>or </p>
<pre><code class="hiveql">SELECT country_name, population
FROM (SELECT * FROM who WHERE continent_id = 1) q;
</code></pre>

<pre><code class="stdout">Afghanistan 26088000
Bahrain 739000
Cape Verde  519000
Djibouti    819000
Egypt   74166000
Iran (Islamic Republic of)  70270000
Iraq    28506000
Israel  6810000
Jordan  5729000
Kuwait  2779000
Libyan Arab Jamahiriya  6039000
Malta   405000
Morocco 30853000
Oman    2546000
Qatar   821000
Saudi Arabia    24175000
Syria   19408000
Tunisia 10215000
United Arab Emirates    4248000
West Bank and Gaza  NULL
Yemen   21732000
</code></pre>

<h2>Joins</h2>
<p>To do some examples to combine rows, first, create continents table to combine together with the "who" table:</p>
<pre><code class="hiveql">CREATE TABLE continents (id INT, name STRING);

INSERT INTO TABLE continents
VALUES (1, 'Middle East'),
       (2, 'Europe'),
       (3, 'Africa'),
       (4, 'North America'),
       (5, 'South America'),
       (6, 'Oceania'),
       (7, 'Asia');
</code></pre>

<p>Display the names of the countries:</p>
<pre><code class="hiveql">SELECT t1.country_name AS country, t2.name AS continent
FROM who t1
JOIN continents t2
ON t1.continent_id = t2.id
LIMIT 12;
</code></pre>

<p>or an equivalent with an implicit join:</p>
<pre><code class="hiveql">SELECT t1.country_name AS country, t2.name AS continent
FROM who t1, continents t2
WHERE t1.continent_id = t2.id
LIMIT 12;
</code></pre>

<pre><code class="stdout">Afghanistan Middle East
Albania Europe
Algeria Africa
Andorra Europe
Angola  Africa
Antigua and Barbuda North America
Argentina   South America
Armenia Europe
Australia   Oceania
Austria Europe
Azerbaijan  Europe
Bahamas North America
</code></pre>

<p>In our case, an outer join (left or right) will make the same effect joining the tables "who" and "continents", since all the countries have defined a continent id found in the "continents" table and all the ids from the "continents" table can be found in the "who" table. Just take into account that <code>JOIN</code> will return the matches in both tables and <code>LEFT JOIN</code> will return all the rows in the left table, <code>RIGHT JOIN</code> all the rows of the right table and <code>FULL JOIN</code> all the rows in both tables. If there's no match in a table it will return <code>NULL</code>. For the <code>CROSS JOIN</code>, it will return all the row combinations of the left an right tables, which is the same as an inner join with no <code>ON</code> condition.</p>
<h2>Unions</h2>
<p>Unions merge data vertically. Let's create a table with the countries with a GNI greater than $40,000 to do some examples with unions:</p>
<pre><code class="hiveql">CREATE TABLE countries_gni40000 (name STRING, gni INT);

INSERT OVERWRITE TABLE countries_gni40000
SELECT country_name, gni FROM who WHERE gni &gt;= 40000;

SELECT * FROM countries_gni40000;
</code></pre>

<pre><code class="stdout">Brunei Darussalam   49900
Kuwait  48310
Luxembourg  60870
Norway  50070
Singapore   43300
Switzerland 40840
United States of America    44070
</code></pre>

<p>Display the north american countries and the countries with a GNI greater than $40,000:</p>
<pre><code class="hiveql">SELECT t1.country_name AS name
FROM who t1
WHERE t1.continent_id = 4
UNION ALL
SELECT t2.name AS name
FROM countries_gni40000 t2;
</code></pre>

<pre><code class="stdout">Antigua and Barbuda
Bahamas
Barbados
Bermuda
Canada
Mexico
United States of America
Brunei Darussalam
Kuwait
Luxembourg
Norway
Singapore
Switzerland
United States of America
</code></pre>

<p>Same query but removing the duplicates:</p>
<pre><code class="hiveql">SELECT DISTINCT name FROM (
    SELECT country_name AS name
    FROM who
    WHERE continent_id = 4
    UNION ALL
    SELECT name AS name
    FROM countries_gni40000
) q;
</code></pre>

<pre><code class="stdout">Antigua and Barbuda
Bahamas
Barbados
Bermuda
Brunei Darussalam
Canada
Kuwait
Luxembourg
Mexico
Norway
Singapore
Switzerland
United States of America
</code></pre>

<h2>Order and sort</h2>
<p>The difference between <code>ORDER BY</code> and <code>SORT BY</code> is that the first one the sorted order is applied after the reducer and the in the second one it is applied before.</p>
<p>Sort/order countries by name:</p>
<pre><code class="hiveql">SELECT name FROM countries_gni40000 ORDER BY name DESC;
</code></pre>

<p>or:</p>
<pre><code class="stdout">SELECT name FROM countries_gni40000 SORT BY name DESC;
</code></pre>

<pre><code class="stdout">United States of America
Switzerland
Singapore
Norway
Luxembourg
Kuwait
Brunei Darussalam
</code></pre>

<p>Order by continent, country name and create a rank with reversed order by population, aggregated using <code>PARTITION BY</code> (do not mix up with partitions when creating tables; more about aggregations in the next section):</p>
<pre><code class="hiveql">SELECT
    continent_id,
    country_name,
    rank() OVER (PARTITION BY continent_id ORDER BY population DESC) AS rank,
    population
FROM who
ORDER BY continent_id, country_name
LIMIT 30;
</code></pre>

<pre><code class="stdout">1   Afghanistan 5   26088000
1   Bahrain 18  739000
1   Cape Verde  19  519000
1   Djibouti    17  819000
1   Egypt   1   74166000
1   Iran (Islamic Republic of)  2   70270000
1   Iraq    4   28506000
1   Israel  10  6810000
1   Jordan  12  5729000
1   Kuwait  14  2779000
1   Libyan Arab Jamahiriya  11  6039000
1   Malta   20  405000
1   Morocco 3   30853000
1   Oman    15  2546000
1   Qatar   16  821000
1   Saudi Arabia    6   24175000
1   Syria   8   19408000
1   Tunisia 9   10215000
1   United Arab Emirates    13  4248000
1   West Bank and Gaza  21  NULL
1   Yemen   7   21732000
2   Albania 39  3172000
2   Andorra 49  74000
2   Armenia 40  3010000
2   Austria 23  8327000
2   Azerbaijan  22  8406000
2   Belarus 20  9742000
2   Belgium 16  10430000
2   Bosnia and Herzegovina  36  3926000
2   Bulgaria    24  7693000
</code></pre>

<h2>Aggregation</h2>
<p>Number of countries with a life expectancy bigger than 80 years:</p>
<pre><code class="hiveql">SELECT count(*) AS total
FROM who
WHERE life_expectancy &gt; 80;
</code></pre>

<pre><code class="stdout">13
</code></pre>

<p>Number of countries by continent:</p>
<pre><code class="hiveql">SELECT continent_id, count(*) AS countries
FROM who
GROUP BY continent_id;
</code></pre>

<pre><code class="stdout">1   21
2   51
3   48
4   7
5   31
6   35
7   9
</code></pre>

<p>Multiple aggregate functions. Minimum and maximum population in a country by continent:</p>
<pre><code class="hiveql">SELECT continent_id, min(population), max(population)
FROM who
GROUP BY continent_id;
</code></pre>

<pre><code class="stdout">1   405000  74166000
2   31000   143221000
3   86000   144720000
4   84000   302841000
5   50000   189323000
6   2000    228864000
7   300000  1328474000
</code></pre>

<p>Conditionals. Number of countries with largely urban population:</p>
<pre><code class="hiveql">SELECT sum(CASE WHEN population_urban &gt; 0.5 THEN 1 ELSE 0 END)
FROM who;
</code></pre>

<p>or:</p>
<pre><code class="hiveql">SELECT sum(if(population_urban &gt; 0.5, 1, 0))
FROM who;
</code></pre>

<pre><code class="stdout">112
</code></pre>

<p>Unique values. Count continents:</p>
<pre><code class="hiveql">SELECT count(DISTINCT continent_id)
FROM who;
</code></pre>

<p>or, with better performance for big tables using multiple reducers:</p>
<pre><code class="hiveql">SELECT count(*)
FROM (SELECT DISTINCT continent_id FROM who) q;
</code></pre>

<pre><code class="stdout">7
</code></pre>

<p>When using groups and unions, the next operation will be done in different stages:</p>
<pre><code class="hiveql">SELECT continent_id, NULL AS life_expectancy, count(country_name)
FROM who
GROUP BY continent_id
UNION ALL
SELECT NULL AS continent_id, life_expectancy, count(country_name)
FROM who
GROUP BY life_expectancy
LIMIT 12;
</code></pre>

<p>for a better performance use <code>GROUPING SETS</code>, then all processes will be completed in one single stage job.</p>
<pre><code class="hiveql">SELECT continent_id, life_expectancy, count(country_name)
FROM who
GROUP BY continent_id, life_expectancy
GROUPING SETS(continent_id, life_expectancy)
LIMIT 12;
</code></pre>

<pre><code class="stdout">1   NULL    1
1   42  1
1   56  2
1   61  1
1   68  1
1   70  2
1   71  2
1   72  4
1   74  1
1   75  1
1   77  1
1   78  2
</code></pre>

<p>For a more combinations of groups may be useful to use <code>ROLLUP</code> to combine n+1 aggregated columns:</p>
<pre><code class="hiveql">SELECT continent_id, life_expectancy, count(country_name)
FROM who
GROUP BY continent_id, life_expectancy WITH ROLLUP
LIMIT 12;
</code></pre>

<p>is equivalent to:</p>
<pre><code class="hiveql">SELECT continent_id, life_expectancy, count(country_name)
FROM who
GROUP BY continent_id, life_expectancy
GROUPING SETS((continent_id, life_expectancy), (continent_id), ())
LIMIT 12;
</code></pre>

<pre><code class="stdout">NULL    NULL    202
1   NULL    21
1   NULL    1
1   42  1
1   56  2
1   61  1
1   68  1
1   70  2
1   71  2
1   72  4
1   74  1
1   75  1
</code></pre>

<p>Following with the combinations of groups, <code>CUBE</code> will combine all the possible combinations (2^n):</p>
<pre><code class="hiveql">SELECT continent_id, life_expectancy, count(country_name)
FROM who
GROUP BY continent_id, life_expectancy WITH CUBE
LIMIT 12;
</code></pre>

<p>is equivalent to:</p>
<pre><code class="hiveql">SELECT continent_id, life_expectancy, count(country_name)
FROM who
GROUP BY continent_id, life_expectancy
GROUPING SETS((continent_id, life_expectancy), (continent_id), (life_expectancy), ())
LIMIT 12;
</code></pre>

<pre><code class="stdout">NULL    NULL    202
NULL    NULL    9
NULL    40  1
NULL    41  1
NULL    42  4
NULL    43  2
NULL    44  1
NULL    46  3
NULL    47  2
NULL    48  3
NULL    49  1
NULL    50  4
</code></pre>

<p>Filtered aggregation. Get the average life expectancy by continent when this is greated than 70. The next two operations are equivalent:</p>
<pre><code class="hiveql">SELECT continent_id, mean_le
FROM (
    SELECT continent_id, avg(life_expectancy) AS mean_le
    FROM who
    GROUP BY continent_id
) q
WHERE mean_le &gt; 70;
</code></pre>

<pre><code class="hiveql">SELECT continent_id, avg(life_expectancy) AS mean_le
FROM who
GROUP BY continent_id
HAVING mean_le &gt; 70;
</code></pre>

<pre><code class="stdout">2   74.88235294117646
4   75.83333333333333
5   71.65517241379311
</code></pre>
  </section>
</article>


    </section>
    <footer></footer>


<script type="text/javascript" src="/js/article.js"></script>

  </body>
</html>