大数据开发面试必会的SQL 30题！！！（二）-阿里云开发者社区

8,徐雨秋 9,孙皓然

需求：现在我们想通过table1表获取获奖员工的更多信息。

select * from test.`16_table1` inner join `16_table2` `16t2` on `16_table1`.id = `16t2`.id;

运行结果：

（17）计算用户留存情况

数据源：user_login.csv

uid,login_time
1,2021/4/21 6:00
1,2021/4/24 10:00
1,2021/4/25 19:00
2,2021/4/22 10:00
2,2021/4/28 9:00
2,2021/4/29 14:00
3,2021/4/27 8:00
3,2021/4/28 10:00

需求：我们想看用户的次日留存数、三日留存数、七日留存数（只要用户首次登录以后再登录就算留存下来了），该怎么实现呢？

解题思路：本题有两种解题思路，先看思路一：按照用户时间求出七日留存，首先按uid分组，求出每个uid的第一次登陆时间和最后一次登陆时间，算出中间间隔的时间，如果间隔为1就是次日留存，间隔为3就是3日留存，间隔为7就是7日留存，以此类推分别求出他们的数量。思路二：按照当日时间求出七日留存，如果用户登陆的时间正好等于当前日期前一天的日期，则去重统计uid数量即为次日留存数，以此类推分别求出三日留存、七日留存。

-- 写法一，按照用户时间求出七日留存
select
       count(case when user_day_value.day_value = 1 then uid end ) as '次日留存',
       count(case when user_day_value.day_value = 3 then uid end ) as '三日留存',
       count(case when user_day_value.day_value = 7 then uid end ) as '七日留存'
     (select uid,
       max(login_time),
       min(login_time),
       datediff(date (max(login_time)) , date (min(login_time))) as day_value
        test.user_login
     group by





    
         uid) as user_day_value;
-- 写法二，按照当日时间求出七日留存
select
       count(distinct case when date(login_time) = date_sub(date (current_date),interval 1 day) then uid end ) as '次日留存',
       count(distinct case when date(login_time) > date_sub(date (current_date),interval 3 day) then uid end ) as '三日留存',
       count(distinct case when date(login_time) > date_sub(date (current_date),interval 7 day) then uid end ) as '七日留存'
from test.user_login;

运行结果：

写法一：

写法二：

（18）筛选最受欢迎的课程

数据源：course_table.csv

id,name,grade,course
1,王小凤,一年级,心理学
2,刘诗迪,二年级,心理学
3,李思雨,三年级,社会学
4,张文华,一年级,心理学
5,张青云,二年级,心理学
6,徐文杰,三年级,计算机
7,李智瑞,一年级,心理学
8,徐雨秋,二年级,计算机
9,孙皓然,三年级,社会学
10,李春山,一年级,社会学

需求：现在我们想知道最受欢迎的课程是哪一门。

解题思路：解法一通过按照course分组，按照course_count降序排序，取第一个值即为最受欢迎的课程，但是这种写法在一定程度上不严谨，比如说course_count最大值一样多，也就所说的并列第一，那么使用第一种方法就不能把所有最喜欢的课程都取出来。来看看第二种写法，通过使用rank窗口函数得出并列第一的值，通过where = 1，即可以全部取出最受欢迎的课程。

-- 18.筛选最受欢迎的课程
-- 解法一：不取重复值
select * from test.course_table;
select course,





    
       count(course) as course_count
     test.course_table
group by
         course
order by
         course_count desc limit 1;
-- 解法二：取重复值
select
       course,
       course_count
     (select
             course,
             count(course) as course_count,
             rank() over (order by count(course) desc ) as rank_num
          test.course_table
     group by
              course) as t
where rank_num = 1;

运行结果：

（19）筛选出每个年级最受欢迎的三门课程

数据源：course_table.csv

需求：现在我们想知道每个年级最受欢迎的三门课程

解题思路：通过运用窗口函数，分组求Top N

-- 19.筛选出每个年级最受欢迎的三门课程
-- 解法一：利用两个窗口函数
select
     (select
       a.grade,
       a.course,
       a.count_num,
       row_number() over (partition by grade order by count_num desc) as count_rank
          (select grade,





    
                  course,
                  count(course) over (partition by grade,course) as count_num
          from test.course_table) as a
     group by a.grade,a.course,a.count_num) as b
where
      b.count_rank < 4 ;
-- 解法二
select * from test.course_table;
select
     (select
       a.grade,
       a.course,
       a.count_num,
       row_number() over (partition by grade order by count_num desc) as count_rank from
        (select
          grade,
          course ,
          count(id) as count_num
             test.course_table
        group by
                 grade,
                 course) as a) as b
where b.count_rank < 4;

运行结果：

根据此题也可以求出每个年级最受欢迎的课程，这里我不用窗口函数，使用连接子查询：

select
     (select
             grade,
             course ,





    
             count(id) as count_num
          test.course_table
     group by
              grade,
              course) as t1
where
      t1.count_num =
      (select
              t2.count_num
      from (select
                   grade,
                   course ,
                   count(id) as count_num
           test.course_table
      group by
               grade,
               course) as t2
      where
            t1.grade = t2.grade
      order by
               t2.count_num desc limit 1);

运行结果：

（20）求累积和

数据源：consum_order_table.csv

order_id,uid,amount
201901,1,10
201902,2,20
201903,3,15
201904,3,15
201905,4,20





    
201906,4,20
201907,5,25
201908,5,25
201909,6,30
201910,6,30
201911,7,35
201912,7,35

需求：现在我们想看下80%的订单金额最少是由多少用户贡献的

解题思路：第一步按uid分组，求出每个uid的amount和，第二步在生成累积和的时候需要按照订单金额进行降序排列，这样就可以得到最少的人数，第三步利用子查询获取到全部的订单金额，求出小于总额80%的uid的数量

select
       count(uid)
     (select
       sum_amount,
       sum(sum_amount) over (order by sum_amount desc ) as consume_amount
            (select
                    sum(amount) as sum_amount
                 test.consum_order_table
            group by uid) as uid_table) as t
where
      t.consume_amount < (select (sum(amount) * 0.8) as sum_80 from test.consum_order_table);

运行结果：

（21）获取新增用户数

数据源：user_reg_table.csv

uid,reg_time
1,2019/12/25 10:00
2,2019/12/26 10:00
3,2019/12/27 10:00
4,2019/12/28 10:00
5,2019/12/29 10:00
6,2019/12/30 10:00





    
7,2019/12/31 10:00
8,2020/1/1 10:00
9,2020/1/2 10:00
10,2020/1/3 10:00
11,2020/1/4 10:00

需求：我们想获取某一天的新增用户数，以及该天对应的过去7天内每天的平均新增用户数

解题思路：我们以2020-1-1那一天为例，首先通过case when 让reg_time = 设定的那一天，求出uid的数量即为某一天的新增用户数。然后求出2020-1-1向前7天的reg_time，求出uid的数量除以7即可得出过去7天内每天的平均新增用户数

-- 假设求2020-1-1那一天新增的用户
set @day_date = '2020-01-01';
select
       count(case when date (reg_time) = @day_date then uid end ) as new_count,
       (count(uid) / 7) as 7_avg_count
     test.user_reg_table
where
      date (reg_time) between date_sub(@day_date,interval 6 day) and @day_date;

运行结果：

（22）获取用户首次购买时间

数据源：first_order_table.csv

order_id,uid,order_time
201901,1,2020-01-01 10:00:00
201902,2,2020-01-02 10:00:00
201903,3,2020-01-03 10:00:00
201904,1,2020-01-04 10:00:00
201905,2,2020-01-05 10:00:00
201906,3,2020-01-06 10:00:00
201907,1,2020-01-07 10:00:00
201908,2,2020-01-08 10:00:00
201909,3,2020-01-09 10:00:00
201910,1,2020-01-10 10:00:00





    
201911,2,2020-01-11 10:00:00

需求：我们想获取每个用户的首次购买时间，以及首次购买时间是否在最近7天内

解题思路：首先按uid分组，求出每个uid的首次订单时间，然后通过case when判断首次订单时间是否大于当日七天之前的时间，如果大于则首次购买时间在最近7天内，如果小于则首次购买时间不在最近7天内

select
       min(order_time) as first_time,
       (case when date_sub(current_date(),interval 7 day ) < min(date(order_time)) then 'yes' else 'no' end) as is_7_day
     test.first_order_table
group by

运行结果：

（23）同时获取用户和订单数据

数据源：user_reg_table.csv、first_order_table.csv

需求：现在我们想获取过去7天每天的新增用户数、订单数、下单用户数

解题思路：首先求过去7天每天新增用户数，再求过去7天每天新增订单数、下单用户数，两个表join即可得到需求

-- 假设以2020-01-04为例
set @day = '2020-01-04';
select
       a.tdate,
       a.count_uid,
       b.order_count,
       b.uesr_count
-- 求过去7天每天新增用户数
     (select
       date (reg_time) as tdate,
       count(uid) over(partition by date (reg_time)) as count_uid
          test.user_reg_table
     where
           date (reg_time) between date_sub(@day,interval 7 day) and @day) as a
left join
-- 求过去7天每天新增订单数、下单用户数
     (select





    
             date (order_time) as tdate,
             count(order_id) as order_count ,
             count(uid) as uesr_count
          test.first_order_table
     where
           date (order_time) between date_sub(@day,interval 7 day) and @day
     group by date (order_time)) as b
on a.tdate = b.tdate;

运行结果：

（24）随机抽样

数据源：user_reg_table.csv、first_order_table.csv

需求：现在我们想从用户表中随机抽取5个用户，并获取这5个用户的历史购买订单数

解题思路：我们要随机抽取5个用户并获取他们的历史购买订单数，首先需要生成每个用户的历史购买订单数，然后从中随机抽取5个。具体的思路为利用rand()函数生成随机数，然后利用order by进行排序，最后利用limit将前5条数据显示出来，运行结果如下表所示。

select
       user_reg_table.uid,
       t.order_count
     test.user_reg_table
left join
         (select
                 count(order_id) as order_count
              test.first_order_table
         group by
                  uid) as t
on user_reg_table.uid = t.uid
order by rand() limit 5;

运行结果：

（25）获取沉默用户数

数据源：user_reg_table.csv、first_order_table.csv

需求：现在我们想获取沉默用户的数量，沉默的定义是已注册但最近30天内没有购买记录的用户

解题思路：先求出最近三十天已经有购买记录的用户，再求出所有已经注册的用户。两个表连接，得出第一列为所有已经注册的用户，第二列则为最近三十天购买记录的用户，如果不能连接为null，最后求出null值的个数即为最近30天有购买记录的用户

-- 获取已经注册用户





    
select
       count(t1.uid) as slient_count
     test.user_reg_table as t1
-- 获取最近30天有购买记录的用户
left join
         (select
              test.first_order_table
         where
               date (order_time)
                   between current_date and date_sub(current_date,interval 30 day)
         group by uid) t2
    t1.uid = t2.uid
where
      t2.uid is null;

（26）获取新用户的订单数

数据源：user_reg_table.csv、first_order_table.csv

需求：现在我们想获取最近7天注册的新用户在最近7天内的订单数是多少

解题思路：第一步，求出最近7天注册的新用户，第二步，求出最近7天下订单的用户和订单数，第三步，两表连接，第二列为null值为最近7天新注册没下订单的用户，否则为下订单的用户，求出第三列的个数即为获取最近7天注册的新用户在最近7天内的订单数

select
       sum(t2.order_count)
-- 获取最近7天注册的新用户
       (select
            test.user_reg_table
       where date (reg_time) between current_date and date_sub(current_date,interval 7 day )) as t1
left join
-- 获取最近7天下订单的用户和订单数
       (select
               count(order_id) as order_count
            test.first_order_table
       where date (order_time) between current_date and date_sub(current_date,interval 7 day )





    
       group by
                uid) as t2
on t1.uid = t2.uid;

运行结果：

（27）获取借款到期名单

数据源：loan_table.csv

id,loan_time,expire_time,reback_time,amount,status
1,2019/12/1,2019/12/31,,2208,0
2,2019/12/1,2019/12/31,2019/12/31,5283,1
3,2019/12/5,2020/1/4,,5397,0
4,2019/12/5,2020/1/4,,4506,0
5,2019/12/10,2020/1/9,,3244,0
6,2019/12/10,2020/1/9,2020/1/12,4541,1
7,2020/1/1,2020/1/31,2020/1/10,3580,1
8,2020/1/1,2020/1/31,,7045,0
9,2020/1/5,2020/2/4,,2067,0
10,2020/1/5,2020/2/4,,7225,0

需求：这张表包含id、loan_time（借款时间）、expire_time（到期时间）、reback_time（还款时间）、amount（金额）和status（还款状态，1表示已还款、0表示未还款）六个字段，我们想获取每天到期的借款笔数、借款金额和平均借款天数

select
       count(id) as loan_count ,
       sum(amount) as loan_amount,
       avg(datediff(reback_time,loan_time)) as avg_day
     test.loan_table
where
      date (loan_time) = current_date;

运行结果：

（28）获取即将到期的借款信息

数据源：loan_table.csv

需求：我们想知道有多少笔借款会在未来7天内到期，以及其中有多少笔是已经还款的

解题思路：先获取今日到未来7天所有的还款日期，求出count(id)即为借款会在未来7天内到期的数量，在计算出目前状态为还款的数量。

select
       count(id) as loan_count,
       count(case when status = 1 then id end ) as reback_count
     test.loan_table
where
      expire_time between current_date and date_add(current_date,interval 7 day );

运行结果：

（29）获取历史逾期借款信息

数据源：loan_table.csv

需求：我们想知道历史逾期的借款笔数和金额，以及至今还逾期的借款笔数和金额

解题思路：这里面的关键信息在于，逾期怎么判断，对到期时间和还款时间进行比较，如果是逾期且现在已经还款的，可以直接比较到期时间和还款时间，如果还款时间大于到期时间，则说明是逾期的；还有一种是逾期且至今还未还款的，这种情况是没有还款时间的，也就是还款时间为空，但是到期时间是在今天之前，说明已经到期但是未还款。

select
       count(id) as loan_count,
       sum(amount) as loan_amount,
       count(case when status = 0 then id end ) as no_reback_count,
       sum(case when status = 0 then amount end ) as  no_reback_amount
     test.loan_table
where (current_date > expire_time and reback_time is null) or (reback_time > loan_table.expire_time);

运行结果：

（30）获取部门工资最高的员工

数据源：Employee.csv、Department.csv

Id,Name,Salary,DepartmentId
1,Joe,70000,1
2,Jim,90000,1
3,Henry,80000,2
4,Sam,60000,2
5,Max,90000,1

Id,Name





    
2,Sales

需求1：如何求出Employee表薪水最高的员工（含重复值）

-- 问题一：如何求出并列第一第二的值
-- 解法一：运用子查询
select
       a.Id,
       Name,
       Salary
     test.Employee a
         inner join
         (select
              test.Employee
         where
               Salary = (select max(Salary) from test.Employee)) b on a.Id = b.Id;
-- 解法二：运用查rank（）窗口函数
select
       Name,
       Salary
     (select
             Name,
             Salary,rank() over (order by Salary desc ) as rank_num
     from test.Employee) as t
where t.rank_num = 1;

运行结果：

需求2：求出部门工资最高的员工（含重复值）

-- 问题二：部门工资最高的员工
select
       t2.Department,
       t2.Employee,
       t2.Salary





    
     (select
             Department,
             Employee,
             Salary,
             rank() over (partition by Department order by Salary desc ) as rank_num
          (select
                  a.Name as Employee,
                  a.Salary as Salary ,
                  b.Name as Department
               test.Employee as a left join test.Department as b on a.DepartmentId = b.Id) as t) as t2
where rank_num = 1;

运行结果：

（补）求连续登陆三天的用户

表结构：

解题思路：

第一步首先按照id，和date分组，

第二步，使用窗口函数，按照id分区，按日期排序，

第三步，使用date_sub函数让date向前移动cum天

第四步，按照id,result_date分组，求出result_date的个数，并取出result_date>3的个数即为连续登陆3天以上的用户

具体代码：

select
       result_date,
       count(result_date) cnt_3
     (select
             date,
             date_sub(date,interval cum day ) as result_date
          (select
                  date(date) as date,
                  row_number() over (partition by id order by date(date)) as cum
               test.demo01
          group by
                   date(date)
          ) as date_table) as t1





    
group by
         result_date
having count(result_date) >= 3;

（补）某互联网公司面试题

问题：一张用户日活信息表，用户某一天的用户info为空，需要用该用户的最近一天的info去补，请问该如何实现

表结构：

dt         uid     info
20210811   001    1
20210811   002    0
20210811   003    1
20210811   004    1
20210812   001    null
20210812   002    0
20210812   003    null
20210813   002    1
20210813   003    0
20210813   004    1
20210814   001    0
20210814   003    1
20210814   004    null

比如说001用户12号的数据为null，就需要拿10号或者14号的数据去补，而不是拿15号

20210810   001    1      
20210812   001    null  
20210814   001    1  
20210815   001    1  
20210816   001    0  
20210817   001    null     
20210818   001    null  





    
20210819   001    null

思路：先做笛卡尔乘积，再取时间差的绝对值的最小值，再分组排序，取第一个

具体测试数据：

测试代码：

select
    ,info as info_null
from (
                  select dt1
                       , uid
                       , info_null
                       , info
                       , min(dt_diff)                                                   as min_dt_diff
                       , row_number() over (partition by dt1,uid order by min(dt_diff)) as rank_num
                  from (
                           select a.dt                      as dt1
                                , b.dt                      as dt2
                                , a.uid                     as uid
                                , a.info                    as info_null
                                , b.info                    as info
                                , abs(datediff(b.dt, a.dt)) as dt_diff
                               -- 计算出为空的数据
                                   select dt
                                        , uid
                                        , info
                                        , row_number() over (partition by uid order by dt) as rank_num
                                   from user_login_info
                                   where info is null
                                   left join
                               -- 计算出不为空的数据
                                       select dt
                                            , uid
                                            , info
                                            , row_number() over (partition by uid order by dt) as rank_num
                                       from user_login_info
                                       where info is not null
                                   ) as b
                               on a.uid = b.uid
                           group by a.dt
                                  , b.dt
                                  , a.uid
                                  , a.info
                                  , b.info
                       ) as t
                  group by dt1
                         , uid
                         , info_null
                         , info
              ) as abt
where rank_num = 1

运行结果：

但是我这里的写法可能不太好，因为用户日活数据表数据量可能比较大，在做笛卡尔积的时候会性能较低。

不需要编写代码，也能成为Hive SQL面试高手？ChatGPT告诉你...

当你面对 Hive SQL 面试时，不仅需要掌握 SQL 语言的基本知识，还需要熟练掌握 Hive SQL 的一些高级特性，比如窗口函数、分区等等。对于初学者而言，写出高效的 Hive SQL 代码往往是一件困难的事情，而这恰恰是面试官最为看重的。但是，你不必担心！现在，有一种神奇的工具——ChatGPT，可以帮助你快速生成 Hive SQL 代码，解决你在面试中遇到的各种难题。本文将会介绍如何使用 ChatGPT 生成 Hive SQL 代码，让你在面试中轻松成为 Hive SQL 面试高手，无需编写代码也能毫不费力地完成面试题。让我们一起来看看吧！

求职指南！给数据开发的SQL面试准备路径！⛵

大量的数据科学职位需要精通 SQL，它也是数据分析师、数据科学家、数据建模岗最常考核的面试技能。在本篇内容中 ShowMeAI 将梳理汇总所有面试 SQL 问题，按照不同的主题构建练习专项块。

面试现场！月薪3w+的这些数据挖掘SQL面试题你都掌握了吗？ ⛵

真实面试题分享！本文基于餐饮业数据，使用SQL分析挖掘客户的就餐模式、点餐花费和菜品喜好等，以提升餐厅的业务经营水平。

[-] .\Navicat-Cracker NavicatCrackerDlg.cpp:332 -3All patch solutions are suppressed. Patch abort!HI