在SQL中,窗函数(window function)或分析函数(analytic function)[1]是一个函数,它使用来自一行或多行的值来为每一行返回一个值。 与之形成对比,聚合函数为多行返回单个值。窗口函数有一个OVER子句;任何没有OVER子句的函数都不是窗口函数,而是聚合函数或单行(标量)函数。[2]
例子1
编辑例如,这里有一个查询,它使用一个窗口函数来比较每个员工的工资与其部门的平均工资(来自PostgreSQL文档的示例):[3]
SELECT depname, empno, salary, avg(salary) OVER (PARTITION BY depname) FROM empsalary;
输出:
depname | empno | salary | avg ----------+-------+--------+---------------------- develop | 11 | 5200 | 5020.0000000000000000 develop | 7 | 4200 | 5020.0000000000000000 develop | 9 | 4500 | 5020.0000000000000000 develop | 8 | 6000 | 5020.0000000000000000 develop | 10 | 5200 | 5020.0000000000000000 personnel | 5 | 3500 | 3700.0000000000000000 personnel | 2 | 3900 | 3700.0000000000000000 sales | 3 | 4800 | 4866.6666666666666667 sales | 1 | 5000 | 4866.6666666666666667 sales | 4 | 4800 | 4866.6666666666666667 (10 rows)
PARTITION BY子句将行分组,并且该函数分别应用于每个分组。 如果PARTITION BY子句被省略(例如如果我们有一个空的OVER()子句),那么整个结果集被视为单个分组。[4]对于此查询,报告的平均工资将是所有行的平均值。
语法
编辑根据PostgreSQL文档,窗函数具有下列之一的语法:[4]
function_name ([expression [, expression ... ]]) OVER window_name
function_name ([expression [, expression ... ]]) OVER ( window_definition )
function_name ( * ) OVER window_name
function_name ( * ) OVER ( window_definition )
其中function_name包括:
- 开窗函数
- 序号函数
- ROW_NUMBER:顺序排序——1、2、3
- RANK:并列排序,跳过重复序号——1、1、3
- DENSE_RANK:并列排序,不跳过重复序号——1、1、2
- 分布函数
- PERCENT_RANK
- CUME_DIST
- 前驱后继函数
- LAG(expr,n):返回当前行的前n行的expr的值
- LEAD(expr,n):返回当前行的后n行的expr的值
- 头尾函数
- FIRST_VALUE(expr)
- LAST_VALUE(expr)
- 其他函数
- NTH_VALUE(expr,n):返回第n行的expr值
- NTILE(n):将有序数据分为n个桶,记录等级数
- 序号函数
- 所有聚合函数
其中 window_definition具有语法:
[ existing_window_name ]
[ PARTITION BY expression [, ...] ]
[ ORDER BY expression [ ASC | DESC | USING operator ] [ NULLS { FIRST | LAST } ] [, ...] ]
[ frame_clause ]
frame_clause具有下列之一的语法:
{ RANGE | ROWS | GROUPS } frame_start [ frame_exclusion ]
{ RANGE | ROWS | GROUPS } BETWEEN frame_start AND frame_end [ frame_exclusion ]
frame_start与frame_end可以是UNBOUNDED PRECEDING, offset PRECEDING, CURRENT ROW, offset FOLLOWING, UNBOUNDED FOLLOWING. frame_exclusion可以是EXCLUDE CURRENT ROW, EXCLUDE GROUP, EXCLUDE TIES, EXCLUDE NO OTHERS.
expression指不包含窗函数调用的任何表达式。
注:
- 方括号[]指可选子句
- 圆括号{}指多种选项的集合,选项之间以竖杠|分割
以下为各个语法项的举例说明:
WITH t AS (
SELECT * FROM (VALUES
('A', DATE '2024-01-01', 10),
('A', DATE '2024-01-02', 10),
('A', DATE '2024-01-02', 30),
('A', DATE '2024-01-03', 20),
('B', DATE '2024-01-01', 5),
('B', DATE '2024-01-02', 7),
('B', DATE '2024-01-02', 7),
('B', DATE '2024-01-04', 1)
) AS v(dept, dt, amt)
)
SELECT
dept, dt, amt,
SUM(amt) OVER w AS running_sum, -- 复用窗口名
AVG(amt) OVER w AS running_avg -- 复用窗口名
FROM t
WINDOW w AS (PARTITION BY dept); -- 定义窗口名
WITH t AS (...) -- 具体定义见前例
SELECT
dept, dt, amt,
SUM(amt) OVER w1 AS sum_by_dept_date,
SUM(amt) OVER w2 AS sum_by_dept_date_amt
FROM t
WINDOW
w1 AS (PARTITION BY dept ),
w2 AS (w1 ORDER BY dt, amt); -- 继承已有窗口w1,再扩展 order by
WITH t AS (
SELECT * FROM (VALUES
('A', DATE '2024-01-01', 10),
('A', NULL::date, 99),
('A', DATE '2024-01-02', 20)
) v(dept, dt, amt)
)
SELECT
dept, dt, amt,
ROW_NUMBER() OVER (PARTITION BY dept ORDER BY dt NULLS LAST) AS null_last, -- NULLS FIRST / NULLS LAST(处理排序中 NULL 的位置)
ROW_NUMBER() OVER (PARTITION BY dept ORDER BY dt NULLS FIRST) AS null_first
FROM t;
WITH t AS (SELECT * FROM (VALUES ('b'),('a'),('c')) v(x))
SELECT x,
ROW_NUMBER() OVER (ORDER BY x USING <) AS rn -- USING operator(PostgreSQL:用特定比较运算符排序,较少用)
-- 一般直接 ORDER BY x 就够,USING 多用于自定义类型或特殊排序语义。
FROM t;
frame子句,定义了窗口“帧”,决定聚合函数在每行看到哪些行。多数数据库对 ORDER BY + 聚合有默认frame(PostgreSQL默认是RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)。frame可由ROWS、RANGES、GROUPS3种方式定义。
WITH t AS (...)
SELECT
dept, dt, amt,
SUM(amt) OVER (
PARTITION BY dept
ORDER BY dt, amt
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW -- 从分区开头到当前行,按“物理行”定义帧
) AS running_sum_rows
FROM t;
WITH t AS (...)
SELECT
dept, dt, amt,
AVG(amt) OVER (
PARTITION BY dept
ORDER BY dt, amt
ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING -- 滑动窗口:前1行到后1行
) AS moving_avg_3rows
FROM t;
WITH t AS (...)
SELECT
dept, dt, amt,
SUM(amt) OVER (
PARTITION BY dept
ORDER BY dt, amt
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING -- 从当前行到分区末尾
) AS suffix_sum
FROM t;
RANGE:按“排序键的值的范围”定义帧(会把ties一起算进去)。排序键值与当前行相同的行叫做tie。RANGE 的 offset 类型必须与排序键可相加/可比较(日期用 interval;数值用数字)。
WITH t AS (...)
SELECT
dept, dt, amt,
SUM(amt) OVER (
PARTITION BY dept
ORDER BY dt
RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS running_sum_range
FROM t;
WITH t AS (...)
SELECT
dept, dt, amt,
SUM(amt) OVER (
PARTITION BY dept
ORDER BY dt
RANGE BETWEEN INTERVAL '1 day' PRECEDING AND CURRENT ROW -- 按值范围:昨天和今天
) AS sum_last_2days
FROM t;
group:在ORDER BY下排序键相同的行构成一组(ties 组)。
WITH t AS (...)
SELECT
dept, dt, amt,
SUM(amt) OVER (
PARTITION BY dept
ORDER BY dt
GROUPS BETWEEN 1 PRECEDING AND CURRENT ROW -- 按“组”滑动,当前组以及前一组
) AS sum_prev_group_and_curr_group
FROM t;
frame_exclusion:EXCLUDE ...(把某些行从帧中排除)
SUM(amt) OVER (
PARTITION BY dept ORDER BY dt
RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
EXCLUDE NO OTHERS -- 默认:不排除
)
WITH t AS (...)
SELECT
dept, dt, amt,
SUM(amt) OVER (
PARTITION BY dept
ORDER BY dt, amt
ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
EXCLUDE CURRENT ROW -- 排除当前行。也即:包含3行的滑窗里把当前行剔除,只加左右邻居。
) AS sum_neighbors_only
FROM t;
WITH t AS (...)
SELECT
dept, dt, amt,
SUM(amt) OVER (
PARTITION BY dept
ORDER BY dt
RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
EXCLUDE GROUP -- 排除当前 peer group:即与当前行 ORDER BY 值相同的所有行。解释:累计到当前 dt,但把当前 dt 这组(同 dt 的所有行)排除。
) AS sum_before_current_date_group
FROM t;
WITH t AS (...)
SELECT
dept, dt, amt,
SUM(amt) OVER (
PARTITION BY dept
ORDER BY dt
RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
EXCLUDE TIES -- 排除同组里除当前行外的 ties。解释:如果当前 dt 有多行,只把“同 dt 的其它行”剔除,但当前行保留。
) AS sum_excluding_peers_except_self
FROM t;
例子2
编辑窗函数允许访问恰在当前记录之前和之后的记录的数据。[5][6][7][8] 一个窗函数定义当前行周围具有给定行数的帧或窗,并跨窗对数据集执行计算。[9][10]
NAME |
------------
Aaron| <-- Preceding (unbounded)
Andrew|
Amelia|
James|
Jill|
Johnny| <-- 1st preceding row
Michael| <-- Current row
Nick| <-- 1st following row
Ophelia|
Zach| <-- Following (unbounded)
在上表中,下一个查询为每一行提取具有前一行和后一行的窗口的值w:
SELECT
LAG(name, 1)
OVER(ORDER BY name) "prev",
name,
LEAD(name, 1)
OVER(ORDER BY name) "next"
FROM people
ORDER BY name
查询结果为:
| PREV | NAME | NEXT | |----------|----------|----------| | (null)| Aaron| Andrew| | Aaron| Andrew| Amelia| | Andrew| Amelia| James| | Amelia| James| Jill| | James| Jill| Johnny| | Jill| Johnny| Michael| | Johnny| Michael| Nick| | Michael| Nick| Ophelia| | Nick| Ophelia| Zach| | Ophelia| Zach| (null)|
例子3
编辑# 首先创建虚拟的业务员销售数据
CREATE TABLE Sales
(
idate date,
iname char(2),
sales int
);
# 向表中插入数据
INSERT INTO Sales VALUES
('2021/1/1', '丁一', 200),
('2021/2/1', '丁一', 180),
('2021/2/1', '李四', 100),
('2021/3/1', '李四', 150),
('2021/2/1', '刘猛', 180),
('2021/3/1', '刘猛', 150),
('2021/1/1', '王二', 200),
('2021/2/1', '王二', 180),
('2021/3/1', '王二', 300),
('2021/1/1', '张三', 300),
('2021/2/1', '张三', 280),
('2021/3/1', '张三', 280);
# 查询各月中销售业绩最差的业务员
SELECT * FROM
(SELECT month(idate),iname,sales,
ROW_NUMBER() OVER(PARTITION BY month(idate) ORDER BY sales) as sales_order
FROM Sales) as t
WHERE sales_order=1;
例子4
编辑求用户连续登录天数
# 首先创建虚拟的用户登录表,并插入数据
create table user_login
(
user_id varchar(100),
login_time datetime
);
insert into user_login values
(1,'2020-11-25 13:21:12'),
(1,'2020-11-24 13:15:22'),
(1,'2020-11-24 10:30:15'),
(1,'2020-11-24 09:18:27'),
(1,'2020-11-23 07:43:54'),
(1,'2020-11-10 09:48:36'),
(1,'2020-11-09 03:30:22'),
(1,'2020-11-01 15:28:29'),
(1,'2020-10-31 09:37:45'),
(2,'2020-11-25 13:54:40'),
(2,'2020-11-24 13:22:32'),
(2,'2020-11-23 10:55:52'),
(2,'2020-11-22 06:30:09'),
(2,'2020-11-21 08:33:15'),
(2,'2020-11-20 05:38:18'),
(2,'2020-11-19 09:21:42'),
(2,'2020-11-02 00:19:38'),
(2,'2020-11-01 09:03:11'),
(2,'2020-10-31 07:44:55'),
(2,'2020-10-30 08:56:33'),
(2,'2020-10-29 09:30:28');
# 第一种情况:查看每位用户连续登陆的情况。包括每位用户连续登录的情况、查看每位用户最大连续登录的天数、查看在某个时间段里连续登录天数超过N天的用户
select user_id,
min(login_date) start_date,
max(login_date) end_date,
count(login_date) days # 计算每位用户连续登录天数
from (select *,date_sub(login_date, interval irank day) idate # 增加辅助列,以判断用户是否连续登录
from (select *,rank() over(partition by user_id order by login_date) irank # 对每个用户的登录日期排序
from (select distinct user_id, date(login_time) login_date from user_login
# 数据预处理:由于统计的窗口期是天数,对登录时间字段格式转换为日期然后去重(去掉用户一天内多次登录的情况)
) as a
) as b
) as c
group by user_id,idate;
# 第二种情况:计算每个用户最大连续登录天数
# 第三种情况:查看给定时间段内连续登录天数≥5天的用户
# 找出相差天数为5的记录
select distinct user_id
from (select *,datediff(idate5,login_date)+1 as days # 计算当前登录日期与之后第4次登陆的日期的差值
from (select *,lead(login_date,4) over(partition by user_id order by login_date) idate5 # 获取当前登录日期之后第4次登陆的日期
from user_logrin_date)
as a)
as b
where days = 5;
历史
编辑SQL:2003引入了窗函数,其后的标准扩展了其功能。[11]
MySQL从8.0开始引入了窗函数。目前支持的语法为:
# 开窗函数语法
func_name(<parameter>) OVER([PARTITION BY <part_by_condition>] [ORDER BY <order_by_list> ASC|DESC])
参考文献
编辑- ^ 1.0 1.1 Analytic function concepts in Standard SQL | BigQuery. Google Cloud. [2021-03-23]. (原始内容存档于2022-03-24) (英语).
- ^ Window Functions. sqlite.org. [2021-03-23]. (原始内容存档于2022-11-17).
- ^ 3.5. Window Functions. PostgreSQL Documentation. 2021-02-11 [2021-03-23]. (原始内容存档于2022-10-31) (英语).
- ^ 4.0 4.1 4.2. Value Expressions. PostgreSQL Documentation. 2021-02-11 [2021-03-23]. (原始内容存档于2022-10-10) (英语).
- ^ Leis, Viktor; Kundhikanjana, Kan; Kemper, Alfons; Neumann, Thomas. Efficient Processing of Window Functions in Analytical SQL Queries. Proc. VLDB Endow. June 2015, 8 (10): 1058–1069. ISSN 2150-8097. doi:10.14778/2794367.2794375.
- ^ Cao, Yu; Chan, Chee-Yong; Li, Jie; Tan, Kian-Lee. Optimization of Analytic Window Functions. Proc. VLDB Endow. July 2012, 5 (11): 1244–1255. ISSN 2150-8097. arXiv:1208.0086 . doi:10.14778/2350229.2350243.
- ^ Probably the Coolest SQL Feature: Window Functions. Java, SQL and jOOQ. 2013-11-03 [2017-09-26]. (原始内容存档于2021-06-24) (美国英语).
- ^ Window Functions in SQL - Simple Talk. Simple Talk. 2013-10-31 [2017-09-26]. (原始内容存档于2021-09-22) (美国英语).
- ^ SQL Window Functions Introduction. Apache Drill. (原始内容存档于2022-10-10).
- ^ PostgreSQL: Documentation: Window Functions. www.postgresql.org. [2020-04-04]. (原始内容存档于2022-11-02) (英语).
- ^ Window Functions Overview. MariaDB KnowledgeBase. [2021-03-23]. (原始内容存档于2022-10-15).