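A way to describe any graphic with a single grammar was born!
A chart is not a single standalone entity; a statistical graphic is defined by the following basic grammar components:
Statement | Description |
---|---|
DATA | Data operations that create visual encodings from a dataset |
TRANS | Transformations of the visual encodings (e.g. rank) |
SCALE | Scale transformations (e.g. log) |
COORD | Defines the coordinate system (e.g. polar) |
ELEMENT | Graphics (e.g. points) and their visual attributes (e.g. color) |
GUIDE | Guides (e.g. legend) |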
Environment | Implementation |
---|---|
R | ggplot2 |
JSON | Vega |
Tableau | VizQL |
JavaScript | G2 |
Python | plotnine/Bokeh |
Plot = Data (the dataset) + Aesthetics (aesthetic mappings) + Geometry (geometric objects)
Suppose we have the following data:
| | City | Region | Price | Volume | Sales |
---|---|---|---|---|---|
0 | Beijing | North | 11 | 8.04 | 88.44 |
1 | Shanghai | East | 8 | 6.95 | 55.60 |
2 | Guangzhou | South | 13 | 7.58 | 98.54 |
3 | Shenzhen | South | 8 | 8.81 | 70.48 |
4 | Tianjin | South | 11 | 9.33 | 102.63 |
5 | Chongqing | North | 14 | 9.96 | 139.44 |
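The plotting snippets that follow refer to a DataFrame named `data`, which the text never defines. A minimal sketch that rebuilds it from the table above with pandas; the values are taken from the table, while the construction itself is an assumption about how the original notebook created it:

```python
import pandas as pd

# Rebuild the table above as a DataFrame. The variable name `data`
# matches the name used by the plotting snippets that follow.
data = pd.DataFrame({
    "City": ["Beijing", "Shanghai", "Guangzhou", "Shenzhen", "Tianjin", "Chongqing"],
    "Region": ["North", "East", "South", "South", "South", "North"],
    "Price": [11, 8, 13, 8, 11, 14],
    "Volume": [8.04, 6.95, 7.58, 8.81, 9.33, 9.96],
    "Sales": [88.44, 55.60, 98.54, 70.48, 102.63, 139.44],
})

# Sanity check: in this table Sales equals Price * Volume.
max_abs_diff = (data["Price"] * data["Volume"] - data["Sales"]).abs().max()
print(data)
print("largest |Price * Volume - Sales|:", max_abs_diff)
```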
from plotnine import ggplot, aes
from plotnine.geoms import *
ggplot(data, aes(x='Price', y='Sales')) + geom_point()
<ggplot: (149469665196)>
We map Region to color:
ggplot(data, aes(x='Price', y='Sales', color='Region')) + geom_point()
<ggplot: (149468623954)>
We map Volume to size:
(ggplot(data, aes(x='Price', y='Sales',
color='Region', size='Volume'))
+ geom_point()
)
<ggplot: (149462919075)>
We facet the plot by Region:
from plotnine.facets import *
ggplot(data, aes(x='Price', y='Sales',
color='Region', size='Volume')) +\
geom_point() +\
facet_wrap('Region')
<ggplot: (149461057034)>
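- Every geometric object has a default statistical transformation;
- Every statistical transformation has a default geometric object.
For each price level, compute the mean of the sales: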
from plotnine.stats import *
import numpy as np
ggplot(data, aes(x='Price', y='Sales')) +\
stat_summary(fun_y=np.mean, geom='bar')
<ggplot: (149466799397)>
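To see which numbers the summary bars above represent, the same per-price mean can be reproduced with a pandas group-by. This is only a cross-check on the Price and Sales values from the table, not a description of how plotnine computes the statistic internally:

```python
import pandas as pd

# Same Price and Sales values as in the table above.
sales = pd.DataFrame({
    "Price": [11, 8, 13, 8, 11, 14],
    "Sales": [88.44, 55.60, 98.54, 70.48, 102.63, 139.44],
})

# One mean per distinct Price: these are the bar heights requested by
# stat_summary(fun_y=np.mean, geom='bar').
mean_sales = sales.groupby("Price")["Sales"].mean()
print(mean_sales)
```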
plotnine's coordinate-system support is not yet complete.
Flip the x and y coordinates:
from plotnine.coords import *
ggplot(data, aes(x='Price', y='Sales',
color='Region', size='Volume')) +\
geom_point() +\
facet_wrap('Region') +\
coord_flip()
<ggplot: (149468624511)>
Draw the plot in the xkcd style (xkcd is a science-themed webcomic):
from plotnine.themes import *
ggplot(data, aes(x='Price', y='Sales',
color='Region', size='Volume')) +\
geom_point() +\
facet_wrap('Region') +\
theme_xkcd()
<ggplot: (149466845154)>
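Component | Description |
---|---|
Data | The data to plot (a DataFrame) |
Aesthetics | Mapping of data to visual properties |
Geometries | Geometric shapes used to represent the data |
Facets | Split the data into groups and draw subplots |
Statistics | Derive new data through statistical computation |
Coordinates | Transform the space in which the data is drawn |
Theme | Customize all non-data elements |
"+" | Stacks layers on top of one another |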
- Based on the ggplot2 package
- Built on matplotlib
- Works well with pandas
- Render unto God what is God's, and unto Caesar what is Caesar's
%matplotlib inline
import plotnine as p9
import pandas as pd
# Import plotnine's plotting functions
from plotnine import *
# Import the datasets bundled with plotnine
from plotnine.data import *
p9.options.figure_size = (9, 4.5)
surveys_complete = pd.read_csv('../data/surveys.csv')
surveys_complete = surveys_complete.dropna()
(p9.ggplot(data=surveys_complete))
<ggplot: (149464795844)>
aes() maps variables in the data to elements of the plot:
(p9.ggplot(data=surveys_complete,
mapping=p9.aes(x='weight', y='hindfoot_length')))
<ggplot: (149462844013)>
Common mapping options in aes() include x, y, color, fill, and size, all of which are used in this document.
A ggplot object can be stored in a variable:
# Create
surveys_plot = p9.ggplot(
data=surveys_complete,
mapping=p9.aes(x='weight', y='hindfoot_length'))
# Draw the plot
surveys_plot + p9.geom_point()
<ggplot: (149469674406)>
Create a bar chart from the plot_id column of surveys_complete. The stat parameter determines the height of the bars:
(p9.ggplot(data=surveys_complete,
mapping=p9.aes(x='plot_id'))
+ p9.geom_bar(stat='count')
)
<ggplot: (149468567227)>
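With stat='count', each bar's height is the number of rows that share the same x value. The same numbers can be sketched with pandas; the tiny `toy` frame below is a hypothetical stand-in for surveys_complete, not data from the survey:

```python
import pandas as pd

# Hypothetical miniature of surveys_complete: only plot_id matters here.
toy = pd.DataFrame({"plot_id": [1, 1, 2, 3, 3, 3]})

# Count the rows per plot_id, which is what stat='count' asks for.
counts = toy["plot_id"].value_counts().sort_index()
print(counts.to_dict())  # {1: 2, 2: 1, 3: 3}
```

With stat='identity', by contrast, the y values in the data are used as given instead of being counted.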
(p9.ggplot(data=surveys_complete,
mapping=p9.aes(x='plot_id', y='weight'))
+ p9.geom_bar(stat='identity')
)
<ggplot: (149466039016)>
import matplotlib.pyplot as plt
plt.style.use('ggplot')
# Sum only the weight column, so non-numeric columns are not aggregated
surveys_complete.groupby('plot_id')[['weight']].sum().plot(kind='bar', y='weight', figsize=(10, 7))
plt.ylabel("weight")
plt.show()
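- Arguments passed to ggplot() are global and visible to every geom layer
- Each geom layer can also set its own aes()
- Plotting with plotnine is an iterative process
- data, aes, and geom are the basic elements
(p9.ggplot(data=surveys_complete,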
mapping=p9.aes(x='weight',
y='hindfoot_length'))
+ p9.geom_point()
)
<ggplot: (149468412805)>
# Adjust the transparency
(p9.ggplot(data=surveys_complete,
mapping=p9.aes(x='weight', y='hindfoot_length'))
+ p9.geom_point(alpha=0.1)
)
<ggplot: (149460941948)>
# Set the color of all points
(p9.ggplot(data=surveys_complete,
mapping=p9.aes(x='weight', y='hindfoot_length'))
+ p9.geom_point(alpha=0.1, color='green')
)
<ggplot: (149463554121)>
# Map `species_id` to color
(p9.ggplot(data=surveys_complete,
mapping=p9.aes(x='weight',
y='hindfoot_length'))
+ p9.geom_point(alpha=0.1, mapping=p9.aes(color='species_id'))
)
<ggplot: (149462884220)>
Use labs(name=value) to set the plot title (title), subtitle (subtitle), axis labels (x, y), and so on. The aesthetic options of the labels can be set as well.
Function | Description |
---|---|
scale_x_log10() | Puts the x axis on a log10 scale |
scale_x_reverse() | Reverses the direction of the x axis |
scale_x_sqrt() | Puts the x axis on a square-root scale |
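As a rough illustration of what these transformed scales do to positions along the axis, the underlying functions can be applied by hand with NumPy. The sample values are made up, and this only mimics the spacing, not plotnine's break or label handling:

```python
import numpy as np

# Made-up values that each differ by a factor of ten.
weights = np.array([1.0, 10.0, 100.0, 1000.0])

# log10 turns equal ratios into equal steps along the axis.
log_positions = np.log10(weights)
# The square root compresses large values less aggressively than log10.
sqrt_positions = np.sqrt(weights)

print(np.diff(log_positions))
print(sqrt_positions)
```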
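Function | Description |
---|---|
scale_*_continuous() | Maps continuous values |
scale_*_discrete() | Maps discrete values |
scale_*_identity() | Uses the data values directly as the aesthetic values, with no mapping |
scale_*_manual(values = ()) | Maps discrete values to manually chosen values |
scale_*_date(date_labels = "%m/%d", date_breaks = "2 weeks") | Treats the data as dates |
scale_*_datetime() | Treats the data as date-times |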
p9.options.figure_size = (9, 6)
# Change the x-axis label
(p9.ggplot(data=surveys_complete,
mapping=p9.aes(x='weight',
y='hindfoot_length'))
+ p9.geom_point(alpha=0.1,
mapping=p9.aes(color='species_id'))
+ p9.xlab("Weight (g)")
)
<ggplot: (149460848457)>
# Use a logarithmic axis
(p9.ggplot(data=surveys_complete,
mapping=p9.aes(x='weight',
y='hindfoot_length'))
+ p9.geom_point(alpha=0.1, mapping=p9.aes(color='species_id'))
+ p9.scale_x_log10()
)
<ggplot: (149469660657)>
Building on the earlier bar chart, fill each bar with two different colors in proportion to the sexes.
(p9.ggplot(data=surveys_complete,
mapping=p9.aes(x='plot_id',
fill='sex'))
+ p9.geom_bar()
+ p9.scale_fill_manual(["blue", "orange"])
)
<ggplot: (149460700651)>
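The heights of the colored segments in such a stacked bar chart are the row counts per plot_id and sex. A sketch of that table with pandas, using a hypothetical `toy` frame instead of surveys_complete:

```python
import pandas as pd

# Hypothetical miniature of surveys_complete.
toy = pd.DataFrame({
    "plot_id": [1, 1, 1, 2, 2],
    "sex": ["F", "M", "M", "F", "F"],
})

# Rows are bars (plot_id), columns are fill colors (sex),
# and each cell is the height of one colored segment.
segments = pd.crosstab(toy["plot_id"], toy["sex"])
print(segments)
```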
For each species_id, look at the distribution of weight:
(p9.ggplot(data=surveys_complete,
mapping=p9.aes(x='species_id',
y='weight'))
+ p9.geom_boxplot()
)
<ggplot: (149469665259)>
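A box plot summarizes each group by its quartiles. Roughly the same summary can be sketched with pandas on a hypothetical `toy` frame; the exact quartile method used for the drawn boxes may differ slightly from pandas' default interpolation:

```python
import pandas as pd

# Hypothetical weights for two species.
toy = pd.DataFrame({
    "species_id": ["DM"] * 5 + ["DO"] * 5,
    "weight": [38.0, 40.0, 42.0, 44.0, 46.0,
               45.0, 47.0, 49.0, 51.0, 53.0],
})

# Lower quartile, median, and upper quartile per species:
# roughly the bottom, middle line, and top of each box.
summary = (
    toy.groupby("species_id")["weight"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
print(summary)
```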
(p9.ggplot(data=surveys_complete,
mapping=p9.aes(x='species_id',
y='weight'))
+ p9.geom_jitter(alpha=0.2)  # jitter the points to reduce overlap
+ p9.geom_boxplot(alpha=0, outlier_color = "red")
)
<ggplot: (149463529707)>
Plot the probability density of weight for each species_id:
(p9.ggplot(data=surveys_complete,
mapping=p9.aes(x='weight',
fill='species_id'
))
+ p9.geom_density(alpha=0.2)
)
<ggplot: (149463630574)>
(p9.ggplot(data=surveys_complete,
mapping=p9.aes(x='species_id',
y='weight',
color='factor(plot_id)'))
+ p9.geom_jitter(alpha=0.3)
+ p9.geom_violin(alpha=0, color="0.7")
+ p9.scale_y_log10()
)
<ggplot: (149463691102)>
For each species_id, count the records in each year and plot the result.
# Aggregate by year and species_id
yearly_counts = surveys_complete.groupby(['year', 'species_id'])['species_id'].count()
# Reset the index
yearly_counts = yearly_counts.reset_index(name='counts')
yearly_counts.head()
| | year | species_id | counts |
---|---|---|---|
0 | 1977 | DM | 181 |
1 | 1977 | DO | 12 |
2 | 1977 | DS | 29 |
3 | 1977 | OL | 1 |
4 | 1977 | OX | 2 |
(p9.ggplot(data=yearly_counts,
mapping=p9.aes(x='year',
y='counts',
color='species_id'))
+ p9.geom_line()
)
<ggplot: (149460896193)>
There are two ways to facet a plot: the facet_wrap function and the facet_grid function.
# Based on the earlier example
(p9.ggplot(data=surveys_complete,
mapping=p9.aes(x='weight',
y='hindfoot_length',
color='species_id'))
+ p9.geom_point(alpha=0.1)
)
<ggplot: (149466015091)>
# Split into two subplots by sex
(p9.ggplot(data=surveys_complete,
mapping=p9.aes(x='weight',
y='hindfoot_length',
color='species_id'))
+ p9.geom_point(alpha=0.1)
+ p9.facet_wrap("sex")
)
<ggplot: (149468409004)>
(p9.ggplot(data=surveys_complete,
mapping=p9.aes(x='weight',
y='hindfoot_length',
color='species_id'))
+ p9.geom_point(alpha=0.1)
+ p9.facet_wrap("plot_id")
)
<ggplot: (149461355819)>
# only select the years of interest
survey_2000 = surveys_complete[surveys_complete["year"].isin([2000, 2001])]
(p9.ggplot(data=survey_2000,
mapping=p9.aes(x='weight',
y='hindfoot_length',
color='species_id'))
+ p9.geom_point(alpha=0.1)
+ p9.facet_grid("year ~ sex")
)
<ggplot: (149466095165)>
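With the formula "year ~ sex", facet_grid lays the panels out with one row per year and one column per sex. The layout can be sketched as a pandas table in which every cell stands for one panel; the `toy` frame is hypothetical:

```python
import pandas as pd

# Hypothetical miniature of survey_2000.
toy = pd.DataFrame({
    "year": [2000, 2000, 2001, 2001, 2001],
    "sex": ["F", "M", "F", "M", "M"],
})

# Rows correspond to the left side of the formula (year), columns to
# the right side (sex); each cell counts the points in that panel.
panels = toy.groupby(["year", "sex"]).size().unstack("sex")
print(panels)
```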
For each sex, draw a subplot showing how the mean weight changes over time.
yearly_weight = surveys_complete.groupby(['year', 'sex'])['weight'].mean().reset_index()
(p9.ggplot(data=yearly_weight,
mapping=p9.aes(x='year',
y='weight'))
+ p9.geom_line()
+ p9.facet_wrap("sex")
)
<ggplot: (149463597692)>
Building on the plot above, compare the trends of the different species_id values.
yearly_weight = surveys_complete.groupby(['year', 'species_id', 'sex'])['weight'].mean().reset_index()
(p9.ggplot(data=yearly_weight,
mapping=p9.aes(x='year',
y='weight',
color='species_id'))
+ p9.geom_line()
+ p9.facet_wrap('sex')
)
<ggplot: (149462516104)>
Treat year as a categorical variable and count the records in each year:
(p9.ggplot(data=surveys_complete,
mapping=p9.aes(x='factor(year)'))
+ p9.geom_bar()
)
<ggplot: (149464795784)>
(p9.ggplot(data=surveys_complete,
mapping=p9.aes(x='factor(year)'))
+ p9.geom_bar()
+ p9.theme_bw()
+ p9.theme(axis_text_x = p9.element_text(angle=90))
)
<ggplot: (149463144060)>
my_custom_theme = p9.theme(axis_text_x = p9.element_text(color="grey", size=10,
angle=90, hjust=.5),
axis_text_y = p9.element_text(color="grey", size=10))
(p9.ggplot(data=surveys_complete,
mapping=p9.aes(x='factor(year)'))
+ p9.geom_bar()
+ my_custom_theme
)
<ggplot: (149463541435)>
my_plot = (p9.ggplot(data=surveys_complete,
mapping=p9.aes(x='weight', y='hindfoot_length'))
+ p9.geom_point()
)
my_plot.save("scatterplot.png", width=4, height=2, dpi=300)
from PIL import Image
im = Image.open('scatterplot.png')
im
C:\ProgramData\Anaconda3\envs\study\lib\site-packages\plotnine\ggplot.py:727: PlotnineWarning: Saving 4 x 2 in image.
C:\ProgramData\Anaconda3\envs\study\lib\site-packages\plotnine\ggplot.py:730: PlotnineWarning: Filename: scatterplot.png
In qplot, data specifies the dataset and geom specifies the geometric object(s) to draw:
surveys_plot = p9.qplot(x=surveys_complete['weight'], y=surveys_complete['hindfoot_length'])
surveys_plot
<ggplot: (149460691288)>
surveys_plot = p9.qplot(data=surveys_complete,
x='weight', y='hindfoot_length')
surveys_plot
<ggplot: (149460999503)>
surveys_plot = p9.qplot(data=surveys_complete,
x='weight', y='hindfoot_length',
color='weight')
surveys_plot
<ggplot: (149463573194)>
surveys_plot = p9.qplot(data=surveys_complete,
x='weight', y='hindfoot_length',
geom = ["point", "bin2d"])
surveys_plot
<ggplot: (149463874130)>
Limitations: