0. Statistics Concepts

Created 2024-09-08 | Updated 2024-09-09 | math


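Basic concepts

population: the complete collection of all data (every element) under study in an experiment
sample: a subset of the population, made up of the elements that were actually observed
outlier: a data point in a sample that differs markedly from the rest of the data

Simple random sample (SRS)

An SRS of size n is a sample of n elements drawn so that every element of the population is equally likely to be selected (more precisely, every possible group of n elements is equally likely to be the sample).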
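A minimal Python sketch of drawing an SRS, assuming the population fits in a list. The helper name `simple_random_sample`, the population of integers 1..100, and the seed are all made up for illustration; `random.sample` draws without replacement.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct elements so that every size-n subset is equally likely."""
    rng = random.Random(seed)          # the seed only makes the demo reproducible
    return rng.sample(population, n)   # sampling without replacement

population = list(range(1, 101))       # hypothetical population: integers 1..100
sample = simple_random_sample(population, n=10, seed=0)
print(sample)
```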

Common formulas

Sample mean

The arithmetic average of the sample values:

\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}

Deviation

The signed difference between a sample value and the sample mean:

x_i - \bar{x}
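As a quick check of the two formulas so far, here is a small Python sketch on a made-up sample. The data values are an assumption chosen so the arithmetic is easy to follow by hand. Note that the deviations always sum to zero, since the mean is the balance point of the data.

```python
data = [2.0, 4.0, 4.0, 5.0, 10.0]        # made-up sample

n = len(data)
x_bar = sum(data) / n                     # sample mean
deviations = [x - x_bar for x in data]    # x_i - x_bar, one per observation

print(x_bar)             # 5.0
print(deviations)        # [-3.0, -1.0, -1.0, 0.0, 5.0]
print(sum(deviations))   # 0.0
```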

Sample variance

The sum of the squared deviations from the sample mean, divided by n - 1:

s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}

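Why n - 1?

The slides say it works better in practice: using n as the denominator tends to underestimate the population variance. The reason is that deviations are measured from \bar{x}, which is computed from the same sample and therefore sits closer to the data than the true population mean does; dividing by n - 1 compensates for this.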
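One way to see the underestimate without any randomness is to enumerate every possible sample from a tiny population and average the two candidate estimators. The sketch below assumes sampling with replacement (so draws are independent) from a made-up population of four values; the names `sum_sq_dev`, `avg_div_n`, and `avg_div_n_minus_1` are illustrative only.

```python
from itertools import product

population = [1, 2, 3, 4]                        # made-up population
mu = sum(population) / len(population)
# Population variance: average squared distance from the true mean.
sigma2 = sum((x - mu) ** 2 for x in population) / len(population)

def sum_sq_dev(sample):
    """Sum of squared deviations from the sample's own mean."""
    x_bar = sum(sample) / len(sample)
    return sum((x - x_bar) ** 2 for x in sample)

n = 2
samples = list(product(population, repeat=n))    # all 16 equally likely ordered samples
avg_div_n = sum(sum_sq_dev(s) / n for s in samples) / len(samples)
avg_div_n_minus_1 = sum(sum_sq_dev(s) / (n - 1) for s in samples) / len(samples)

print(sigma2, avg_div_n, avg_div_n_minus_1)      # 1.25 0.625 1.25
```

Averaged over all samples, the n - 1 version recovers the population variance, while the n version comes out too small by a factor of (n - 1)/n.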

Sample standard deviation

The square root of the sample variance:

s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}
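A short Python sketch that computes both quantities by hand on a made-up sample and cross-checks them against the standard library, whose `statistics.variance` and `statistics.stdev` also use the n - 1 denominator.

```python
import math
import statistics

data = [2.0, 4.0, 4.0, 5.0, 10.0]   # made-up sample

n = len(data)
x_bar = sum(data) / n
s2 = sum((x - x_bar) ** 2 for x in data) / (n - 1)   # sample variance
s = math.sqrt(s2)                                     # sample standard deviation

# The standard library uses the same n - 1 denominator.
assert math.isclose(s2, statistics.variance(data))
assert math.isclose(s, statistics.stdev(data))
print(s2, s)   # 9.0 3.0
```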

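Sample median

When outliers are present, the median is often a better summary than the mean, because it is barely affected by them.
Sort the data by size and take the middle value:

If n is odd, take the value at position (n+1)/2
If n is even, take the average of the values at positions n/2 and n/2 + 1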
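The odd/even rule above can be sketched in Python as follows. The function name `sample_median` is made up; the positions in the rule are 1-indexed, so they shift by one when used as Python list indices.

```python
def sample_median(data):
    xs = sorted(data)
    n = len(xs)
    mid = n // 2
    if n % 2 == 1:
        return xs[mid]                    # 1-indexed position (n+1)/2
    return (xs[mid - 1] + xs[mid]) / 2    # 1-indexed positions n/2 and n/2 + 1

print(sample_median([3, 1, 2]))            # 2
print(sample_median([4, 1, 3, 2]))         # 2.5
print(sample_median([1, 2, 3, 4, 1000]))   # 3, the outlier has no effect
```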

Quartiles

Three cut points Q1, Q2, Q3 that split the sorted data into four equal parts

Q1: the value at position $0.25(n+1)$
Q2: the median
Q3: the value at position $0.75(n+1)$
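A Python sketch of the position rule. The notes do not say what to do when $0.25(n+1)$ is not a whole number, so this sketch assumes linear interpolation between the two neighboring values (other courses average the neighbors instead). The helper names `value_at_position` and `quartiles` are made up, and the sample is assumed non-empty.

```python
def value_at_position(sorted_xs, pos):
    """Value at a 1-indexed, possibly fractional position (linear interpolation)."""
    n = len(sorted_xs)
    pos = min(max(pos, 1), n)     # clamp to a valid position for tiny samples
    lower = int(pos)              # 1-indexed position at or just below pos
    frac = pos - lower
    if frac == 0:
        return sorted_xs[lower - 1]
    # Assumption: interpolate linearly between the two neighboring values.
    return sorted_xs[lower - 1] + frac * (sorted_xs[lower] - sorted_xs[lower - 1])

def quartiles(data):
    xs = sorted(data)
    n = len(xs)
    return tuple(value_at_position(xs, q * (n + 1)) for q in (0.25, 0.5, 0.75))

print(quartiles(range(1, 8)))   # (2, 4, 6): positions 2, 4, 6 are whole numbers
print(quartiles(range(1, 9)))   # (2.25, 4.5, 6.75): positions 2.25, 4.5, 6.75
```

With this convention Q2 at position $0.5(n+1)$ agrees with the median rule for both odd and even n.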

Interquartile range

IQR := Q3 - Q1

A measure of the spread of a sample, computed from the quartiles
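For a quick computation, Python's `statistics.quantiles` (Python 3.8+) can supply the quartiles. Its `"exclusive"` method places cut points at (n+1)-based positions with linear interpolation, which is assumed here to match the position rule in these notes; the sample is made up.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8]   # made-up sample
q1, q2, q3 = statistics.quantiles(data, n=4, method="exclusive")
iqr = q3 - q1

print(q1, q3, iqr)   # 2.25 6.75 4.5
```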

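Robustness

A statistic is called robust if it is not (or only slightly) affected by outliers

The median is more robust than the mean because it depends only on the values at the center of the sorted data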
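A small Python illustration of the difference, using a made-up sample in which one value is replaced by an outlier. The variable names are illustrative only.

```python
import statistics

clean = [10, 11, 12, 13, 14]            # made-up sample
with_outlier = [10, 11, 12, 13, 140]    # same sample with one corrupted value

mean_shift = statistics.mean(with_outlier) - statistics.mean(clean)
median_shift = statistics.median(with_outlier) - statistics.median(clean)

# The mean moves by about 25.2, the median does not move at all.
print(mean_shift, median_shift)
```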

Plots

Stem-and-leaf plot
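A stem-and-leaf plot splits each value into a stem (the leading digits) and a leaf (the last digit), then lists the leaves next to their stem. The sketch below assumes non-negative integers with the last digit as the leaf; the function name `stem_and_leaf` and the data are made up.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem to its sorted leaves (non-negative integers, leaf = last digit)."""
    plot = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        plot[stem].append(leaf)
    return dict(plot)

data = [12, 15, 21, 23, 23, 28, 34, 40]          # made-up sample
plot = stem_and_leaf(data)
for stem in range(min(plot), max(plot) + 1):     # keep empty stems so gaps stay visible
    leaves = "".join(str(leaf) for leaf in plot.get(stem, []))
    print(f"{stem} | {leaves}")
# 1 | 25
# 2 | 1338
# 3 | 4
# 4 | 0
```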

Describing the distribution of data

Shape:
1. Modes (peaks): unimodal (one peak), bimodal (two peaks), multimodal (several peaks)
2. Symmetry: symmetric, skewed left, skewed right

Author: Yuchen You (Wesley)

Link: http://example.com/2024/09/07/wesley_knowledge_repo/math/possibility_statistics_stats412/0.%20%E7%BB%9F%E8%AE%A1%E5%AD%A6%E6%A6%82%E5%BF%B5/

Copyright Notice: All articles in this blog are licensed under CC BY-NC-SA 4.0 unless otherwise stated.
