EMATM0061: Statistical Computing and Empirical Methods

Assignment 2

EMATM0061: Statistical Computing and Empirical Methods, TB1, 2024

Introduction

Create an R Markdown document for the assignment

First, it is recommended that you create a single R Markdown document to include your solutions, with headings created by heading codes such as “## 1.1 (Q1)”, “## 3 (Q1)”, etc.

It is a good practice to use R Markdown to organise your code and results. You can start with the template called Assignment02_TEMPLATE.Rmd which can be downloaded via Blackboard.

In Section 1, you will need to use R programming to complete the tasks. In Sections 2 and 3, you are not required to write R code.

You can optionally hand in this assignment by 13:00 Tuesday 1 October. This will help us understand your work but will not count towards your final grade. If you want to hand in the assignment, please submit a PDF file containing your answers (click on “Assignment 02” under the assignment tab on Blackboard to upload the file). There is no requirement on how the PDF file is generated. One option is to choose PDF as the R Markdown output format (which may require LaTeX to be installed on your computer). Another option is to choose HTML output in R Markdown and convert the HTML file into a PDF file. If you have multiple PDF files, please combine them into a single PDF file before submission.

Load packages

Then we need to load two packages, namely Stat2Data and tidyverse, before answering the questions. If they haven’t been installed on your computer, please use install.packages() to install them first.

1.      Load the tidyverse package:

library(tidyverse)

2.      Load the Stat2Data package and then the dataset Hawks:

library(Stat2Data)

data("Hawks")
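The install-if-missing step described above can be sketched as follows. This is only a minimal sketch: the helper name `missing_packages` is hypothetical and not part of the assignment template, and the install line is left commented out so that nothing is installed by accident.

```r
# Hypothetical helper: return the names of packages that are not yet installed.
missing_packages <- function(pkgs) {
  available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!available]
}

to_install <- missing_packages(c("tidyverse", "Stat2Data"))
print(to_install)

# Uncomment to install whatever is missing:
# if (length(to_install) > 0) install.packages(to_install)
```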

1. Data Wrangling

This part is mainly about data wrangling. Basic concepts of data wrangling can be found in Lecture 4.

1.1 Select and filter

(Q1). Use a combination of the select() and filter() functions to generate a data frame called “hSF” which is a sub-table of the original Hawks data frame, such that

1.     Your data frame should include the columns:

a)     “Wing”

b)     “Weight”

c)     “Tail”

2.     Your data frame should contain a row for every hawk such that:

a)     They belong to the species of Red-Tailed hawks

b)     They have a weight of at least 1 kg.

3.     Use the pipe operator “%>%” to simplify your code. The first rows of the data frame should look like this:

## Wing Weight Tail

## 1 412 1090 230

## 2 412 1210 210

## 3 405 1120 238

## 4 393 1010 222

## 5 371 1010 217
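As a generic illustration of how filter() and select() combine with the pipe, here is a minimal sketch on a made-up data frame. The data frame `pets` and all of its columns and values are hypothetical and unrelated to the Hawks data.

```r
library(dplyr)

# Toy data (hypothetical values, for illustration only)
pets <- data.frame(
  kind      = c("dog", "cat", "dog", "rabbit"),
  height_cm = c(60, 25, 45, 20),
  mass_g    = c(30000, 4000, 12000, 1500),
  age       = c(5, 3, 2, 1)
)

heavy_dogs <- pets %>%
  filter(kind == "dog", mass_g >= 20000) %>%  # keep rows meeting both conditions
  select(height_cm, mass_g)                   # then keep only these columns

print(heavy_dogs)
```

Conditions separated by commas inside filter() are combined with a logical AND; ncol() and nrow() report the number of columns and rows of the result.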

(Q2) How many variables does the data frame hSF have? What would you say to communicate this information to a Machine Learning practitioner?

How many examples does the data frame hSF have? How many observations? How many cases?

1.2 The arrange function

(Q1) Use the arrange() function to sort the hSF data frame created in the previous section so that the rows appear in order of increasing wingspan.

Then use the head() function to print out the top five rows of your sorted data frame. Your results should look something like this:

## Wing Weight Tail

## 1 37.2 1180 210

## 2 111.0 1340 226

## 3 199.0 1290 222

## 4 241.0 1320 235

## 5 262.0 1020 200
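For reference, a minimal sketch of sorting with arrange() and previewing with head(), again on a hypothetical toy data frame (`scores` and its values are invented for illustration):

```r
library(dplyr)

# Toy data (hypothetical values)
scores <- data.frame(
  name  = c("a", "b", "c", "d"),
  score = c(3.5, 1.2, 4.8, 2.0)
)

ascending  <- scores %>% arrange(score)        # smallest score first
descending <- scores %>% arrange(desc(score))  # largest score first

print(head(ascending, 3))  # first three rows of the sorted table
```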

1.3 Join and rename functions

The species of hawks within the data frame “Hawks” are indicated via a two-letter code (e.g., RT, CH, SS). The correspondence between these codes and the full names is given by the following data frame.

##   species_code species_name_full

## 1           CH          Cooper's

## 2           RT        Red-tailed

## 3           SS     Sharp-shinned

(Q1). Use data.frame() to create a data frame that is called hawkSpeciesNameCodes and is the same as the above data frame (i.e., containing the correspondence between codes and the full species names).

(Q2). Use a combination of the left_join(), rename() and select() functions to create a new data frame called “hawksFullName” which is the same as the “Hawks” data frame except that the Species column contains the full names rather than the two-letter codes.

(Q3). Use a combination of the head() and select() functions to print out the top seven rows of the columns “Species”, “Wing” and “Weight” of the data frame called hawksFullName. Do this without modifying the data frame you just created. Your result should look something like this:

##         Species Wing Weight

## 1    Red-tailed  385    920

## 2    Red-tailed  376    930

## 3    Red-tailed  381    990

## 4      Cooper's  265    470

## 5 Sharp-shinned  205    170

## 6    Red-tailed  412   1090

## 7    Red-tailed  370    960

Does it matter what type of join function you use here? In what situations would it make a difference?
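To make the difference between join types concrete, here is a small sketch on hypothetical tables. The names `orders` and `lookup`, and their contents, are invented; one code in `orders` has no match in `lookup`, and one entry in `lookup` is never used.

```r
library(dplyr)

# Toy tables (hypothetical values)
orders <- data.frame(id = 1:4, code = c("A", "B", "A", "Z"))
lookup <- data.frame(code = c("A", "B", "C"),
                     full_name = c("Alpha", "Beta", "Gamma"))

left  <- left_join(orders, lookup, by = "code")   # keeps all 4 orders; "Z" gets NA
inner <- inner_join(orders, lookup, by = "code")  # drops the unmatched "Z" row
full  <- full_join(orders, lookup, by = "code")   # also adds the unused "C" row

# Replace the short code by the full name, keeping the original column name
renamed <- left %>%
  select(-code) %>%
  rename(code = full_name)

print(renamed)
```

When every key on the left has exactly one match on the right and no right-hand key goes unused, the three joins return the same rows; they differ only when some keys are unmatched.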

1.4 The mutate function

Suppose that the fictitious “Healthy Hawks Society” has proposed a new measure called the “bird BMI” which attempts to measure the mass of a hawk standardized by its wingspan. The “bird BMI” is equal to the weight of the hawk (in grams) divided by its wingspan (in millimeters) squared. That is,

Bird-BMI := 1000 × Weight / Wing-span^2.

(Q1). Use the mutate(), select() and arrange() functions to create a new data frame called “hawksWithBMI” which has the same number of rows as the original Hawks data frame but only two columns: one with their Species and one with their “bird BMI”. Also, arrange the rows in descending order of “bird BMI”. The top 8 rows of your data frame should look something like this:

## Species bird_BMI

## 1 RT 852.69973

## 2 RT 108.75741

## 3 RT 32.57493

## 4 RT 22.72688

## 5 CH 22.40818

## 6 RT 19.54932

## 7 CH 15.21998

## 8 RT 14.85927
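The mutate-then-select-then-arrange pattern can be sketched on an unrelated toy example. The data frame `people` and its values are hypothetical, and the ratio computed here is the ordinary human BMI, not the bird BMI defined above.

```r
library(dplyr)

# Toy data (hypothetical values)
people <- data.frame(
  name     = c("p1", "p2", "p3"),
  mass_kg  = c(70, 90, 55),
  height_m = c(1.75, 1.80, 1.60)
)

ranked <- people %>%
  mutate(bmi = mass_kg / height_m^2) %>%  # add a derived column
  select(name, bmi) %>%                   # keep only two columns
  arrange(desc(bmi))                      # largest value first

print(ranked)
```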

1.5 Summarize and group-by functions

Use the data frame “hawksFullName” from Section 1.3 above to do the following tasks:

(Q1). Using a combination of the summarize() and group_by() functions, create a summary table, broken down by hawk species, which contains the following summary quantities:

1.     The number of rows (num_rows);

2.     The average wingspan in centimeters (mn_wing);

3.     The median wingspan in centimeters (md_wing);

4.     The trimmed average wingspan in centimeters with trim=0.1, i.e., the mean of the numbers after the 10% largest and the 10% smallest values have been removed (t_mn_wing);

5.     The biggest ratio between wingspan and tail length (b_wt_ratio).

Hint: type ?summarize to see a list of useful functions (mean, sum, etc.) that can be used to compute the summary quantities. Your final result should look something like this:

## # A tibble: 3 × 6

##   Species       num_rows mn_wing md_wing t_mn_wing b_wt_ratio

##                                

## 1 Cooper's            70    244.     240      243.       1.67

## 2 Red-tailed         577    383.     384      385.       3.16

## 3 Sharp-shinned      261    185.     191      184.       1.67

(Q2). Next, create a summary table that shows the number of missing values, broken down by species, for the columns Wing, Weight, Culmen, Hallux, Tail, StandardTail, Tarsus, and Crop. You can complete this task by combining the select(), group_by(), summarize(), across(), everything(), sum() and is.na() functions. You should end up with a summary table of the following form:

## # A tibble: 3 × 9

##   Species        Wing Weight Culmen Hallux  Tail StandardTail Tarsus  Crop

##

## 1 Cooper's          1      0      0      0     0           19     62    21

## 2 Red-tailed        0      5      4      3     0          250    538   254

## 3 Sharp-shinned     0      5      3      3     0           68    233    68
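The grouped-summary functions named in this section can be sketched on a small made-up table. The data frame `df`, its columns `group`, `a` and `b`, and all values are hypothetical; whether to pass na.rm = TRUE is a choice made for this toy example only.

```r
library(dplyr)

# Toy data with some missing values (hypothetical)
df <- data.frame(
  group = c("x", "x", "x", "y", "y"),
  a     = c(1, 2, NA, 4, 10),
  b     = c(2, NA, 3, 5, 2)
)

# Several summary quantities per group
summary_tbl <- df %>%
  group_by(group) %>%
  summarize(
    n         = n(),                                  # rows per group
    mean_a    = mean(a, na.rm = TRUE),                # ordinary mean
    median_a  = median(a, na.rm = TRUE),              # median
    trimmed_a = mean(a, trim = 0.1, na.rm = TRUE),    # trimmed mean
    max_ratio = max(a / b, na.rm = TRUE)              # largest ratio a / b
  )

# Number of missing values in every non-grouping column, per group
na_tbl <- df %>%
  group_by(group) %>%
  summarize(across(everything(), ~ sum(is.na(.x))))

print(summary_tbl)
print(na_tbl)
```

Inside summarize(), across(everything(), ...) applies the function to all columns except the grouping column.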

2. Random experiments, events and sample spaces, and set theory

In this exercise, we will learn about random experiments, events and sample spaces and set theory that were introduced in Lecture 5.

In this section, you are not required to compute your results using R code. If you want to write math formulas in R Markdown, the document called “Assignment_R    MarkdownMathformulasandSymbolsExamples.rmd” (available under the “resource list” tab on the Blackboard course webpage) provides a list of examples for your reference.
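As a rough illustration of what a displayed formula looks like in R Markdown source (this is a generic sketch, not taken from the referenced examples file), set-theoretic definitions can be typed between double dollar signs:

```latex
$$
A \cup B := \{\omega \in \Omega : \omega \in A \text{ or } \omega \in B\},
\qquad
A^{c} := \Omega \setminus A .
$$
```

Inline formulas use single dollar signs instead, for example `$A \cap B = \emptyset$`.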

2.1 Random experiments, events and sample spaces

(Q1) Firstly, write down the definitions of a random experiment, an event and a sample space. This question aims to help you recall the basic concepts before completing the subsequent tasks.

(Q2) Consider a random experiment of rolling a die twice. Give an example of an event in this random experiment. Also, can you write down the sample space as a set? What is the total number of different events in this experiment? Is the empty set considered an event?
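Although no R code is required in this section, it can help to see a sample space enumerated explicitly. The sketch below uses a different, smaller experiment (tossing a coin twice); the object names `omega` and `at_least_one_head` are hypothetical.

```r
# Sample space of two coin tosses: every ordered pair of outcomes
omega <- expand.grid(first = c("H", "T"),
                     second = c("H", "T"),
                     stringsAsFactors = FALSE)
print(omega)

# An event is a subset of the sample space, e.g. "at least one head"
at_least_one_head <- omega[omega$first == "H" | omega$second == "H", ]
print(at_least_one_head)
```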

2.2 Set theory

Remember that a set is just a collection of objects. All that matters for the identity of a set is the objects it contains. In particular, the elements within the set are unordered, so for example the set {1, 2, 3} is exactly the same as the set {3, 2, 1}. In addition, since sets are just collections of objects, each object can only be either included or excluded, and multiplicities do not change the nature of the set. In particular, the set {1, 2, 2, 2, 3, 3} is exactly the same as the set {1, 2, 3}. In general there is no concept of “position” within a set, unlike a vector or matrix.

(Q1) Set operations:

Let the sets A, B, C be defined by A := {1, 2, 3}, B := {2, 4, 6}, C := {4, 5, 6}.

1.     What are the unions A ∪ B and A ∪ C?

2.     What are the intersections A ∩ B and A ∩ C?

3.     What are the complements A ∖ B and A ∖ C?

4.     Are A and B disjoint? Are A and C disjoint?

5.     Are B and A ∖ B disjoint?

6.     Write down an arbitrary partition of {1,2,3,4,5,6} consisting of two sets. Also, write down another partition of {1,2,3,4,5,6} consisting of three sets.
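Base R provides set operations on vectors, which can be used to sanity-check answers worked out by hand. The sketch below uses two hypothetical sets X and Y that are deliberately different from A, B and C above.

```r
X <- c(1, 3, 5, 7)
Y <- c(3, 4, 7)

union(X, Y)      # 1 3 5 7 4
intersect(X, Y)  # 3 7
setdiff(X, Y)    # 1 5  (elements of X that are not in Y)

# Two sets are disjoint when their intersection is empty
are_disjoint <- length(intersect(X, Y)) == 0
print(are_disjoint)

# Order and multiplicity do not matter for set equality
same_set <- setequal(c(1, 2, 3), c(3, 2, 1, 2))
print(same_set)
```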

(Q2) Complements, subsets and De Morgan’s laws

Let Ω be a sample space. Recall that for an event A ⊆ Ω the complement is A^c := Ω ∖ A := {w ∈ Ω : w ∉ A}. Take a pair of events A ⊆ Ω and B ⊆ Ω.

1.      Can you give an expression for (A^c)^c without using the notion of a complement?

2.     What is Ω^c?

3.      (Subsets) Show that if A ⊆ B, then B^c ⊆ A^c.

4.      (De Morgan’s laws) Show that (A ∩ B)^c = A^c ∪ B^c. Let’s suppose we have a sequence of events A_1, A_2, ⋯, A_K ⊆ Ω. Can you write out an expression for (∩_{k=1}^{K} A_k)^c?

5.      (De Morgan’s laws) Show that (A ∪ B)^c = A^c ∩ B^c.

6.     Let’s suppose we have a sequence of events A_1, A_2, ⋯, A_K ⊆ Ω. Can you write out an expression for (∪_{k=1}^{K} A_k)^c?

(Q3) Cardinality and the set of all subsets:

Suppose that Ω = {w_1, w_2, ⋯, w_K} contains K elements for some natural number K. Here Ω has cardinality K.

Let E be the set of all subsets of Ω, i.e., E := {A | A ⊂ Ω}. Note that here E is a set. Give a formula for the cardinality of E in terms of K.

(Q4) Disjointness and partitions.

Suppose we have a sample space Ω, and events A_1, A_2, A_3, A_4 are subsets of Ω.

1.      Can you think of a set which is disjoint from every other set? That is, find a set A ⊆ Ω such that A ∩ B = ∅ for all B ⊆ Ω.

2.      Define events S_1 := A_1, S_2 = A_2 ∖ A_1, S_3 = A_3 ∖ (A_1 ∪ A_2), S_4 = A_4 ∖ (A_1 ∪ A_2 ∪ A_3). Show that S_1, S_2, S_3, S_4 form a partition of A_1 ∪ A_2 ∪ A_3 ∪ A_4.

(Q5) Indicator function.

Suppose we have a sample space Ω, and the event A is a subset of Ω. Let 1_A be the indicator function of A.

1.     Write down the indicator function 1_{A^c} of A^c (use 1_A in your formula).

2.      Can you find a set B whose indicator function is 1_{A^c} + 1_A?

3.      Recall that 1_{A∩B} = 1_A ⋅ 1_B and 1_{A∪B} = max(1_A, 1_B) = 1_A + 1_B − 1_A ⋅ 1_B for any A ⊆ Ω and B ⊆ Ω. Combining this with the conclusion from Question (Q5) 1, use indicator functions to prove (A ∩ B)^c = A^c ∪ B^c (De Morgan’s laws).
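The two identities recalled in item 3 can be checked numerically on a small finite example. This is only a sanity check under assumed toy sets (`omega`, `A`, `B` and the helper `indicator` are hypothetical), not a substitute for the proof asked for above.

```r
omega <- 1:8            # a small finite sample space (hypothetical)
A <- c(1, 2, 3, 4)
B <- c(3, 4, 5)

# Indicator of a set S, evaluated at every element of omega (values 0 or 1)
indicator <- function(S) as.integer(omega %in% S)

ind_A <- indicator(A)
ind_B <- indicator(B)

# 1_{A ∩ B} = 1_A * 1_B
check_intersection <- all(indicator(intersect(A, B)) == ind_A * ind_B)

# 1_{A ∪ B} = max(1_A, 1_B) = 1_A + 1_B - 1_A * 1_B
check_union_max <- all(indicator(union(A, B)) == pmax(ind_A, ind_B))
check_union_sum <- all(pmax(ind_A, ind_B) == ind_A + ind_B - ind_A * ind_B)

print(c(check_intersection, check_union_max, check_union_sum))
```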

(Q6) Uncountable infinities (this is an optional extra).

This is a challenging optional extra. You may want to return to this question once you have completed all other questions.

Show that the set of numbers Ω : = [0, 1] is uncountably infinite.

### Using the StatisticalOutlierRemoval filter in PCL

StatisticalOutlierRemoval is a statistics-based outlier removal method in PCL (the Point Cloud Library). It is particularly suited to removing sparse anomalous points from a point cloud (such as sensor noise, raindrops or dust)[^1][^2]. Its core idea is to analyse the distribution of each point's distances to its neighbours and to discard points that do not fit that distribution[^3][^5].

#### Core algorithm

1. **Neighbourhood distance**: for each point $p_i$, compute the mean distance $\mu_i$ to its $k$ nearest neighbours:

   $$ \mu_i = \frac{1}{k} \sum_{j=1}^{k} \| p_i - p_j \| $$

2. **Global statistics**: compute the mean $\mu$ and standard deviation $\sigma$ of the mean distances over all points:

   $$ \mu = \frac{1}{N} \sum_{i=1}^{N} \mu_i, \quad \sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (\mu_i - \mu)^2} $$

3. **Outlier decision**: set the threshold $\tau = \mu + \alpha \cdot \sigma$; a point is classified as an outlier when $\mu_i > \tau$ ($\alpha$ is the threshold multiplier)[^3][^5].

#### Usage steps (C++ example)

```cpp
#include <pcl/point_types.h>
#include <pcl/io/pcd_io.h>  // needed for pcl::io::savePCDFile
#include <pcl/filters/statistical_outlier_removal.h>

int main() {
    // 1. Create the input and output point clouds
    pcl::PointCloud<pcl::PointXYZ>::Ptr cloud(new pcl::PointCloud<pcl::PointXYZ>);
    pcl::PointCloud<pcl::PointXYZ>::Ptr filteredCloud(new pcl::PointCloud<pcl::PointXYZ>);

    // 2. Fill the point cloud (in practice, read it from a file)
    // cloud = ...

    // 3. Create the filter object
    pcl::StatisticalOutlierRemoval<pcl::PointXYZ> sor;

    // 4. Set the parameters
    sor.setInputCloud(cloud);       // input point cloud
    sor.setMeanK(50);               // number of neighbours (suggested: 30-100)
    sor.setStddevMulThresh(1.0);    // standard deviation multiplier (suggested: 1.0-2.0)
    sor.setNegative(false);         // false = keep inliers, true = keep outliers

    // 5. Run the filter
    sor.filter(*filteredCloud);

    // 6. Save the result
    pcl::io::savePCDFile("filtered.pcd", *filteredCloud);
    return 0;
}
```

#### Key parameters

| Parameter | Description | Recommended value |
|------|------|--------|
| `setMeanK()` | Number of neighbours $k$ | 30-100 (more neighbours give a more accurate estimate but take longer) |
| `setStddevMulThresh()` | Threshold multiplier $\alpha$ | 1.0-2.0 (smaller values filter more strictly) |
| `setNegative()` | Output mode | `false` = output inliers (default), `true` = output outliers |

#### Practical advice

1. **Parameter tuning**:
   - Initial values: $\alpha=1.5, k=50$
   - Inspect the result in a visualisation tool (such as CloudCompare) and adjust the parameters
   - The denser the point cloud, the larger $k$ can be

2. **Performance**:

   ```cpp
   sor.setKeepOrganized(false);  // can be faster for unorganized point clouds
   ```

3. **Multi-stage processing**:

   ```cpp
   // Requires #include <pcl/filters/voxel_grid.h>
   pcl::PointCloud<pcl::PointXYZ>::Ptr downsampledCloud(new pcl::PointCloud<pcl::PointXYZ>);

   // First downsample with a voxel grid filter
   pcl::VoxelGrid<pcl::PointXYZ> vg;
   vg.setInputCloud(cloud);
   vg.setLeafSize(0.01f, 0.01f, 0.01f);  // 10 mm voxels
   vg.filter(*downsampledCloud);

   // Then apply the statistical filter
   sor.setInputCloud(downsampledCloud);
   ```

#### Typical use cases

- Denoising raw point clouds captured by lidar or depth cameras
- Data preprocessing before 3D reconstruction
- Removing anomalous points in industrial inspection
- Data cleaning before point cloud registration

> **Note**: this method mainly targets **sparse outliers**. For dense noise (such as Gaussian noise), it is advisable to combine it with other filters (such as RadiusOutlierRemoval)[^1][^5].