## I. Building a Spam Classifier

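• Collect as much data as possible: a honeypot project does exactly this. It disguises itself as a machine that looks highly attractive to attackers in order to lure them into attacking, much as a honey pot attracts bees, and records their behavior and techniques.
• Add more features: for example, the sender's email address can serve as a feature, and so can punctuation (spam is often full of eye-catching marks such as "?" and "!").
• Preprocess the samples: spammers keep upgrading their techniques as filters improve, for instance by deliberately misspelling words so that the content does not trip the filter, such as writing "medicine" as "med1cine". We therefore need a way to recognize these misspellings so that the samples fed into logistic regression are cleaner.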
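The feature and preprocessing ideas above can be sketched in a few lines of Python. This is a minimal illustration, not a production filter: the substitution table `LEET_MAP`, the helper names, and the sample email are all assumptions made for the example, and a blanket digit-to-letter mapping would also mangle genuine numbers.

```python
import re

# Hypothetical table of common character swaps used to disguise words; not exhaustive.
LEET_MAP = str.maketrans(
    {"1": "i", "0": "o", "3": "e", "4": "a", "5": "s", "@": "a", "$": "s"}
)


def normalize_token(token: str) -> str:
    """Lowercase a token and undo simple digit-for-letter substitutions."""
    return token.lower().translate(LEET_MAP)


def extract_features(sender: str, body: str) -> dict:
    """Build a small feature dictionary from the sender address and message body."""
    tokens = re.findall(r"[A-Za-z0-9@$]+", body)
    return {
        # Domain part of the sender address.
        "sender_domain": sender.rsplit("@", 1)[-1].lower(),
        # Counts of attention-grabbing punctuation.
        "num_exclamations": body.count("!"),
        "num_question_marks": body.count("?"),
        # Normalized tokens, so "med1cine" and "medicine" map to the same word.
        "tokens": [normalize_token(t) for t in tokens],
    }


features = extract_features("deals@example.com", "Buy med1cine now!!! Why wait?")
print(features)
```

In a real system the normalized tokens would then be mapped onto a vocabulary to form the input vector for logistic regression.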

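1. Build a simple machine learning system first, implemented quickly with a simple algorithm.

2. Plot learning curves and examine the errors to find out whether the algorithm suffers from high bias or high variance, then decide whether adding more training data, more features, and so on would improve it.

3. Error analysis. For example, when building a spam classifier, inspect which types of email or which feature values consistently cause misclassification, and correct for them. A numerical error metric also matters: reporting a single figure such as the error rate lets us judge whether a change made the algorithm better or worse.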
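Step 3 can be made concrete with a small sketch. The category labels and the `cv_results` records below are invented for illustration; in practice the categories would come from manually inspecting misclassified cross-validation emails.

```python
from collections import Counter

# Hypothetical cross-validation results: (email category, true label, predicted label).
cv_results = [
    ("pharma", 1, 1),
    ("pharma", 1, 0),
    ("phishing", 1, 0),
    ("phishing", 1, 0),
    ("newsletter", 0, 1),
    ("personal", 0, 0),
    ("personal", 0, 0),
    ("replica", 1, 1),
]


def error_rate(results):
    """Fraction of examples whose prediction differs from the true label."""
    wrong = sum(1 for _, y, y_hat in results if y != y_hat)
    return wrong / len(results)


def errors_by_category(results):
    """Count misclassified examples per category to show where to focus."""
    return Counter(cat for cat, y, y_hat in results if y != y_hat)


print("error rate:", error_rate(cv_results))
print("errors by category:", errors_by_category(cv_results).most_common())
```

The single error-rate number makes it easy to compare two versions of the classifier, while the per-category counts suggest which kind of email deserves new features first.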

## II. Handling Skewed Data

$\text{Precision}=\frac{\text{True positive}}{\text{Predicted as positive}}=\frac{\text{True positive}}{\text{True positive}+\text{False positive}}$

$\text{Recall}=\frac{\text{True positive}}{\text{Actual positive}}=\frac{\text{True positive}}{\text{True positive}+\text{False negative}}$

$F_1\;\text{Score} = 2\frac{PR}{P+R}$

Here P denotes precision and R denotes recall.
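As a minimal sketch, the three formulas translate directly into a helper function. The function name and the choice to return 0.0 when a denominator is zero are assumptions of this example; the formulas themselves leave those cases undefined.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Compute precision, recall and F1 from confusion-matrix counts.

    Returns 0.0 for any quantity whose denominator is zero (a convention
    chosen for this sketch, not part of the definitions).
    """
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative counts only: 8 true positives, 2 false positives, 8 false negatives.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Because F1 is a harmonic mean, it stays low whenever either precision or recall is low, which is why it is preferred over accuracy on skewed classes.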

## III. Using Large Data Sets

It's not who has the best algorithm that wins. It's who has the most data.

## IV. Machine Learning System Design Quiz

### 1. Question 1

You are working on a spam classification system using regularized logistic regression. "Spam" is a positive class (y = 1) and "not spam" is the negative class (y = 0). You have trained your classifier and there are m = 1000 examples in the cross-validation set. The chart of predicted class vs. actual class is:

| | Actual Class: 1 | Actual Class: 0 |
|---|---|---|
| Predicted Class: 1 | 85 | 890 |
| Predicted Class: 0 | 15 | 10 |

For reference:

• Accuracy = (true positives + true negatives) / (total examples)
• Precision = (true positives) / (true positives + false positives)
• Recall = (true positives) / (true positives + false negatives)
• F1 score = (2 * precision * recall) / (precision + recall)

What is the classifier's F1 score (as a value from 0 to 1)?

Enter your answer in the box below. If necessary, provide at least two values after the decimal point.
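Reading the chart as true positives = 85, false positives = 890, false negatives = 15 and true negatives = 10, the reference formulas can be worked through as follows (values rounded):

$P=\frac{85}{85+890}=\frac{85}{975}\approx 0.087$

$R=\frac{85}{85+15}=\frac{85}{100}=0.85$

$F_1=\frac{2PR}{P+R}=\frac{2\cdot 85}{2\cdot 85+890+15}=\frac{170}{1075}\approx 0.158$

So the F1 score is about 0.16, dragged down by the very low precision even though recall is high.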

### 2. Question 2

Suppose a massive dataset is available for training a learning algorithm. Training on a lot of data is likely to give good performance when two of the following conditions hold true.

Which are the two?

A. When we are willing to include high order polynomial features of x (such as $x_{1}^{2}$, $x_{2}^{2}$, $x_{1}x_{2}$, etc.).

B. The features x contain sufficient information to predict y accurately. (For example, one way to verify this is if a human expert on the domain can confidently predict y when given only x).

C. We train a learning algorithm with a small number of parameters (that is thus unlikely to overfit).

D. We train a learning algorithm with a large number of parameters (that is able to learn/represent fairly complex functions).

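A. Incorrect. What matters is that the features carry enough information, not that they are high-order.
B. Correct. The features contain enough information to predict y accurately.
C. Incorrect. A model with only a few parameters is too simple to benefit from a massive dataset.
D. Correct. The model needs enough parameters to represent fairly complex functions.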

### 3. Question 3

Suppose you have trained a logistic regression classifier which is outputting $h_{\theta}(x)$.

Currently, you predict 1 if $h_{\theta}(x)\geqslant threshold$, and predict 0 if $h_{\theta}(x)<threshold$, where currently the threshold is set to 0.5.

Suppose you decrease the threshold to 0.3. Which of the following are true? Check all that apply.

A. The classifier is likely to have unchanged precision and recall, but higher accuracy.

B. The classifier is likely to now have higher precision.

C. The classifier is likely to now have higher recall.

D. The classifier is likely to have unchanged precision and recall, but lower accuracy.
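The effect of moving the threshold can be checked with a small sketch. The scores and labels below are made-up values standing in for $h_{\theta}(x)$ and y on a validation set; the point is only that a lower threshold marks more examples as positive, so recall cannot go down, while precision usually falls.

```python
# Hypothetical predicted probabilities h_theta(x) and true labels y.
scores = [0.95, 0.80, 0.60, 0.45, 0.40, 0.35, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 0, 1, 0]


def precision_recall(scores, labels, threshold):
    """Predict 1 when score >= threshold, then compute precision and recall."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


for t in (0.5, 0.3):
    p, r = precision_recall(scores, labels, t)
    print(f"threshold={t}: precision={p:.2f}, recall={r:.2f}")
```

On this toy data, recall rises from 0.50 to 0.75 when the threshold drops from 0.5 to 0.3, while precision falls from about 0.67 to 0.50.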

### 4. Question 4

Suppose you are working on a spam classifier, where spam emails are positive examples (y=1) and non-spam emails are negative examples (y=0). You have a training set of emails in which 99% of the emails are non-spam and the other 1% is spam. Which of the following statements are true? Check all that apply.

A. If you always predict non-spam (output y=0), your classifier will have 99% accuracy on the training set, but it will do much worse on the cross validation set because it has overfit the training data.

B. If you always predict non-spam (output y=0), your classifier will have 99% accuracy on the training set, and it will likely perform similarly on the cross validation set.

C. A good classifier should have both a high precision and high recall on the cross validation set.

D. If you always predict non-spam (output y=0), your classifier will have an accuracy of 99%.

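A. Incorrect. The statement claims accuracy drops on the cross validation set because of overfitting, but this is not an overfitting problem; it is a skewed-class problem.
B. Correct. If accuracy is 99% on the training set, it is very likely to be about 99% on the cross validation set as well, because the data is split randomly and the two sets have similar class distributions.
C. Correct. A good classifier should have both high precision and high recall.
D. Correct. If every email is predicted as non-spam, accuracy reaches 99%.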
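A short sketch shows why 99% accuracy says little here. The dataset below is synthetic (1000 emails, 1% spam, chosen to match the proportions in the question), and the "classifier" simply predicts 0 for everything.

```python
# Synthetic skewed dataset: 10 spam (y = 1) and 990 non-spam (y = 0) emails.
m = 1000
labels = [1] * 10 + [0] * 990

# Trivial classifier that always predicts non-spam.
preds = [0] * m

accuracy = sum(1 for p, y in zip(preds, labels) if p == y) / m
tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
recall = tp / (tp + fn)

print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")
```

Accuracy comes out at 0.99 while recall is 0, since not a single spam email is caught; precision is undefined because nothing is predicted positive.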

### 5. Question 5

Which of the following statements are true? Check all that apply.

A. On skewed datasets (e.g., when there are more positive examples than negative examples), accuracy is not a good measure of performance and you should instead use F1 score based on the precision and recall.

B. If your model is underfitting the training set, then obtaining more data is likely to help.

C. After training a logistic regression classifier, you must use 0.5 as your threshold for predicting whether an example is positive or negative.

D. It is a good idea to spend a lot of time collecting a large amount of data before building your first version of a learning algorithm.

E. Using a very large training set makes it unlikely for model to overfit the training data.

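A. Correct. On skewed data, use the F1 score rather than accuracy to measure performance.
B. Incorrect. A model that underfits the training set has high bias, and adding more training data does not help.
C. Incorrect. The threshold does not have to be 0.5.
D. Incorrect. Spending a lot of time collecting data before building the first version of a learning algorithm may well turn out to be wasted effort.
E. Correct. Using more training data can mitigate overfitting.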

GitHub Repo：Halfrost-Field

Follow: halfrost · GitHub
