10 Essential Machine Learning Interview Questions *

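Essential questions that the best machine learning engineers can answer. Driven by our community, we encourage experts to submit questions and offer feedback.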

Interview Questions

What is stratified cross-validation and when should we use it?

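Cross-validation is a technique for dividing data between training and validation sets. In typical cross-validation, this split is done randomly. But in stratified cross-validation, the split preserves the ratio of the categories in both the training and validation datasets.

For example, if we have a dataset with 10% of category A and 90% of category B, and we use stratified cross-validation, we will have the same proportions in training and validation. In contrast, if we use simple cross-validation, in the worst case we may find that there are no samples of category A in the validation set.

Stratified cross-validation may be applied in the following scenarios:

On a dataset with multiple categories. The smaller the dataset and the more imbalanced the categories, the more important it is to use stratified cross-validation.
On a dataset with data of different distributions. For example, in a dataset for autonomous driving, we may have images taken during the day and at night. If we do not ensure that both types are present in training and validation, we will have generalization problems.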
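
As a rough illustration of the idea, the sketch below builds stratified folds using only the Python standard library. The function name `stratified_k_fold` and the round-robin assignment are assumptions made for this example; in practice a library implementation such as scikit-learn's `StratifiedKFold` would normally be used instead.

```python
import random
from collections import Counter, defaultdict


def stratified_k_fold(labels, k, seed=0):
    """Yield (train_indices, validation_indices) pairs that preserve class ratios."""
    rng = random.Random(seed)

    # Group sample indices by their category.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    # Deal each category out round-robin so every fold gets its share.
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)

    for i in range(k):
        validation = sorted(folds[i])
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train, validation


# 10% category A and 90% category B, as in the example above.
labels = ["A"] * 10 + ["B"] * 90
for train, validation in stratified_k_fold(labels, k=5):
    print(Counter(labels[i] for i in validation))
```

With these labels, every validation fold contains 2 samples of category A and 18 of category B, so the 10/90 ratio is preserved in each split.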

Why do ensembles typically have higher scores than individual models?

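An ensemble is the combination of multiple models to create a single prediction. The key idea for making better predictions is that the models should make different errors. That way, the errors of one model will be compensated by the correct guesses of the other models, and the score of the ensemble will be higher.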

We need diverse models for creating an ensemble. Diversity can be achieved by:

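Using different ML algorithms. For example, you can combine logistic regression, k-nearest neighbors, and decision trees.
Using different subsets of the data for training. This is called bagging.
Giving a different weight to each sample of the training set. If this is done iteratively, weighting the samples according to the errors of the ensemble, it's called boosting.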

Many winning solutions to data science competitions are ensembles. However, in real-life machine learning projects, engineers need to find a balance between execution time and accuracy.
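
To make the "different errors" argument concrete, here is a small sketch of majority voting. The three prediction lists are made-up data for hypothetical models, chosen so that each model is wrong on different samples; real models rarely have errors this neatly separated.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-model predictions into one prediction per sample."""
    combined = []
    for votes in zip(*predictions_per_model):
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined


def accuracy(predictions, targets):
    """Fraction of predictions that match the targets."""
    return sum(p == t for p, t in zip(predictions, targets)) / len(targets)


targets = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]

# Three hypothetical models, each wrong on two different samples.
model_a = [0, 0, 1, 1, 0, 1, 0, 0, 1, 1]  # wrong on samples 0 and 9
model_b = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]  # wrong on samples 1 and 5
model_c = [1, 0, 0, 1, 0, 1, 0, 1, 1, 0]  # wrong on samples 2 and 7

ensemble = majority_vote([model_a, model_b, model_c])

for name, predictions in [("a", model_a), ("b", model_b), ("c", model_c), ("ensemble", ensemble)]:
    print(name, accuracy(predictions, targets))
```

Each individual model scores 0.8, while the ensemble scores 1.0 because no sample is misclassified by more than one model. If the models made the same mistakes, voting would not help.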

What is regularization? Can you give some examples of regularization techniques?

Regularization is any technique that aims to improve the validation score, sometimes at the cost of reducing the training score.

Some regularization techniques:

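L1 tries to minimize the absolute value of the parameters of the model. It produces sparse parameters.
L2 tries to minimize the squared value of the parameters of the model. It produces parameters with small values.
Dropout is a technique applied to neural networks that randomly sets some of the neurons' outputs to zero during training. This forces the network to learn better representations of the data by preventing complex interactions between the neurons: each neuron needs to learn useful features.
Early stopping is a technique that stops the training when the validation score stops improving, even when the training score may still be improving. This prevents overfitting on the training dataset.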
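
The sketch below shows two of these ideas in isolation: the L1 and L2 penalty terms that would be added to a training loss, and a patience-based early stopping rule. The function names, the `strength` and `patience` parameters, and the validation scores are all assumptions made for illustration, not part of any particular framework.

```python
def l1_penalty(weights, strength):
    """L1 term: proportional to the sum of absolute parameter values."""
    return strength * sum(abs(w) for w in weights)


def l2_penalty(weights, strength):
    """L2 term: proportional to the sum of squared parameter values."""
    return strength * sum(w * w for w in weights)


def early_stopping_epoch(validation_scores, patience):
    """Return the epoch index at which training would stop.

    Training stops once the validation score has failed to improve
    for `patience` consecutive epochs.
    """
    best_score = float("-inf")
    epochs_without_improvement = 0
    for epoch, score in enumerate(validation_scores):
        if score > best_score:
            best_score = score
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(validation_scores) - 1


weights = [0.5, -2.0, 0.0, 1.5]
print(l1_penalty(weights, strength=0.1))
print(l2_penalty(weights, strength=0.1))

# Made-up validation scores, one per epoch.
validation_scores = [0.70, 0.78, 0.82, 0.81, 0.80, 0.79, 0.83]
print(early_stopping_epoch(validation_scores, patience=2))
```

With a patience of 2, training stops at epoch 4, after two epochs without beating the best score of 0.82. The later score of 0.83 is never reached, which is the usual trade-off when choosing the patience value.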

What is an imbalanced dataset? Can you list some ways to deal with it?

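An imbalanced dataset is one that has different proportions of target categories. For example, a dataset of medical images in which we have to detect some illness will typically have many more negative samples than positive samples: say, 98% of images without the illness and 2% with it.

There are different options to deal with imbalanced datasets:

Oversampling or undersampling. Instead of sampling from the training dataset with a uniform distribution, we can use other distributions so the model sees a more balanced dataset.
Data augmentation. We can add data to the less frequent categories by modifying existing data in a controlled way. In the example dataset, we could flip the images with illnesses, or add noise to copies of the images in such a way that the illness remains visible.
Using appropriate metrics. In the example dataset, if we had a model that always made negative predictions, it would achieve an accuracy of 98%. There are other metrics, such as precision, recall, and F-score, that describe the performance of the model better when using an imbalanced dataset.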
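
The always-negative model from the example can be checked with a few lines of code. This is a minimal sketch: the helper name `classification_metrics` is an assumption, and precision and recall are reported as 0.0 when their denominator is zero, which is one common convention rather than the only one.

```python
def classification_metrics(predictions, targets):
    """Compute accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(predictions, targets))
    tp = sum(p == 1 and t == 1 for p, t in pairs)
    fp = sum(p == 1 and t == 0 for p, t in pairs)
    fn = sum(p == 0 and t == 1 for p, t in pairs)
    tn = sum(p == 0 and t == 0 for p, t in pairs)

    accuracy = (tp + tn) / len(pairs)
    # By convention here, undefined ratios (zero denominator) are reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# 2% positive samples and 98% negative samples.
targets = [1] * 2 + [0] * 98
always_negative = [0] * 100

print(classification_metrics(always_negative, targets))
```

The accuracy is 0.98, yet recall and F1 are both 0.0: the model never finds a single positive case, which accuracy alone hides.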

Why do we need a validation set and test set? What is the difference between them?

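When training a model, we divide the available data into three separate sets:

The training dataset is used for fitting the model's parameters. However, the accuracy that we achieve on the training set is not reliable for predicting whether the model will be accurate on new samples.
The validation dataset is used to measure how well the model does on examples that weren't part of the training dataset. The metrics computed on the validation data can be used to tune the hyperparameters of the model. However, every time we evaluate the validation data and make decisions based on those scores, we are leaking information from the validation data into our model. The more evaluations, the more information is leaked. So we can end up overfitting to the validation data, and once again the validation score won't be reliable for predicting the behavior of the model in the real world.
The test dataset is used to measure how well the model does on previously unseen examples. It should only be used once we have tuned the parameters using the validation set.

So if we omit the test set and only use a validation set, the validation score won't be a good estimate of the generalization of the model.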
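
A three-way split can be sketched as follows. The function name and the 70/15/15 proportions are assumptions chosen for the example; the right proportions depend on the dataset size and the problem.

```python
import random


def train_validation_test_split(samples, validation_fraction=0.15, test_fraction=0.15, seed=0):
    """Shuffle the samples and cut them into three disjoint sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)

    n_test = round(len(shuffled) * test_fraction)
    n_validation = round(len(shuffled) * validation_fraction)

    test = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_validation]
    train = shuffled[n_test + n_validation:]
    return train, validation, test


train, validation, test = train_validation_test_split(range(1000))
print(len(train), len(validation), len(test))
```

The key property is that the three sets never overlap. Hyperparameters would be tuned against `validation` as many times as needed, while `test` is evaluated only at the very end.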

Can you explain the differences between supervised, unsupervised, and reinforcement learning?

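In supervised learning, we train a model to learn the relationship between input data and output data. We need labeled data to be able to do supervised learning.

With unsupervised learning, we only have unlabeled data. The model learns a representation of the data. Unsupervised learning is frequently used to initialize the parameters of the model when we have a lot of unlabeled data and a small fraction of labeled data. We first train an unsupervised model and, after that, we use the weights of that model to train a supervised model.

In reinforcement learning, the model has some input data and a reward that depends on the output of the model. The model learns a policy that maximizes the reward. Reinforcement learning has been applied successfully to strategy games such as Go and even to classic Atari video games.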

What are some factors that explain the success and recent rise of deep learning?

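The success of deep learning in the past decade can be explained by three main factors:

More data. The availability of massive labeled datasets allows us to train models with more parameters and achieve state-of-the-art scores. Other ML algorithms do not scale as well as deep learning when it comes to dataset size.
GPU. Training models on a GPU can reduce the training time by orders of magnitude compared to training on a CPU. Currently, cutting-edge models are trained on multiple GPUs or even on specialized hardware.
Improvements in algorithms. ReLU activation, dropout, and complex network architectures have also been very significant factors.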

What is data augmentation? Can you give some examples?

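Data augmentation is a technique for synthesizing new data by modifying existing data in such a way that the target is not changed, or it is changed in a known way.

Computer vision is one of the fields where data augmentation is very useful. There are many modifications that we can do to images: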

Resize
Horizontal or vertical flip
Rotate
Add noise
Deform
Modify colors

Each problem needs a customized data augmentation pipeline. For example, on OCR, doing flips will change the text and won't be beneficial; however, resizes and small rotations may help.
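
A few of the modifications listed above can be sketched on a tiny grayscale image stored as a list of rows. The function names, the pixel range of 0 to 1, and the uniform noise model are assumptions for this example; real pipelines typically rely on an image library.

```python
import random


def horizontal_flip(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def vertical_flip(image):
    """Mirror the image top to bottom."""
    return [row[:] for row in image[::-1]]


def add_noise(image, scale, rng):
    """Add uniform noise to each pixel and clip the result to the [0, 1] range."""
    return [
        [min(1.0, max(0.0, pixel + rng.uniform(-scale, scale))) for pixel in row]
        for row in image
    ]


image = [
    [0.0, 0.2, 0.9],
    [0.1, 0.5, 0.8],
]

rng = random.Random(0)
augmented = [
    horizontal_flip(image),
    vertical_flip(image),
    add_noise(image, scale=0.05, rng=rng),
]

# Each augmented copy keeps the same target label as the original image.
for sample in augmented:
    print(sample)
```

Which of these transforms are safe depends on the task: as noted above, flips would be harmful for OCR, so a pipeline for that problem would leave them out.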

What are convolutional networks? Where can we use them?

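Convolutional networks are a class of neural networks that use convolutional layers instead of fully connected layers. In a fully connected layer, every output unit has a weight connecting it to every input unit. In a convolutional layer, a small set of weights is reused across the whole input.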

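The advantage of convolutional layers over fully connected layers is that the number of parameters is far smaller. This results in better generalization of the model. For example, if we want to learn a transformation from a 10x10 image to another 10x10 image, we will need 10,000 parameters if using a fully connected layer. If we use two convolutional layers, the first one with nine filters and the second one with one filter, with a kernel size of 3x3, we will need only 162 weights (81 per layer, ignoring biases), because each filter spans all of its input channels.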
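
The parameter counts for this example can be worked out with a short sketch. It assumes a single-channel 10x10 image, standard convolutions in which each filter spans all input channels, and no bias terms; the helper names are made up for the illustration.

```python
def dense_weights(inputs, outputs):
    """Fully connected layer: one weight per (input unit, output unit) pair."""
    return inputs * outputs


def conv2d_weights(in_channels, filters, kernel_height, kernel_width):
    """Standard 2D convolution: each filter has one kernel per input channel."""
    return in_channels * filters * kernel_height * kernel_width


# Fully connected: 100 input pixels to 100 output pixels.
dense = dense_weights(10 * 10, 10 * 10)

# Two convolutional layers with 3x3 kernels.
first = conv2d_weights(in_channels=1, filters=9, kernel_height=3, kernel_width=3)
second = conv2d_weights(in_channels=9, filters=1, kernel_height=3, kernel_width=3)

print(dense)           # 10000
print(first, second)   # 81 81
print(first + second)  # 162
```

Under these assumptions the convolutional stack uses 162 weights against 10,000 for the fully connected layer. Adding one bias per filter would bring the convolutional total to 172, and the count does not grow with the image size, unlike the fully connected case.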

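Convolutional networks are applied where data has a clear dimensional structure. Time series analysis is an example where one-dimensional convolutions are used; for images, 2D convolutions are used; and for volumetric data, 3D convolutions are used.

Computer vision has been dominated by convolutional networks since 2012, when AlexNet won the ImageNet challenge.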

What is the curse of dimensionality? Can you list some ways to deal with it?

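The curse of dimensionality occurs when the training data has a high feature count, but the dataset does not have enough samples for a model to learn correctly from so many features. For example, a training dataset of 100 samples with 100 features will be very hard to learn from because the model will find random relations between the features and the target. However, if we had a dataset of 100k samples with 100 features, the model could probably learn the correct relations between the features and the target.

There are different options to fight the curse of dimensionality:

Feature selection. Instead of using all the features, we can train on a smaller subset of features.
Dimensionality reduction. There are many techniques that allow us to reduce the dimensionality of the features. Principal component analysis (PCA) and autoencoders are examples of dimensionality reduction techniques.
L1 regularization. Because it produces sparse parameters, L1 helps to deal with high-dimensional input.
Feature engineering. It's possible to create new features that summarize multiple existing features. For example, we can compute statistics such as the mean or median.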
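
The "random relations" problem can be observed directly with a small simulation. In this sketch, both the features and the target are pure noise, so any correlation between them is spurious. The function names are assumptions, and the large dataset uses 10,000 samples rather than 100k only to keep the run time short.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    variance_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(variance_x * variance_y)


def strongest_spurious_correlation(n_samples, n_features, seed=0):
    """Largest absolute correlation between random features and a random target."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_samples)]
    strongest = 0.0
    for _ in range(n_features):
        feature = [rng.gauss(0, 1) for _ in range(n_samples)]
        strongest = max(strongest, abs(pearson(feature, target)))
    return strongest


small = strongest_spurious_correlation(n_samples=100, n_features=100)
large = strongest_spurious_correlation(n_samples=10_000, n_features=100)
print(small, large)
```

With few samples, at least one of the 100 noise features usually appears noticeably correlated with the target, and a model can latch onto it. With many more samples, the strongest spurious correlation shrinks toward zero.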

There is more to interviewing than tricky technical questions, so these are intended merely as a guide. Not every "A" candidate worth hiring will be able to answer them all, nor does answering them all guarantee an "A" candidate. At the end of the day, hiring remains an art, a science, and a lot of work.
