How to assign PCI devices to different IOMMU groups, and how the IOMMU works

Posted: 2016-08-26 09:07

How the IOMMU works

Implementing PCI Device Passthrough (IOMMU) with Intel VT-d, KVM, QEMU, and libvirtd on Fedora 21 | Engineering Walden
A Journal of Modern Theurgy
See also the post addressing IOMMU-based PCI passthrough with KVM/QEMU in Fedora 22 (using an NVidia GPU, even!) for another walkthrough.
Hardware inventory
Ensure your processor supports IOMMU:
AMD processors must support AMD-Vi. The “svm” flag checked below only indicates AMD-V (CPU virtualization), which is a prerequisite but does not by itself confirm AMD-Vi:
$ cat /proc/cpuinfo | grep svm
#If no output is provided, the Linux kernel does not
#believe your processor is supported
Intel processors provide the vmx flag, but this only indicates VT-x support, which is insufficient on its own (we need VT-d).
The best way to determine support is checking the product site, confirming the processor supports VT-d, and moving to the next step.
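The flag check above can be wrapped in a small helper. The sketch below is an illustration rather than part of the original walkthrough: the function name `cpu_virt_flag` is hypothetical, and it only reports the CPU virtualization flag (vmx for Intel VT-x, svm for AMD-V). It cannot confirm VT-d or AMD-Vi, which still has to be checked against the vendor documentation.

```shell
# Report which CPU virtualization flag appears in a cpuinfo-style file.
# Prints "vmx" (Intel VT-x), "svm" (AMD-V), or "none".
# Neither flag proves IOMMU (VT-d / AMD-Vi) support.
# cpu_virt_flag is a hypothetical helper name used for illustration.
cpu_virt_flag() (
    cpuinfo="${1:-/proc/cpuinfo}"
    if grep '^flags' "$cpuinfo" 2>/dev/null | grep -qw vmx; then
        echo vmx
    elif grep '^flags' "$cpuinfo" 2>/dev/null | grep -qw svm; then
        echo svm
    else
        echo none
    fi
)
```

Calling `cpu_virt_flag` with no argument reads /proc/cpuinfo; passing a file path makes it easy to try against a saved copy from another machine.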
Ensure your motherboard’s firmware supports AMD-Vi or VT-d, as appropriate (just check the vendor site or the documentation with the hardware).
Ensure the option to enable IOMMU support is selected in the motherboard BIOS or EFI
Most of the time, this option is in the North Bridge configuration or thereabouts, but check the manual for your particular motherboard for the location of this option which should include the term “IOMMU”
Boot into Fedora 21 and execute the following command to determine if the system recognizes this capability:
#Good output on an Intel system looks like:
$ dmesg | grep IOMMU
[    0.100209] dmar: IOMMU 0: reg_base_addr fed90000 ver 1:0 cap c62 ecap f0101a
[    0.100214] dmar: IOMMU 1: reg_base_addr fed91000 ver 1:0 cap c2 ecap f0105a
[    0.100285] IOAPIC id 2 under DRHD base 0xfed91000 IOMMU 1
#Good output on an AMD system looks like:
$ dmesg | grep IOMMU
[    0.100209] AMD-Vi: Enabling IOMMU at .2 cap 0x40
[    0.100214] AMD-Vi: Lazy IO/TLB flushing enabled
[    0.100285] AMD-Vi: Initialized for Passthrough Mode
That’s it: if you’ve successfully completed the steps above, you have met the hardware prerequisites for this operation.
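As a rough way to automate reading that kernel log output, the sketch below classifies log text from standard input by looking for the same `dmar` and `AMD-Vi` markers shown in the sample output. The function name `iommu_vendor_from_log` is hypothetical, and the patterns are an assumption based only on the messages quoted above; other kernel versions may word these lines differently.

```shell
# Classify kernel log text read from stdin.
# Prints "intel" if dmar/DMAR IOMMU lines are present, "amd" if AMD-Vi
# lines are present, and "none" otherwise.
# iommu_vendor_from_log is a hypothetical helper name.
iommu_vendor_from_log() (
    log=$(cat)
    if printf '%s\n' "$log" | grep -qi 'dmar.*IOMMU'; then
        echo intel
    elif printf '%s\n' "$log" | grep -q 'AMD-Vi'; then
        echo amd
    else
        echo none
    fi
)
```

A typical invocation would be `dmesg | iommu_vendor_from_log`. A result of `none` suggests rechecking the firmware setting and the kernel arguments.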
The hardware I’m using for this purpose (and whose functionality I can verify) is as follows:
Motherboard:
ASRock Z77 Extreme4
Processor:
Intel Core i5-3470 LGA-1155
The motherboard is well-equipped with USB 3.0 and 2.0 jacks, SATA II and SATA III ports, and PCI expansion card slots, providing a solid, versatile platform for virtualization.
With the PCI expansion card space, one can add NICs and fast I/O devices to provide directly to guest domains (virtual machines).
As you will see below, choosing the correct locations for these devices requires a survey of the motherboard’s PCI architecture.
Preparing the Fedora 21 Operating System as a Virtualization Platform
Enable IOMMU support in the Linux Kernel
To test everything out without permanently modifying GRUB to boot with this support included, simply boot your machine to the GRUB menu and append “intel_iommu=on” to the end of the kernel arguments and boot.
If the system comes up normally, you should be in the clear.
If your machine does come up normally, you can execute `cat /proc/cmdline` to verify that the argument you provided to the kernel (intel_iommu=on) was recognized.
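The `cat /proc/cmdline` check can also be done without reading the line by eye. This is a minimal sketch; the helper name `has_kernel_arg` is hypothetical, and it assumes arguments are separated by single spaces, as they normally are in /proc/cmdline.

```shell
# Succeed (exit status 0) if the given argument appears as a whole word on
# the kernel command line. The second parameter defaults to /proc/cmdline.
# has_kernel_arg is a hypothetical helper name.
has_kernel_arg() (
    arg="$1"
    cmdline="${2:-/proc/cmdline}"
    # Put one argument per line, then require an exact whole-line match.
    tr ' ' '\n' < "$cmdline" | grep -qxF -- "$arg"
)
```

For example, `has_kernel_arg intel_iommu=on && echo "IOMMU argument is active"` prints the message only when the running kernel was booted with that argument.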
When you’ve tested your system and all is well, append the intel_iommu=on argument to the end of the GRUB_CMDLINE_LINUX string as done in the following example:
# vim /etc/default/grub
GRUB_CMDLINE_LINUX="rd.lvm.lv=fedora-server/root rd.lvm.lv=fedora-server/swap rhgb quiet intel_iommu=on"
# grub2-mkconfig -o /boot/efi/EFI/fedora/grub.cfg
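If you would rather not edit the file by hand, the change to GRUB_CMDLINE_LINUX can be scripted. The sketch below is one possible approach and not the method used in this walkthrough: `add_grub_arg` is a hypothetical name, it assumes GNU sed (for `-i`), a double-quoted GRUB_CMDLINE_LINUX value on a single line, and an argument free of characters that are special to sed such as `/` or `&`. Try it on a copy of the file first.

```shell
# Append a kernel argument to GRUB_CMDLINE_LINUX unless it is already there.
# add_grub_arg is a hypothetical helper name; assumes GNU sed.
add_grub_arg() (
    arg="$1"
    file="${2:-/etc/default/grub}"
    if grep '^GRUB_CMDLINE_LINUX=' "$file" | grep -qwF -- "$arg"; then
        echo "$arg already present in $file"
    else
        # Insert the argument just before the closing double quote.
        sed -i "s/^\(GRUB_CMDLINE_LINUX=\".*\)\"\$/\1 $arg\"/" "$file"
    fi
)
```

After running `add_grub_arg intel_iommu=on` as root, the grub2-mkconfig step is still required to regenerate the boot configuration.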
Install the Virtualization Platform group:
# yum install @virtualization
Reboot (I’d do it anyway).
Surveying Your Motherboard’s PCI Device Architecture
So, the motherboard I’m using for this demonstration is listed above.
Your results will vary depending on the way in which your motherboard is designed, but the principles laid out here should give you the understanding and tools you need to figure out the capabilities of your hardware and design on that basis.
VT-d and KVM support IOMMU remapping based on IOMMU groups, which are collections of PCI devices to which control may be passed.
You can’t always simply pass control to a single device. With a hard disk, for example, it is often the case that the disk connects to a port that is part of a single controller providing a number of the other ports physically nearby.
This entire controller will need to be passed to the guest, so this can require some planning to ensure you don’t assign, say, a four-port SATA controller to a guest which needs only two hard disks if it can be avoided.
The first thing to do is check out the layout of the system in virsh:
$ sudo virsh
virsh # nodedev-list --tree
+- net_lo_00_00_00_00_00_00
+- usb_usb1
+- usb_1_0_1_0
+- usb_1_1
+- usb_1_1_1_0
+- scsi_host8
+- scsi_target8_0_0
+- scsi_8_0_0_0
+- block_sdf_SanDisk_Cruzer_Fit_4C_0
+- scsi_generic_sg5
+- usb_1_2
+- usb_1_2_1_0
+- usb_usb2
+- usb_2_0_1_0
The capture above is only a small fraction of the entire device map of my motherboard.
In it, we can see a general form which will be repeated and modified throughout the tree.
The device nodes whose labels begin with “pci_” are going to be the devices we can consider passing through to our guests, though even the top-level pci_ devices may be in the same IOMMU group (and therefore require that they be jointly added to any given guest domain).
The most expedient way to go about this is to identify the PCI devices hosting the hardware you are interested in passing directly to guest domains and check on their groups by referencing their corresponding virtual file system directories in /sys/bus/pci/devices/.
For example, if I were interested in the device labeled “pci_f_0”, I would query as follows:
$ ll /sys/bus/pci/devices/f.0/iommu_group/devices/
lrwxrwxrwx. 1 root root 0 May  3 10:29 f.0 -> ../../../../devices/pci0:00:1f.0
lrwxrwxrwx. 1 root root 0 May  3 10:29 f.2 -> ../../../../devices/pci0:00:1f.2
lrwxrwxrwx. 1 root root 0 May  3 10:29 f.3 -> ../../../../devices/pci0:00:1f.3
This provides the other PCI devices sharing an IOMMU Group with the queried device.
So, what we see above is that pci_f_0 must be passed to a guest domain along with devices pci_f_2 and pci_f_3.
If we look at the portion of my device tree which represents these objects, we find that all three devices appear as top-level nodes, so there doesn’t appear to be any way to discern IOMMU Group membership from the virsh nodedev-list output.
Unfortunately, the best I know to do is investigate each PCI device of interest individually to gain an understanding of the group topology and plan accordingly.
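One way to shorten that per-device investigation is to walk the kernel's group directory, /sys/kernel/iommu_groups, which holds one numbered directory per group with a `devices` subdirectory of links. The sketch below prints every group and its members in one pass. The function name `list_iommu_groups` is hypothetical, and the optional root parameter exists only so the function can be tried against a test directory.

```shell
# Print one "group N: ADDRESS" line per device for every IOMMU group.
# The optional argument overrides the sysfs root (useful for testing).
# list_iommu_groups is a hypothetical helper name.
list_iommu_groups() (
    root="${1:-/sys/kernel/iommu_groups}"
    for dev in "$root"/*/devices/*; do
        # An unmatched glob is left as a literal pattern; skip it.
        [ -e "$dev" ] || continue
        group=${dev#"$root"/}    # strip the root prefix
        group=${group%%/*}       # keep only the group number
        echo "group $group: ${dev##*/}"
    done
)
```

On a host without an active IOMMU the directory is empty and the function prints nothing. Devices that share a group number in the output must be handed to the same guest together.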
Detaching PCI Devices in Preparation for Guest Domain Control Transfer
Assuming we had identified the IOMMU Group containing these three PCI devices (above) as the group whose control we would like to transfer to the guest domain, we must now detach these devices from the virtualization platform operating system’s kernel (which is acting as the hypervisor for the guest domains), leaving them available to be controlled by the target guest domain.
#If you're already in virsh, omit the first step:
$ sudo virsh
virsh # nodedev-dettach pci_f_0
Device pci_f_0 detached
virsh # nodedev-dettach pci_f_2
Device pci_f_2 detached
virsh # nodedev-dettach pci_f_3
Device pci_f_3 detached
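When a group holds several devices, the detach commands can be generated instead of typed. The sketch below is a dry run that only prints the commands. It assumes full-form PCI addresses such as 0000:00:1f.0 (an illustrative address, not taken from the listing above), and both function names are hypothetical. It uses the `nodedev-dettach` spelling shown in this walkthrough; newer virsh releases also accept `nodedev-detach`.

```shell
# Convert a PCI address (0000:00:1f.0) into the libvirt node device
# name (pci_0000_00_1f_0). Hypothetical helper name.
pci_addr_to_nodedev() {
    printf 'pci_%s\n' "$1" | tr ':.' '__'
}

# Print, without running, one detach command per device found in an
# iommu_group/devices directory. Hypothetical helper name.
print_detach_commands() (
    group_dir="$1"
    for dev in "$group_dir"/*; do
        [ -e "$dev" ] || continue
        echo "virsh nodedev-dettach $(pci_addr_to_nodedev "${dev##*/}")"
    done
)
```

Pointing `print_detach_commands` at a device's `iommu_group/devices/` directory under /sys/bus/pci/devices/ lists the commands for the whole group, which can be reviewed before being run as root.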
Using virsh and the Virtual Machine Manager to Grant Device Control to Guest Domains
Brief Apology:
I like a command-line-centric approach to problems as much as anyone, but the Virtual Machine Manager is an excellent piece of software, and its GUI interface for the next steps is just a lot easier than the virsh path.
I may write instructions for the latter at a later date, but if you’re managing a hypervisor arrangement of this complexity, I imagine you have VMM installed on a remote workstation anyway, or you don’t need me to tell you how to do this in virsh.
Once the device groups are identified and the guest domains are ready to have the devices added (all other hardware choices have been made), open the Virtual Machine Manager and:
Select the relevant guest domain
Select Open and then Show virtual hardware details in the upper right.
Choose Add Hardware in the bottom left.
Select PCI Host Device from the list of options on the left.
Locate the PCI devices in the IOMMU Group whose control you wish to transfer to the guest domain. They carry the values recorded from the nodedev-list output above, with underscores converted to colons as demonstrated in the sys virtual file system path above. Add them individually to the guest domain until all have been added.
So, continuing the example using the PCI device IDs provided above, I would search the list for three PCI devices whose addresses read as f:0, f:2, and f:3.
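For readers who prefer the virsh path, the same assignment can be expressed as a libvirt `<hostdev>` element and attached from the command line. The sketch below only generates the XML. The function name `pci_hostdev_xml`, the file name, and the guest name are hypothetical, and the address 0000:00:1f.0 is an illustrative full-form address rather than one copied from the listings above.

```shell
# Print a libvirt <hostdev> element for a PCI address such as 0000:00:1f.0.
# pci_hostdev_xml is a hypothetical helper name.
pci_hostdev_xml() (
    addr="$1"
    domain=${addr%%:*}     # 0000
    rest=${addr#*:}        # 00:1f.0
    bus=${rest%%:*}        # 00
    slotfn=${rest#*:}      # 1f.0
    slot=${slotfn%%.*}     # 1f
    fn=${slotfn#*.}        # 0
    cat <<EOF
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x$domain' bus='0x$bus' slot='0x$slot' function='0x$fn'/>
  </source>
</hostdev>
EOF
)
```

A possible use is `pci_hostdev_xml 0000:00:1f.0 > hostdev.xml` followed by `virsh attach-device GUEST hostdev.xml --config`, repeated for each device in the group, where GUEST stands for your guest domain's name. With `managed='yes'`, libvirt detaches the device from the host when the guest starts.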
Start the Guest Domain
And enjoy!
If all has gone well, you should have no problems.
If something has gone wrong, your guest domain will likely suffer a kernel panic, so the problem will be apparent immediately.
Welcome to the cutting edge of modern system engineering!
It is flat-out amazing what we can do with commodity hardware.
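Linux virtualization and PCI passthrough
Processors have evolved to improve performance in virtualized environments, but what about I/O? Learn about an I/O performance enhancement called device (or PCI) passthrough, an innovation that improves the performance of PCI devices using hardware support from Intel® (VT-d) or AMD (IOMMU).
M. Tim Jones, freelance writer
M. Tim Jones is an embedded firmware architect and the author of Artificial Intelligence: A Systems Approach, GNU/Linux Application Programming (now in its second edition), AI Application Programming (second edition), and BSD Sockets Programming from a Multilanguage Perspective. His engineering background ranges from kernel development for geosynchronous spacecraft to embedded systems architecture and networking protocol development. Tim is a Consultant Engineer for Emulex Corp. in Longmont, Colorado.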
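Platform virtualization is about sharing a platform among two or more operating systems to use resources more efficiently. But a platform implies more than just a processor: it also includes the other important elements that make up the platform, such as storage, networking, and other hardware resources. Some hardware resources can be virtualized easily, such as the processor and storage; others cannot, such as a video adapter or a serial port. Peripheral Component Interconnect (PCI) passthrough provides a way to use those resources efficiently when sharing is not possible or not useful. This article explores the concept of passthrough, discusses its implementation in hypervisors, and details the hypervisors that support this recent innovation.

Platform device emulation
Before exploring passthrough, let's look at how device emulation works today in two hypervisor architectures. The first architecture integrates device emulation into the hypervisor, while the second pushes device emulation into an application outside the hypervisor.
Device emulation within the hypervisor is a common method, implemented in the VMware workstation product (an operating system-based hypervisor). In this model, the hypervisor includes emulations of common devices that the various guest operating systems can share, such as virtual disks, virtual network adapters, and other necessary platform elements. This model is shown in Figure 1.
Figure 1. Hypervisor-based device emulation
The second architecture is called user space device emulation (see Figure 2). As the name implies, the device emulation is implemented in user space rather than embedded in the hypervisor. QEMU (which provides not only device emulation but a hypervisor as well) provides the device emulation and is used by a large number of independent hypervisors, such as the Kernel-based Virtual Machine (KVM) and VirtualBox. This model is advantageous because the device emulation is independent of the hypervisor and can therefore be shared among hypervisors. It also permits arbitrary device emulation without burdening the hypervisor (which runs in a privileged state) with that functionality.
Figure 2. User space device emulation
Pushing device emulation from the hypervisor to user space has some distinct advantages. The most important relates to what is called the trusted computing base (TCB). The TCB of a system is the set of all components that are critical to its security. It stands to reason that if the system is minimized, there is a smaller chance of bugs and therefore a more secure system. The same idea applies to the hypervisor. The security of the hypervisor is crucial, because it isolates multiple independent guest operating systems. With less code in the hypervisor (device emulation having been pushed into the less privileged user space), there is less chance of leaking privileges to untrusted users.
Another variant of hypervisor-based device emulation is paravirtualized drivers. In this model, the hypervisor includes the physical drivers, and each guest operating system includes a hypervisor-aware driver that works in concert with the hypervisor drivers (called paravirtualized, or PV, drivers).
Regardless of whether the device emulation occurs in the hypervisor or on top of a guest virtual machine (VM), the emulation methods are similar. Device emulation can mimic a specific device (such as a Novell NE1000 network adapter) or a specific type of disk (such as Integrated Device Electronics [IDE]). The physical hardware can differ greatly: for example, while an IDE drive is emulated to the guest operating systems, the physical hardware platform can use a Serial ATA (SATA) drive. This is useful because IDE support is common among many operating systems and can serve as a common denominator, instead of requiring every guest operating system to support more advanced drive types.

Device passthrough
As the two device models above show, device sharing comes at a price. Whether device emulation is performed in the hypervisor or in user space within an independent VM, overhead exists. That overhead is worthwhile as long as multiple guest operating systems need to share the devices. If sharing is not necessary, there are more efficient ways to use them.
So, at the highest level, device passthrough is about isolating a device to a given guest operating system so that the device can be used exclusively by that guest (see Figure 3). But why is this useful? There are many reasons; two of the most important are performance and exclusive use of a device that is not inherently shareable.
Figure 3. Device passthrough within the hypervisor
For performance, near-native performance can be achieved using device passthrough. This is ideal for networking applications (or those with high disk I/O) that have not adopted virtualization because of the contention and performance degradation incurred by going through the hypervisor (to a driver in the hypervisor, or through the hypervisor to a user space emulation). But assigning devices to specific guests is also useful when those devices cannot be shared. For example, if a system includes multiple video adapters, those adapters can be passed through to specific guest domains.
Finally, there may be specialized PCI devices that only one guest domain uses, or devices that the hypervisor does not support and that should therefore be passed through to the guest. Individual USB ports can be isolated to a given domain, and a serial port (which is itself not shareable) can be isolated to a particular guest.

Underneath the covers of device emulation
Early forms of device emulation implemented shadow forms of device interfaces in the hypervisor to provide the guest operating system with a virtual interface to the hardware. This virtual interface consisted of the expected interface, including a virtual address space representing the device (such as shadow PCI) and virtual interrupts. But with a device driver talking to a virtual interface and a hypervisor translating that communication to actual hardware, the overhead is considerable, particularly for high-bandwidth devices such as network adapters.
Xen popularized the PV approach (discussed in the previous section), which reduces the performance degradation by making the guest operating system driver aware that it is being virtualized. In this case, the guest operating system does not see a PCI space for a device (such as a network adapter) but instead a network adapter application programming interface (API) that provides a higher-level abstraction (such as a packet interface). The downside of this approach is that the guest operating system must be modified for PV; the upside is that you can achieve near-native performance in some cases.
Early attempts at device passthrough used a thin emulation model, in which the hypervisor provided software-based memory management (translating guest operating system address space to trusted host address space). Although this gave early developers a means to isolate a device to a particular guest operating system, the approach lacked the performance and scalability required for large virtualization environments. Fortunately, processor vendors have equipped next-generation processors with instructions to support hypervisors as well as logic for device passthrough, including interrupt virtualization and direct memory access (DMA) support. So, instead of trapping and emulating access to physical devices below the hypervisor, new processors provide DMA address translation and permission checking for efficient device passthrough.

Hardware support for device passthrough
Both Intel and AMD provide support for device passthrough in their newer processor architectures (in addition to new instructions that assist the hypervisor). Intel calls its option Virtualization Technology for Directed I/O (VT-d), while AMD refers to I/O Memory Management Unit (IOMMU). In each case, the new CPUs provide the means to map PCI physical addresses to guest virtual addresses. When this mapping occurs, the hardware takes care of access (and protection), and the guest operating system can use the device as if it were a non-virtualized system. In addition to mapping guest to physical memory, the new architectures provide isolation so that other guests (or the hypervisor) are prevented from accessing that memory. The Intel and AMD CPUs provide much more virtualization functionality; you can learn more in the Resources section.
Another innovation that helps interrupts scale to large numbers of VMs is called Message Signaled Interrupts (MSI). Rather than relying on physical interrupt pins to be associated with a guest, MSI transforms interrupts into messages that are more easily virtualized (scaling to thousands of individual interrupts). MSI has been available since PCI version 2.2 and is also available in PCI Express (PCIe), where it allows fabrics to scale to many devices. MSI is ideal for I/O virtualization, because it allows isolation of interrupt sources (as opposed to physical pins that must be multiplexed or routed through software).

Hypervisor support for device passthrough
Using the latest virtualization-enhanced processor architectures, a number of hypervisors and virtualization solutions support device passthrough. You'll find support for device passthrough (using VT-d or the IOMMU) in Xen and KVM as well as other hypervisors. In most cases, the guest operating system (domain 0) must be compiled to support passthrough, which is typically available as a kernel build-time option. Hiding the devices from the host VM may also be required (as is done in Xen using pciback). Some restrictions apply to PCI (for example, PCI devices behind a PCIe-to-PCI bridge must be assigned to the same domain), but PCIe does not have this restriction.
Additionally, you'll find configuration support for device passthrough in libvirt (along with virsh), which provides an abstraction over the configuration schemes used by the underlying hypervisors.

Problems with device passthrough
One problem introduced by device passthrough arises with live migration. Live migration is the suspension and subsequent migration of a VM to a new physical host, at which point the VM is restarted. This is a great feature for load balancing across a network of physical hosts, but it presents a problem when passthrough devices are used. PCI hotplug (for which several specifications exist) is one aspect that needs to be addressed. PCI hotplug permits PCI devices to come and go from a given kernel, which is ideal, particularly when migrating a VM to a hypervisor on a new host machine (devices need to be unplugged and then plugged in again at the new hypervisor). When devices are emulated, such as virtual network adapters, the emulation provides a layer that abstracts away the physical hardware. In this way, a virtual network adapter migrates easily within the VM (this is also supported by the Linux® bonding driver, which allows multiple logical network adapters to be bonded to the same interface).

Next steps in I/O virtualization
The next steps in I/O virtualization are actually happening today. For example, PCIe includes support for virtualization. One virtualization concept that is ideal for server virtualization is called Single-Root I/O Virtualization (SR-IOV). This virtualization technology (created through the PCI-Special Interest Group, or PCI-SIG) provides device virtualization in single-root complex instances (in this case, a single server with multiple VMs sharing a device). Another variation, called Multi-Root IOV, supports larger topologies (such as blade servers, where multiple servers can access one or more PCIe devices). In a sense, this permits arbitrarily large networks of devices, including servers, end devices, and switches (complete with device discovery and packet routing).
With SR-IOV, a PCIe device can export not just a number of PCI physical functions but also a set of virtual functions that share resources on the I/O device. The simplified architecture for server virtualization is shown in Figure 4. In this model, no passthrough is necessary, because virtualization occurs at the end device, allowing the hypervisor to simply map virtual functions to VMs to achieve native device performance with the security of isolation.
Figure 4. Passthrough with SR-IOV

Conclusion
Virtualization has been under development for more than 50 years, but only now is I/O virtualization receiving widespread attention. Commercial processor support for virtualization has been around for only five years. So, in essence, we are on the cusp of what is to come for platform and I/O virtualization. As a key element of future architectures like cloud computing, virtualization will certainly be an interesting technology to watch as it evolves. As usual, Linux is at the forefront of support for these new technologies, and recent kernels (2.6.27 and later) are beginning to include support for them.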
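As a practical footnote to the SR-IOV discussion, Linux exposes a physical function's virtual function counts through the sysfs attributes `sriov_totalvfs` and `sriov_numvfs` in the device's directory. The sketch below reads them. It is an illustration under assumptions: the function name `sriov_vf_counts` is hypothetical, and these attributes come from kernels newer than the 2.6.27 baseline the article mentions.

```shell
# Report how many virtual functions a PCI device supports and how many
# are currently enabled, given the device's sysfs directory.
# sriov_vf_counts is a hypothetical helper name.
sriov_vf_counts() (
    dev_dir="$1"
    if [ -r "$dev_dir/sriov_totalvfs" ] && [ -r "$dev_dir/sriov_numvfs" ]; then
        echo "total=$(cat "$dev_dir/sriov_totalvfs") enabled=$(cat "$dev_dir/sriov_numvfs")"
    else
        echo "no SR-IOV capability exposed"
    fi
)
```

The argument is a directory under /sys/bus/pci/devices/ for the adapter in question. Devices without SR-IOV support simply lack the two attribute files.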
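Resources
For details on Xen with VT-d support for device passthrough, see the Xen website, which offers a wealth of related information, including some details of PCI passthrough for the Xen hypervisor.
libvirt provides a management API for building applications that manage hypervisors. A wiki on the libvirt website discusses the required technology and support.
Material from Intel for the Fedora project discusses the live migration problem for Linux VMs in the context of device passthrough.
Descriptions of the Single-Root and Multi-Root IOV technologies, which provide I/O virtualization in single-root (single-host) or multi-root (multi-host, such as blade server) topologies, are available for download. These technologies are products of the PCI-SIG.
Other developerWorks articles (December 2006, April 2007, and September 2007) cover additional virtualization solutions and go into more detail on KVM and QEMU.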