Computer Literature Translation (Chinese and English)


Literature Translation

Chinese Translation

Evaluation of a Java Processor

1. Introduction

In this paper, we present evaluation results for a Java processor with respect to size and performance. This Java processor is called JOP – which stands for Java Optimized Processor – and is based on the assumption that a full native implementation of all Java Virtual Machine (JVM) bytecode instructions is not a useful approach. JOP is a Java processor for embedded real-time systems, in particular a small processor for resource-constrained devices with time-predictable execution of Java programs.

Table I lists the relevant Java processors available to date. Sun introduced the first version of picoJava in 1997. Sun's picoJava is the Java processor most often cited in research papers. It is used as a reference for new Java processors and as the basis for research into improving various aspects of a Java processor. Ironically, this processor was never released as a product by Sun. A redesign followed in 1999, known as picoJava-II, which is now freely available with a rich set of documentation. The architecture of picoJava is a stack-based CISC processor implementing 341 different instructions; it is the most complex Java processor available and can be implemented in about 440K gates.

aJile's JEMCore is a direct-execution Java processor that is available as both an IP core and a stand-alone processor. It is based on the 32-bit JEM2 Java chip developed by Rockwell-Collins. The processor contains 48KB of zero wait state RAM and peripheral components. 16KB of the RAM is used for the writable control store; the remaining 32KB is used for storage of the processor stack.

Vulcan ASIC's Moon processor is an implementation of the JVM to run in an FPGA. The execution model is the often-used mix of direct, microcode and trapped execution. A simple stack folding is implemented in order to reduce five memory cycles to three for instruction sequences like push-push-add. The Moon2 processor is available as an encrypted HDL source for Altera FPGAs, or as VHDL or Verilog source code.

The Lightfoot 32-bit core is a hybrid 8/32-bit processor based on the Harvard architecture. Program memory is 8 bits wide and data memory is 32 bits wide. The core contains a 3-stage pipeline with an integer ALU, a barrel shifter and a 2-bit multiply step unit. According to DCT, the performance is typically 8 times better than that of RISC interpreters running at the same clock speed.

Komodo is a multithreaded Java processor with a four-stage pipeline. It is intended as a basis for research on real-time scheduling on a multithreaded microcontroller. The unique feature of Komodo is the instruction fetch unit with four independent program counters and status flags for four threads. A priority manager is responsible for hardware real-time scheduling and can select a new thread after each bytecode instruction.

FemtoJava is a research project to build an application-specific Java processor. The bytecode usage of the embedded application is analyzed, and a customized version of FemtoJava is generated in order to minimize the resource usage. FemtoJava is not included in Section IV, as the processor could not run even the simplest benchmark.

Besides the real Java processors, a few FORTH chips (Cjip, PSC1000) are marketed as Java processors. Java coprocessors (Jazelle, JSTAR) provide Java execution speedup for general-purpose processors.

From Table I we can see that JOP is the smallest realization of a hardware JVM in an FPGA and also has the highest clock frequency.

In the following section, a brief overview of the architecture of JOP is given, followed by a more detailed description of the microcode. Section III compares JOP's resource usage with that of other soft-core processors. In Section IV, a number of different solutions for embedded Java are compared at the bytecode level and at the application level.

2. JOP Architecture

JOP is a stack computer with its own instruction set, called microcode in this paper. Java bytecodes are translated into microcode instructions or sequences of microcode. The difference between the JVM and JOP is best described as follows:

The JVM is a CISC stack architecture, whereas JOP is a RISC stack architecture.

Figure 1 shows JOP's major function units. A typical configuration of JOP contains the processor core, a memory interface and a number of I/O devices.

The processor core contains the three microcode pipeline stages microcode fetch, decode and execute, and an additional translation stage bytecode fetch. The module called extension provides the link between the processor core and the memory and I/O modules. The ports to the other modules are the address and data bus for the bytecode instructions, the two top elements of the stack (A and B), input to the top-of-stack (Data) and a number of control signals. There is no direct connection between the processor core and the external world.

The memory interface provides a connection between the main memory and the processor core. It also contains the bytecode cache. The extension module controls data read and write. The busy signal is used by the microcode instruction wait to synchronize the processor core with the memory unit. The core reads bytecode instructions through dedicated buses (BC address and BC data) from the memory subsystem.

The extension module performs three functions: (a) it contains hardware accelerators (such as the multiplier unit in this example), (b) the control for the memory and the I/O module, and (c) the multiplexer for the read data that is loaded into the top-of-stack register. The write data from the top-of-stack (A) is connected directly to all modules.

A. The Processor Pipeline

JOP is a fully pipelined architecture with single-cycle execution of microcode instructions and a novel approach to mapping Java bytecode to these instructions. Figure 2 shows the datapath of JOP.

Three stages form the JOP core pipeline, executing microcode instructions. An additional stage in front of the core pipeline fetches Java bytecodes – the instructions of the JVM – and translates these bytecodes into addresses in microcode. Bytecode branches are also decoded and executed in this stage. The second pipeline stage fetches JOP instructions from the internal microcode memory and executes microcode branches. Besides the usual decode function, the third pipeline stage also generates addresses for the stack RAM. As every stack machine instruction has either pop or push characteristics, it is possible to generate the fill or spill address for the following instruction at this stage. The last pipeline stage performs ALU operations, load, store and stack spill or fill. At the execution stage, operations are performed with the two topmost elements of the stack.

A stack machine with two explicit registers for the two topmost stack elements and automatic fill/spill needs neither an extra write-back stage nor any data forwarding. The short pipeline results in short branch delays; therefore, branch prediction logic, which is hard to analyze with respect to the worst-case execution time (WCET), can be avoided.

B. Interrupt Logic

Interrupts are considered hard to handle in a pipelined processor, meaning the implementation tends to be complex (and therefore resource consuming). In JOP, the bytecode–microcode translation is used cleverly to avoid having to handle interrupts in the core pipeline.

Interrupts are implemented as special bytecodes. These bytecodes are inserted by the hardware into the Java instruction stream. When an interrupt is pending and the next instruction is fetched from the bytecode cache, the associated special bytecode is used in place of the instruction from the bytecode cache. The result is that interrupts are accepted at bytecode boundaries. The worst-case preemption delay is the execution time of the slowest bytecode that is implemented in microcode. Bytecodes that are implemented in Java (see Section II-D) can be interrupted.

The implementation of interrupts at the bytecode–microcode mapping stage keeps interrupts transparent to the core pipeline and avoids complex logic. Interrupt handlers can be implemented in the same way as standard bytecodes, i.e. in microcode or in Java. The special bytecode can result in a call of a JVM internal method in the context of the interrupted thread. This mechanism implicitly stores almost the complete context of the currently active thread on the stack.

C. Cache

A pipelined processor architecture calls for higher memory bandwidth. A standard technique to avoid processing bottlenecks due to the higher memory bandwidth is caching. However, standard cache organizations improve the average execution time but are difficult to predict for WCET analysis. Two time-predictable caches are proposed for JOP: a stack cache as a substitution for the data cache, and a method cache to cache the instructions.

As the stack is a heavily accessed memory region, the stack – or part of it – is placed in on-chip memory. This part of the stack is referred to as the stack cache. Fill and spill of the stack cache are subject to microcode control and therefore time-predictable.

A novel way to organize an instruction cache, as a method cache, is proposed. The cache stores complete methods, and cache misses only occur on method invocation and return. Cache block replacement depends on the call tree, instead of on instruction addresses. This method cache is easy to analyze with respect to worst-case behavior and still provides a substantial performance gain when compared against a solution without an instruction cache.

D. Microcode

The following discussion concerns two different instruction sets: bytecode and microcode. Bytecodes are the instructions that make up a compiled Java program. These instructions are executed by a Java virtual machine. The JVM does not assume any particular implementation technology. Microcode is the native instruction set of JOP. Bytecodes are translated, during their execution, into JOP microcode. Both instruction sets are designed for an extended stack machine.

(1) Translation of Bytecodes to Microcode: To date, no hardware implementation of the JVM exists that is capable of executing all bytecodes in hardware alone. This is due to the following: some bytecodes, such as new, which creates and initializes a new object, are too complex to implement in hardware. These bytecodes have to be emulated by software.

To build a self-contained JVM without an underlying operating system, direct access to the memory and I/O devices is necessary. There are no bytecodes defined for such low-level access. These low-level services are usually implemented in native functions, which means that another language (C) is native to the processor. However, for a Java processor, bytecode is the native language.

One way to solve this problem is to implement simple bytecodes in hardware and to emulate the more complex ones and the native functions in software with a different instruction set (sometimes called microcode). However, a processor with two different instruction sets results in a complex design.

Another common solution, used in Sun's picoJava, is to execute a subset of the bytecodes natively and to use a software trap to execute the remainder. This solution entails an overhead (a minimum of 16 cycles in picoJava) for the software trap.

In JOP, this problem is solved in a much simpler way. JOP has a single native instruction set, the so-called microcode. During execution, every Java bytecode is translated to either one, or a sequence of, microcode instructions. This translation merely adds one pipeline stage to the core processor and results in no execution overhead. With this solution, we are free to define the JOP instruction set to map smoothly to the stack architecture of the JVM, and to find an instruction coding that can be implemented with minimal hardware.

Figure 3 gives an example of the data flow from the Java program counter to JOP microcode. The fetched bytecode acts as an index into the jump table. The jump table contains the start addresses of the JVM implementation in microcode. This address is loaded into the JOP program counter for every bytecode executed.

Every bytecode is translated to an address in the microcode that implements the JVM. If there exists an equivalent microinstruction for the bytecode, it is executed in one cycle and the next bytecode is translated. For a more complex bytecode, JOP just continues to execute microcode in the subsequent cycles. The end of such a sequence is coded in the microcode instruction (as the nxt bit).

(2) Compact Microcode: For the JVM to be implemented efficiently, the microcode has to fit the Java bytecode. Since the JVM is a stack machine, the microcode is also stack oriented. However, the JVM is not a pure stack machine. Method parameters and local variables are defined as locals. These locals can reside in a stack frame of the method and are accessed with an offset relative to the start of this locals area. Additional local variables are available at the microcode level; these variables serve as scratch variables, like registers in a conventional CPU. However, arithmetic and logic operations are performed on the stack.

Some bytecodes, such as ALU operations and the short form access to locals, are directly implemented by an equivalent microcode instruction (with a different encoding). Additional instructions are available to access internal registers, main memory and I/O devices. A relative conditional branch (on zero/non-zero of the top of stack) performs control flow decisions at the microcode level. For optimum use of the available memory resources, all instructions are 8 bits long. There are no variable-length instructions, and every instruction, with the exception of wait, is executed in a single cycle. To keep the instruction set this dense, two concepts are applied:

Two types of operands, immediate values and branch distances, would normally force an instruction set to be longer than 8 bits. The instruction set is then either expanded to 16 or 32 bits, as in typical RISC processors, or allowed to be of variable length at byte boundaries. A first implementation of the JVM with a 16-bit instruction set showed that only a small number of different constants are necessary for immediate values and relative branch distances.

In the current realization of JOP, the different immediate values are collected while the microcode is being assembled and are put into the initialization file for the local RAM. These constants are accessed indirectly in the same way as the local variables. They are similar to initialized variables, apart from the fact that there are no operations to change their value during runtime, which would serve no purpose and would waste instruction codes.

A similar solution is used for branch distances. The assembler generates a VHDL file with a table of all branch constants found. This table is indexed using instruction bits during runtime. These indirections make it possible to retain an 8-bit instruction set, and provide 16 different immediate values and 32 different branch constants. For a general-purpose instruction set, these indirections would impose too many restrictions. As the microcode only implements the JVM, this solution is a viable option.

To simplify the logic for instruction decoding, the instruction coding is carefully chosen. For example, one bit in the instruction specifies whether the instruction will increment or decrement the stack pointer. The offset to access the locals is directly encoded in the instruction. This is not the case for the original encoding of the equivalent bytecodes (e.g. iload_0 is 0x1a and iload_1 is 0x1b).

(3) Flexible Implementation of Bytecodes: As mentioned above, some Java bytecodes are very complex. One solution already described is to emulate them through a sequence of microcode instructions. However, some of the more complex bytecodes are very seldom used. To further reduce the resource implications for JOP, in this case local memory, bytecodes can even be implemented by using Java bytecodes. During the assembly of the JVM, all labels that represent an entry point for a bytecode implementation are used to generate the translation table. For all bytecodes for which no such label is found, i.e. there is no implementation in microcode, a 'not implemented' address is generated. The instruction sequence at this address invokes a static method from a system class. This class contains 256 static methods, one for each possible bytecode, ordered by the bytecode value. The bytecode is used as the index into the method table of this system class. This feature also allows for the easy configuration of resource usage versus performance.

3. Resource Usage

Cost, alongside energy consumption, is an important issue for embedded systems. The cost of a chip is directly related to the die size (the cost per die is roughly proportional to the square of the die area). Chips with fewer gates also consume less energy. Processors for embedded systems are therefore optimized for minimum chip size.

One major design objective in the development of JOP was to create a small system that could be implemented in a low-cost FPGA. Table II shows the resource usage for different configurations of JOP and for different soft-core processors implemented in an Altera EP1C6 FPGA. Estimating equivalent gate counts for designs in an FPGA is problematic. It is therefore better to compare the two basic structures, Logic Cells (LC) and embedded memory blocks.

All configurations of JOP contain a memory interface to a 32-bit static RAM and an 8-bit FLASH for the Java program and the FPGA configuration data. The minimum configuration implements multiplication and the shift operations in microcode. In the basic configuration, these operations are implemented as a sequential Booth multiplier and a single-cycle barrel shifter. The typical configuration also contains some useful I/O devices such as a UART and a timer with interrupt logic for multi-threading. The typical configuration of JOP needs about 30% of the LCs in a Cyclone EP1C6, thus leaving enough resources free for application-specific logic.

As a reference, NIOS, Altera's popular RISC soft-core, is also included in the list. NIOS has a 16-bit instruction set, a 5-stage pipeline, and can be configured with a 16 or 32-bit datapath. Version A is the minimum configuration of NIOS. Version B adds an external memory interface, multiplication support and a timer. Version A is comparable with the minimal configuration of JOP, and Version B with its typical configuration.

SPEAR (Scalable Processor for Embedded Applications in Real-time Environments) is a 16-bit processor with deterministic execution times. SPEAR contains predicated instructions to support single-path programming. SPEAR is included in the list as it is also a processor designed for real-time systems.

To prove that the VHDL code for JOP is as portable as possible, JOP was also implemented in a Xilinx Spartan-3 FPGA. Only the instantiation and initialization code for the on-chip memories is vendor-specific, whilst the rest of the VHDL code can be shared for the different targets. JOP consumes about the same LC count in the Spartan device, but has a slower clock frequency (83MHz).

From this comparison we can see that we have achieved our objective of designing a small processor. The commercial Java processor, Lightfoot, is 2.3 times larger (and 2.5 times slower) than JOP in the basic configuration. A typical 32-bit RISC processor consumes about 1.6 to 1.8 times the resources of JOP. However, the RISC processor can be clocked 20% faster than JOP in the same technology. The only processor that is similar in size is SPEAR. However, while SPEAR is a 16-bit processor, JOP contains a 32-bit datapath.

Table III provides gate count estimates for JOP, picoJava, the aJile processor, and the Intel Pentium MMX processor that is used in the benchmarks in the next section. The equivalent gate count for an LC varies between 5.5 and 7.4 – we chose a factor of 6 gates per LC and 1.5 gates per memory bit for the estimated gate count of JOP in the table. JOP is listed in the typical configuration that consumes 1831 LCs. The Pentium MMX contains 4.5M transistors, which are equivalent to 1125K gates.

We can see from the table that the on-chip memory dominates the overall gate count of JOP and, to an even greater extent, of the aJile processor. The aJile processor is about 12 times larger than JOP.

4. Performance

Running benchmarks is problematic, both generally and especially in the case of embedded systems. The best benchmark would be the application that is intended to run on the system being tested. To get comparable results, SPEC provides benchmarks for various systems. However, the one for Java, SPECjvm98, is usually too large for embedded systems.

Due to the absence of a standard Java benchmark for embedded systems, a small benchmark suite that should run on even the smallest device is provided here. It contains several micro-benchmarks for evaluating the number of clock cycles for single bytecodes or short sequences of bytecodes, and two application benchmarks. To provide a realistic workload for embedded systems, a real-time application was adapted to create the first application benchmark (Kfl). The application is taken from one of the nodes of a distributed motor control system. A simulation of both the environment (sensors and actors) and the communication system (commands from the master station) forms part of the benchmark, so as to simulate the real-world workload. The second application benchmark is an adaptation of a tiny TCP/IP stack for embedded Java. This benchmark contains two UDP server/clients, exchanging messages via a loopback device.

As we will see, there is a great variation in processing power across different embedded systems. To cater for this variation, all benchmarks are 'self adjusting'. Each benchmark consists of an aspect that is benchmarked in a loop. The loop count adapts itself until the benchmark runs for more than a second. The number of iterations per second is then calculated, which means that higher values indicate better performance.

All the benchmarks measure how often a function is executed per second. In the Kfl benchmark, this function contains the main loop of the application, which is executed in a periodic cycle in the original application. In the benchmark, the wait for the next period is omitted, so that the time measured solely represents execution time. The UDP benchmark contains the generation of a request, transmitting it through the UDP/IP stack, generating the answer and transmitting it back as the benchmark function. The iteration count is the number of received answers per second.

The following list gives a brief description of the Java systems that were benchmarked:

JOP is implemented in a Cyclone FPGA, running at 100MHz. The main memory is a 32-bit SRAM (15ns) with an access time of 2 clock cycles. The benchmarked configuration of JOP contains a 4KB method cache organized in 16 blocks.

leJOS: As an example of a low-end embedded device, we use the RCX robot controller from the LEGO MindStorms series. It contains a 16-bit Hitachi H8300 microcontroller, running at 16MHz. leJOS is a tiny interpreting JVM for the RCX.

TINI is an enhanced 8051 clone running a software JVM. The results were taken from a custom board with a 20MHz crystal, with the chip's PLL set to a factor of 2.

KVM is a port of Sun's KVM, which is part of the Connected Limited Device Configuration (CLDC), to Altera's NIOS II processor on MicroC Linux. NIOS is implemented in a Cyclone FPGA and clocked at 50MHz. Apart from the different clock frequency, this is a good comparison of an interpreting JVM running in the same FPGA as JOP.

The benchmark results for Komodo were obtained by Matthias Pfeffer on a cycle-accurate simulation of Komodo. aJile's JEMCore is a direct-execution Java processor that is available in two different versions: the aJ80 and the aJ100. A development system, the JStamp, contains the aJ80 with an 8-bit memory, clocked at 74MHz. The SaJe board from Systronix contains an aJ100 that is clocked at 103MHz and contains 10ns 32-bit SRAM.

The EJC (Embedded Java Controller) platform is a typical example of a production JIT system on a RISC processor. The system is based on a 32-bit ARM720T processor running at 74MHz. It contains up to 64MB SDRAM and up to 16MB of NOR flash.

gcj is the GNU compiler for Java. This configuration represents the batch-compiler solution, running on a 266MHz Pentium under Linux.

MB is the realization of Java on a RISC processor for an FPGA (the Xilinx MicroBlaze). Java is compiled to C with a Java compiler for real-time systems, and the C program is compiled with the standard GNU toolchain.

In Figure 4, the geometric mean of the two application benchmarks is shown. The unit used for the results is iterations per second. Note that the vertical axis is logarithmic, in order to obtain useful figures to show the great variation in performance. The top diagram shows the absolute performance, while the bottom diagram shows the same results scaled to a 1MHz clock frequency. The results of the application benchmarks and the geometric means are shown in Table IV.

It should be noted that scaling to a single clock frequency can prove problematic. The clock frequency of a processor and the memory access time do not necessarily scale together. As an example, if we were to scale the result of the 100MHz JOP to 1GHz, this would also involve reducing the memory access time from 15ns to 1.5ns. Processors with a 1GHz clock frequency are available, but the fastest asynchronous SRAMs to date have an access time of 10ns.

A. Discussion

When comparing JOP and the aJile processor against leJOS, TINI and the KVM, we can see that a Java processor is up to 500 times faster than an interpreting JVM on a standard processor for embedded systems. The average performance of JOP is even better than that of a JIT-compiler solution for embedded systems, as represented by the EJC system.

Even when scaled to the same clock frequency, the compiling JVM on the PC (gcj) is much faster than any embedded solution. However, the kernel of the application is smaller than 4KB. It therefore fits into the level-one cache of the Pentium MMX (16KB + 16KB). For a comparison with a Pentium-class processor, we would need a larger application.

JOP is about 7 times faster than the aJ80 Java processor on the popular JStamp board. However, the aJ80 processor only contains an 8-bit memory interface and suffers from this bottleneck. The SaJe system contains the aJ100 with 32-bit, 10ns SRAM and is about 10% slower than JOP with its 15ns SRAM.

The MicroBlaze system is representative of a Java batch-compilation system for a RISC processor. MicroBlaze is configured with the same cache as JOP and is clocked at the same frequency. JOP is about four times faster than this solution, thus showing that native execution of Java bytecodes is faster than batch-compiled Java on a similar system. However, the results for the MicroBlaze solution are at a preliminary stage, as the Java2C compiler is still under development.

The micro-benchmarks are intended to give insight into the implementation of the JVM. In Table V, we can see the execution time in clock cycles of various bytecodes. As almost all bytecodes manipulate the stack, it is not possible to measure the execution time of a single bytecode; as a minimum requirement, a second instruction is necessary to reverse the stack operation. For the compiling versions of the JVM, these micro-benchmarks do not produce useful results, as the compiler optimizes the measurement loop and the execution time cannot be measured at this fine granularity.

We can deduce that the WCET of simple bytecodes is also the average execution time. We can see that the combination of iload and iadd executes in two cycles, which means that each of these operations is executed in a single cycle. The bytecode iinc is one of the few instructions that do not manipulate the stack and can be measured alone. As iinc is not implemented in hardware, it takes a total of 11 cycles to execute in microcode. It is fair to assume that this is too great an overhead for an instruction that is found in every iterative loop with an integer index. However, the decision to implement this instruction in microcode stems from the observation that the dynamic instruction count for iinc is only about 2%.

The sequence for the branch benchmark (if_icmplt) contains two load instructions that push the arguments onto the stack; these arguments are then consumed by the branch instruction. This benchmark verifies that a branch requires a constant four cycles on JOP, whether taken or not.
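The subtraction technique behind such micro-benchmarks can be sketched in Java: time a loop containing the measured sequence, time an empty loop, and attribute the difference to the sequence. The class and method names below are illustrative only, and, as noted above, on a compiling JVM the optimizer may remove the very difference being measured:

```java
// Sketch of the micro-benchmark idea: the cost of a short bytecode
// sequence (e.g. iload/iload/iadd/istore) is the time of a loop that
// contains it minus the time of an otherwise identical empty loop.
class MicroBench {
    static int a = 1, b = 2, r;

    // Time cnt executions of a loop body, in nanoseconds.
    static long time(Runnable body, int cnt) {
        long start = System.nanoTime();
        for (int i = 0; i < cnt; i++) body.run();
        return System.nanoTime() - start;
    }

    // Net per-iteration cost of the measured sequence, in nanoseconds.
    static long overheadPerIteration(int cnt) {
        long empty = time(() -> { }, cnt);        // loop overhead only
        long load  = time(() -> r = a + b, cnt);  // overhead + sequence
        return (load - empty) / cnt;
    }
}
```

On JOP the same subtraction is done in clock cycles, where the single-cycle pipeline makes the result exact rather than statistical.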

During the evaluation of the aJile system, unexpected behavior was observed. The aJ80 on the JStamp board is clocked at 7.3728MHz, and the internal frequency can be set with a PLL. The aJ80 is rated for 80MHz, and the maximum PLL factor that can therefore be used is 10. Running the benchmarks with different PLL settings produced some strange results: for example, with the PLL multiplier set to 10, the aJ80 was about 12.8 times faster! Other PLL factors also resulted in a greater-than-linear speedup. The only explanation we could find is that the internal time used for the benchmarks depends on the PLL setting. A comparison with the wall clock time showed that the internal time of the aJ80 runs too fast – by 12.4% at a PLL factor of 10 – a property we would not expect from a processor marketed for real-time systems. The SaJe board may also suffer from the problem described.

B. Execution Time Jitter

For real-time systems, the worst-case execution time is of primary importance. We measured the execution times of several iterations of the main function from the Kfl benchmark. Figure 5 shows the measurements, scaled to the minimum execution time.

A period of four iterations can be seen. This period results from the simulation of commands from the base station, which are executed every fourth iteration. In iteration 10, a command to start the motor is issued. We see the resulting rise in execution time at iteration 12, when this command is processed. At iteration 54, the simulation triggers the sensors and the motor is finally stopped.

Different execution times in different application modes are inherent in the design of the simulation. However, the ratio between the longest and the shortest period is five on the JStamp, four on the gcj system, and only three on JOP. A system with the aJile processor would therefore need to be 1.7 times faster than JOP in order to provide the same WCET for this measurement. At iteration 33, we can see a higher execution time on the JStamp system that is not observed on JOP. This variation at iteration 33 is not caused by the benchmark.

The execution time under gcj on the Linux system shows some peaks (up to 10 times the minimum, not shown in the figure). This observation was to be expected, as the gcj/Linux system is not a real-time solution. A JIT solution was also measured, but is not shown in the figure. At some points in the simulation, due to the invocation of the compiler, the worst-case ratio between the maximum and minimum execution times was 1313, which shows that a JIT compiler is impractical for real-time applications.

It should be noted that measuring execution times is not a safe method for obtaining WCET estimates. However, in the absence of a WCET analysis tool, it is a viable way to gain some insight into the WCET behavior of different systems.


Original Text

Evaluation of a Java Processor

1. Introduction

In this paper, we present the evaluation results for a Java processor with respect to size and performance. This Java processor is called JOP – which stands for Java Optimized Processor – and is based on the assumption that a full native implementation of all Java Virtual Machine (JVM) bytecode instructions is not a useful approach. JOP is a Java processor for embedded real-time systems, in particular a small processor for resource-constrained devices with time-predictable execution of Java programs.

Table I lists the relevant Java processors available to date. Sun introduced the first version of picoJava in 1997. Sun's picoJava is the Java processor most often cited in research papers. It is used as a reference for new Java processors and as the basis for research into improving various aspects of a Java processor. Ironically, this processor was never released as a product by Sun. A redesign followed in 1999, known as picoJava-II, which is now freely available with a rich set of documentation. The architecture of picoJava is a stack-based CISC processor implementing 341 different instructions and is the most complex Java processor available. The processor can be implemented in about 440K gates.

aJile's JEMCore is a direct-execution Java processor that is available as both an IP core and a stand-alone processor. It is based on the 32-bit JEM2 Java chip developed by Rockwell-Collins. The processor contains 48KB of zero wait state RAM and peripheral components. 16KB of the RAM is used for the writable control store. The remaining 32KB is used for storage of the processor stack.

Vulcan ASIC’s Moon processor is an implementation of the JVM to run in an FPGA. The execution model is the often-used mix of direct, microcode and trapped execution. A simple stack folding is implemented in order to reduce five memory cycles to three for instruction sequences like push-push-add. The Moon2 processor is available as an encrypted HDL source for Altera FPGAs or as VHDL or Verilog source code.

The Lightfoot 32-bit core is a hybrid 8/32-bit processor based on the Harvard architecture. Program memory is 8 bits wide and data memory is 32 bits wide. The core contains a 3-stage pipeline with an integer ALU, a barrel shifter and a 2-bit multiply step unit. According to DCT, the performance is typically 8 times better than RISC interpreters running at the same clock speed.

Komodo is a multithreaded Java processor with a four-stage pipeline. It is intended as a basis for research on real-time scheduling on a multithreaded microcontroller. The unique feature of Komodo is the instruction fetch unit with four independent program counters and status flags for four threads. A priority manager is responsible for hardware real-time scheduling and can select a new thread after each bytecode instruction.

FemtoJava is a research project to build an application specific Java processor. The bytecode usage of the embedded application is analyzed and a customized version of FemtoJava is generated in order to minimize the resource usage. Femto-Java is not included in Section IV, as the processor could not run even the simplest benchmark.

Besides the real Java processors, a few FORTH chips (Cjip, PSC1000) are marketed as Java processors. Java coprocessors (Jazelle, JSTAR) provide Java execution speedup for general-purpose processors.

From Table I we can see that JOP is the smallest realization of a hardware JVM in an FPGA and also has the highest clock frequency.

In the following section, a brief overview of the architecture of JOP is given, followed by a more detailed description of the microcode. Section III compares JOP's resource usage with other soft-core processors. In Section IV, a number of different solutions for embedded Java are compared at the bytecode level and at the application level.

2. JOP Architecture

JOP is a stack computer with its own instruction set, called microcode in this paper. Java bytecodes are translated into microcode instructions or sequences of microcode. The difference between the JVM and JOP is best described as follows:

The JVM is a CISC stack architecture, whereas JOP is a RISC stack architecture.

Figure 1 shows JOP’s major function units. A typical configuration of JOP contains the processor core, a memory interface and a number of IO devices.

The processor core contains the three microcode pipeline stages microcode fetch, decode and execute, and an additional translation stage bytecode fetch. The module called extension provides the link between the processor core and the memory and I/O modules. The ports to the other modules are the address and data bus for the bytecode instructions, the two top elements of the stack (A and B), input to the top-of-stack (Data) and a number of control signals. There is no direct connection between the processor core and the external world.

The memory interface provides a connection between the main memory and the processor core. It also contains the bytecode cache. The extension module controls data read and write. The busy signal is used by the microcode instruction wait to synchronize the processor core with the memory unit. The core reads bytecode instructions through dedicated buses (BC address and BC data) from the memory subsystem.

The extension module performs three functions: (a) it contains hardware accelerators (such as the multiplier unit in this example), (b) the control for the memory and the I/O module, and (c) the multiplexer for the read data that is loaded into the top-of-stack register. The write data from the top-of-stack (A) is connected directly to all modules.

A. The Processor Pipeline

JOP is a fully pipelined architecture with single cycle execution of microcode instructions and a novel approach to mapping Java bytecode to these instructions. Figure 2 shows the datapath for JOP.

Three stages form the JOP core pipeline, executing microcode instructions. An additional stage in the front of the core pipeline fetches Java bytecodes – the instructions of the JVM – and translates these bytecodes into addresses in microcode. Bytecode branches are also decoded and executed in this stage. The second pipeline stage fetches JOP instructions from the internal microcode memory and executes microcode branches. Besides the usual decode function, the third pipeline stage also generates addresses for the stack RAM. As every stack machine instruction has either pop or push characteristics, it is possible to generate fill or spill addresses for the following instruction at this stage. The last pipeline stage performs ALU operations, load, store and stack spill or fill. At the execution stage, operations are performed with the two topmost elements of the stack.

A stack machine with two explicit registers for the two topmost stack elements and automatic fill/spill needs neither an extra write-back stage nor any data forwarding. The details of this two-level stack architecture are described elsewhere. The short pipeline results in short branch delays. Therefore, branch prediction logic, which is hard to analyze with respect to the Worst Case Execution Time (WCET), can be avoided.

B. Interrupt Logic

Interrupts are considered hard to handle in a pipelined processor, meaning the implementation tends to be complex (and therefore resource consuming). In JOP, the bytecode–microcode translation is used cleverly to avoid having to handle interrupts in the core pipeline.

Interrupts are implemented as special bytecodes. These bytecodes are inserted by the hardware in the Java instruction stream. When an interrupt is pending and the next fetched byte from the bytecode cache is an instruction, the associated special bytecode is used instead of the instruction from the bytecode cache. The result is that interrupts are accepted at bytecode boundaries. The worst-case preemption delay is the execution time of the slowest bytecode that is implemented in microcode. Bytecodes that are implemented in Java (see Section II-D) can be interrupted.

The implementation of interrupts at the bytecode–microcode mapping stage keeps interrupts transparent to the core pipeline and avoids complex logic. Interrupt handlers can be implemented in the same way as standard bytecodes are implemented, i.e. in microcode or in Java. This special bytecode can result in a call of a JVM internal method in the context of the interrupted thread. This mechanism implicitly stores almost the complete context of the currently active thread on the stack.
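The injection mechanism can be modeled in plain Java. This is a software sketch of the hardware behavior, not JOP's implementation; the opcode value and class names are invented for illustration:

```java
// Hypothetical model of JOP's bytecode fetch stage: when an interrupt is
// pending, a special bytecode is substituted for the next fetched
// instruction, so interrupts are only accepted at bytecode boundaries.
class FetchUnit {
    static final int SYS_INT = 0xF0;  // made-up opcode for the interrupt bytecode

    private final int[] bytecodeCache;
    private int pc;
    private boolean interruptPending;

    FetchUnit(int[] code) { this.bytecodeCache = code; }

    void raiseInterrupt() { interruptPending = true; }

    // Fetch the next bytecode. On an interrupt the special bytecode is
    // returned and the program counter is not advanced, so the displaced
    // instruction executes after the handler returns.
    int fetch() {
        if (interruptPending) {
            interruptPending = false;
            return SYS_INT;
        }
        return bytecodeCache[pc++];
    }
}
```

The worst-case preemption delay falls out of this model directly: an interrupt raised just after a fetch has to wait for the current bytecode to finish.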

C. Cache

A pipelined processor architecture calls for higher memory bandwidth. A standard technique to avoid processing bottlenecks due to the higher memory bandwidth is caching. However, standard cache organizations improve the average execution time but are difficult to predict for WCET analysis. Two time-predictable caches are proposed for JOP: a stack cache as a substitution for the data cache and a method cache to cache the instructions.

As the stack is a heavily accessed memory region, the stack – or part of it – is placed in on-chip memory. This part of the stack is referred to as the stack cache. Fill and spill of the stack cache are subject to microcode control and therefore time-predictable.

A novel way to organize an instruction cache, as a method cache, is proposed. The cache stores complete methods, and cache misses only occur on method invocation and return. Cache block replacement depends on the call tree, instead of on instruction addresses. This method cache is easy to analyze with respect to worst-case behavior and still provides substantial performance gain when compared against a solution without an instruction cache.
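As a rough illustration of why misses can only occur at invocation and return, here is a minimal software model of a method cache. The FIFO whole-method replacement and the names are assumptions made for this sketch; JOP's actual replacement works on blocks along the call tree:

```java
import java.util.LinkedHashSet;

// Sketch of a method cache: complete methods are cached, so the only
// points at which a miss (and hence a main-memory access) can happen are
// method invocation and return. Replacement here is plain FIFO over
// whole methods, which is simpler than JOP's block-based scheme.
class MethodCache {
    private final int capacity;  // number of methods the cache can hold
    private final LinkedHashSet<String> cached = new LinkedHashSet<>();
    int misses;

    MethodCache(int capacity) { this.capacity = capacity; }

    // Called on every invoke and on every return.
    void load(String method) {
        if (cached.contains(method)) return;  // hit: no memory traffic
        misses++;
        if (cached.size() == capacity) {
            cached.remove(cached.iterator().next());  // evict oldest method
        }
        cached.add(method);
    }
}
```

Because misses are confined to call and return points, a WCET analysis only has to reason about the call tree, not about individual instruction addresses.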

D. Microcode

The following discussion concerns two different instruction sets: bytecode and microcode. Bytecodes are the instructions that make up a compiled Java program. These instructions are executed by a Java virtual machine. The JVM does not assume any particular implementation technology. Microcode is the native instruction set for JOP. Bytecodes are translated, during their execution, into JOP microcode. Both instruction sets are designed for an extended stack machine.

(1).Translation of Bytecodes to Microcode: To date, no hardware implementation of the JVM exists that is capable of executing all bytecodes in hardware alone. This is due to the following: some bytecodes, such as new, which creates and initializes a new object, are too complex to implement in hardware. These bytecodes have to be emulated by software.

To build a self-contained JVM without an underlying operating system, direct access to the memory and I/O devices is necessary. There are no bytecodes defined for low-level access. These low-level services are usually implemented in native functions, which means that another language (C) is native to the processor. However, for a Java processor, bytecode is the native language.

One way to solve this problem is to implement simple bytecodes in hardware and to emulate the more complex and native functions in software with a different instruction set (sometimes called microcode). However, a processor with two different instruction sets results in a complex design.

Another common solution, used in Sun’s picoJava , is to execute a subset of the bytecode native and to use a software trap to execute the remainder. This solution entails an overhead (a minimum of 16 cycles in picoJava) for the software trap.

In JOP, this problem is solved in a much simpler way. JOP has a single native instruction set, the so-called microcode. During execution, every Java bytecode is translated to either one, or a sequence of microcode instructions. This translation merely adds one pipeline stage to the core processor and results in no execution overheads. With this solution, we are free to define the JOP instruction set to map smoothly to the stack architecture of the JVM, and to find an instruction coding that can be implemented with minimal hardware.

Figure 3 gives an example of this data flow from the Java program counter to JOP microcode. The fetched bytecode acts as an index for the jump table. The jump table contains the start addresses for the JVM implementation in microcode. This address is loaded into the JOP program counter for every bytecode executed.

Every bytecode is translated to an address in the microcode that implements the JVM. If there exists an equivalent microinstruction for the bytecode, it is executed in one cycle and the next bytecode is translated. For a more complex bytecode, JOP just continues to execute microcode in the subsequent cycles. The end of this sequence is coded in the microcode instruction (as the nxt bit).
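The translation described above amounts to one table lookup per executed bytecode. A minimal Java model might look as follows; the opcode-to-address mappings are invented for illustration, not JOP's real microcode addresses:

```java
// Sketch of JOP's translation stage: the fetched bytecode indexes a jump
// table that holds the microcode start address for each JVM instruction.
class Translator {
    static final int[] JUMP_TABLE = new int[256];
    static {
        // Invented addresses: a simple bytecode maps to a single
        // microinstruction, a complex one to a longer microcode sequence.
        JUMP_TABLE[0x60] = 0x010;  // iadd
        JUMP_TABLE[0xc7] = 0x120;  // ifnonnull
    }

    // The returned address is loaded into the microcode program counter
    // for every bytecode; microcode then runs until an instruction with
    // the nxt bit set triggers the next translation.
    static int microPcFor(int bytecode) {
        return JUMP_TABLE[bytecode & 0xff];
    }
}
```

In hardware this lookup is simply an extra pipeline stage, which is why the translation adds no per-bytecode execution overhead.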

(2).Compact Microcode: For the JVM to be implemented efficiently, the microcode has to fit to the Java bytecode. Since the JVM is a stack machine, the microcode is also stack oriented. However, the JVM is not a pure stack machine. Method parameters and local variables are defined as locals. These locals can reside in a stack frame of the method and are accessed with an offset relative to the start of this locals area. Additional local variables are available at the microcode level. These variables serve as scratch variables, like registers in a conventional CPU. However, arithmetic and logic operations are performed on the stack.

Some bytecodes, such as ALU operations and the short form access to locals, are directly implemented by an equivalent microcode instruction (with a different encoding). Additional instructions are available to access internal registers, main memory and I/O devices. A relative conditional branch (zero/non zero of TOS) performs control flow decisions at the microcode level. For optimum use of the available memory resources, all instructions are 8 bits long. There are no variable-length instructions and every instruction, with the exception of wait, is executed in a single cycle. To keep the instruction set this dense, two concepts are applied:

Two types of operands, immediate values and branch distances, normally force an instruction set to be longer than 8 bits. The instruction set is either expanded to 16 or 32 bits, as in typical RISC processors, or allowed to be of variable length at byte boundaries. A first implementation of the JVM with a 16-bit instruction set showed that only a small number of different constants are necessary for immediate values and relative branch distances.

In the current realization of JOP, the different immediate values are collected while the microcode is being assembled and are put into the initialization file for the local RAM. These constants are accessed indirectly in the same way as the local variables. They are similar to initialized variables, apart from the fact that there are no operations to change their value during runtime, which would serve no purpose and would waste instruction codes.

A similar solution is used for branch distances. The assembler generates a VHDL file with a table for all found branch constants. This table is indexed using instruction bits during runtime. These indirections during runtime make it possible to retain an 8-bit instruction set, and provide 16 different immediate values and 32 different branch constants. For a general purpose instruction set, these indirections would impose too many restrictions. As the microcode only implements the JVM, this solution is a viable option.

To simplify the logic for instruction decoding, the instruction coding is carefully chosen. For example, one bit in the instruction specifies whether the instruction will increment or decrement the stack pointer. The offset to access the locals is directly encoded in the instruction. This is not the case for the original encoding of the equivalent bytecodes (e.g. iload_0 is 0x1a and iload_1 is 0x1b).

(3) Flexible Implementation of Bytecodes: As mentioned above, some Java bytecodes are very complex. One solution already described is to emulate them through a sequence of microcode instructions. However, some of the more complex bytecodes are very seldom used. To further reduce the resource implications for JOP, in this case local memory, bytecodes can even be implemented by using Java bytecodes. During the assembly of the JVM, all labels that represent an entry point for the bytecode implementation are used to generate the translation table. For all bytecodes for which no such label is found, i.e. there is no implementation in microcode, a 'not implemented' address is generated. The instruction sequence at this address invokes a static method from a system class. This class contains 256 static methods, one for each possible bytecode, ordered by the bytecode value. The bytecode is used as the index into the method table of this system class. This feature also allows for the easy configuration of resource usage versus performance.
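The fallback dispatch can be sketched in Java. The class name and table mechanics below are hypothetical; the point is only that the bytecode value itself indexes one of 256 Java implementations:

```java
// Sketch of the "bytecodes in Java" fallback: the microcode at the
// 'not implemented' address calls into a system class that holds one
// static implementation per possible bytecode, indexed by opcode value.
class JVMHelp {
    static int invoked = -1;  // records the last dispatched opcode (for the demo)

    static final Runnable[] TABLE = new Runnable[256];
    static {
        for (int i = 0; i < 256; i++) {
            final int bc = i;
            // Stand-in body; the real table would hold e.g. the Java
            // implementation of 'new' (0xbb) at index 0xbb.
            TABLE[i] = () -> invoked = bc;
        }
    }

    static void dispatch(int bytecode) {
        TABLE[bytecode & 0xff].run();
    }
}
```

Moving an implementation between microcode and such a Java method changes only which table (jump table or method table) resolves the opcode, which is what makes the resource/performance trade-off configurable.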

3. Resource Usage

Cost, alongside energy consumption, is an important issue for embedded systems. The cost of a chip is directly related to the die size (the cost per die is roughly proportional to the square of the die area). Chips with fewer gates also consume less energy. Processors for embedded systems are therefore optimized for minimum chip size.

One major design objective in the development of JOP was to create a small system that could be implemented in a low-cost FPGA. Table II shows the resource usage for different configurations of JOP and different soft-core processors implemented in an Altera EP1C6 FPGA. Estimating equivalent gate counts for designs in an FPGA is problematic. It is therefore better to compare the two basic structures, Logic Cells (LC) and embedded memory blocks.

All configurations of JOP contain a memory interface to a 32-bit static RAM and an 8-bit FLASH for the Java program and the FPGA configuration data. The minimum configuration implements multiplication and the shift operations in microcode. In the basic configuration, these operations are implemented as a sequential Booth multiplier and a single-cycle barrel shifter. The typical configuration also contains some useful I/O devices such as a UART and a timer with interrupt logic for multi-threading. The typical configuration of JOP needs about 30% of the LCs in a Cyclone EP1C6, thus leaving enough resources free for application-specific logic.

As a reference, NIOS, Altera's popular RISC soft-core, is also included in the list. NIOS has a 16-bit instruction set, a 5-stage pipeline and can be configured with a 16 or 32-bit datapath. Version A is the minimum configuration of NIOS. Version B adds an external memory interface, multiplication support and a timer. Version A is comparable with the minimal configuration of JOP, and Version B with its typical configuration.

SPEAR (Scalable Processor for Embedded Applications in Real-time Environments) is a 16-bit processor with deterministic execution times. SPEAR contains predicated instructions to support single-path programming. SPEAR is included in the list as it is also a processor designed for real-time systems.

To prove that the VHDL code for JOP is as portable as possible, JOP was also implemented in a Xilinx Spartan-3 FPGA [26]. Only the instantiation and initialization code for the on-chip memories is vendor-specific, whilst the rest of the VHDL code can be shared for the different targets. JOP consumes about the same LC count (1844 LCs) in the Spartan device, but has a slower clock frequency (83MHz).

From this comparison we can see that we have achieved our objective of designing a small processor. The commercial Java processor, Lightfoot, is 2.3 times larger (and 2.5 times slower) than JOP in the basic configuration. A typical 32-bit RISC processor consumes about 1.6 to 1.8 times the resources of JOP. However, the RISC processor can be clocked 20% faster than JOP in the same technology. The only processor that is similar in size is SPEAR. However, while SPEAR is a 16-bit processor, JOP contains a 32-bit data path.

Table III provides gate count estimates for JOP, picoJava, the aJile processor, and the Intel Pentium MMX processor that is used in the benchmarks in the next section. The equivalent gate count for an LC varies between 5.5 and 7.4 – we chose a factor of 6 gates per LC and 1.5 gates per memory bit for the estimated gate count for JOP in the table. JOP is listed in the typical configuration that consumes 1831 LCs. The Pentium MMX contains 4.5M transistors [27], which are equivalent to 1125K gates.

We can see from the table that the on-chip memory dominates the overall gate count of JOP, and to an even greater extent, of the aJile processor. The aJile processor is about 12 times larger than JOP.

4. Performance

Running benchmarks is problematic, both generally and especially in the case of embedded systems. The best benchmark would be the application that is intended to run on the system being tested. To get comparable results, SPEC provides benchmarks for various systems. However, the one for Java, the SPECjvm98 [28], is usually too large for embedded systems.

Due to the absence of a standard Java benchmark for embedded systems, a small benchmark suite that should run on even the smallest device is provided here. It contains several micro-benchmarks for evaluating the number of clock cycles for single bytecodes or short sequences of bytecodes, and two application benchmarks. To provide a realistic workload for embedded systems, a real-time application was adapted to create the first application benchmark (Kfl). The application is taken from one of the nodes of a distributed motor control system. A simulation of both the environment (sensors and actors) and the communication system (commands from the master station) forms part of the benchmark, so as to simulate the real-world workload. The second application benchmark is an adaptation of a tiny TCP/IP stack for embedded Java. This benchmark contains two UDP server/clients, exchanging messages via a loopback device.

As we will see, there is a great variation in processing power across different embedded systems. To cater for this variation, all benchmarks are ‘self adjusting’. Each benchmark consists of an aspect that is benchmarked in a loop. The loop count adapts itself until the benchmark runs for more than a second. The number of iterations per second is then calculated, which means that higher values indicate better performance.

All the benchmarks measure how often a function is executed per second. In the Kfl benchmark, this function contains the main loop of the application that is executed in a periodic cycle in the original application. In the benchmark the wait for the next period is omitted, so that the time measured solely represents execution time. The UDP benchmark contains the generation of a request, transmitting it through the UDP/IP stack, generating the answer and transmitting it back as a benchmark function. The iteration count is the number of received answers per second.
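A single UDP request/response cycle over the loopback device can be sketched as below. This uses the standard `java.net` API purely for illustration; the actual benchmark runs on an embedded TCP/IP stack, and all names here are made up for the example.

```java
import java.net.*;

// One round trip of the kind the UDP benchmark counts: a client sends a
// request via the loopback device, a server echoes it back, and the
// received answer is counted as one iteration.
public class UdpLoop {
    public static int echoOnce() {
        try (DatagramSocket server = new DatagramSocket(0);
             DatagramSocket client = new DatagramSocket()) {
            server.setSoTimeout(2000);
            client.setSoTimeout(2000);

            // Client: generate and transmit the request.
            byte[] req = "ping".getBytes();
            client.send(new DatagramPacket(req, req.length,
                    InetAddress.getLoopbackAddress(), server.getLocalPort()));

            // Server: receive the request and transmit the answer back.
            byte[] buf = new byte[16];
            DatagramPacket p = new DatagramPacket(buf, buf.length);
            server.receive(p);
            server.send(new DatagramPacket(p.getData(), p.getLength(),
                    p.getAddress(), p.getPort()));

            // Client: receive the answer - one counted iteration.
            client.receive(p);
            return 1;
        } catch (Exception e) {
            return 0;
        }
    }
}
```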

The following list gives a brief description of the Java systems that were benchmarked:

JOP is implemented in a Cyclone FPGA, running at 100MHz. The main memory is a 32-bit SRAM (15ns) with an access time of 2 clock cycles. The benchmarked configuration of JOP contains a 4KB method cache, organized in 16 blocks.

leJOS is a tiny interpreting JVM for the RCX robot controller from the LEGO MindStorms series, which we use as an example of a low-end embedded device. The RCX contains a 16-bit Hitachi H8300 microcontroller [30], running at 16MHz.

TINI is an enhanced 8051 clone running a software JVM. The results were taken from a custom board with a 20MHz crystal, and the chip’s PLL is set to a factor of 2.

KVM is a port of Sun's KVM, part of the Connected Limited Device Configuration (CLDC), to Altera's NIOS II processor running under µClinux. NIOS is implemented on a Cyclone FPGA and clocked at 50MHz. Aside from the different clock frequency, this provides a good comparison of an interpreting JVM running in the same FPGA as JOP.

The benchmark results of Komodo were obtained by Matthias Pfeffer on a cycle-accurate simulation of Komodo.

aJile's JEMCore is a direct-execution Java processor that is available in two different versions: the aJ80 and the aJ100. A development system, the JStamp, contains the aJ80 with an 8-bit memory interface, clocked at 74MHz. The SaJe board from Systronix contains an aJ100 that is clocked at 103MHz and contains 10ns 32-bit SRAM.

The EJC (Embedded Java Controller) platform is a typical example of a JIT system on a RISC processor. The system is based on a 32-bit ARM720T processor running at 74MHz. It contains up to 64 MB SDRAM and up to 16 MB of NOR flash.

GCJ is the GNU compiler for Java. This configuration represents the batch compiler solution, running on a 266MHz Pentium under Linux.

MB is the realization of Java on a RISC processor for an FPGA (Xilinx MicroBlaze). Java is compiled to C with a Java compiler for real-time systems, and the C program is compiled with the standard GNU toolchain.

In Figure 4, the geometric mean of the two application benchmarks is shown. The unit used for the result is iterations per second. Note that the vertical axis is logarithmic, in order to obtain useful figures to show the great variation in performance. The top diagram shows absolute performance, while the bottom diagram shows the same results scaled to a 1MHz clock frequency. The results of the application benchmarks and the geometric mean are shown in Table IV.

It should be noted that scaling to a single clock frequency could prove problematic. The relation between processor clock frequency and memory access time cannot always be maintained. To give an example, if we were to increase the results of the 100MHz JOP to 1GHz, this would also involve reducing the memory access time from 15ns to 1.5ns. Processors with 1GHz clock frequency are already available, but the fastest asynchronous SRAM to date has an access time of 10ns.

A. Discussion

When comparing JOP and the aJile processor against leJOS, TINI, and KVM, we can see that a Java processor is up to 500 times faster than an interpreting JVM on a standard processor for an embedded system. The average performance of JOP is even better than a JIT-compiler solution on an embedded system, as represented by the EJC system.

Even when scaled to the same clock frequency, the compiling JVM on a PC (gcj) is much faster than the embedded solutions. However, the kernel of the application is smaller than 4KB. It therefore fits in the level one cache of the Pentium MMX (16KB + 16KB). For a comparison with a Pentium-class processor we would need a larger application.

JOP is about 7 times faster than the aJ80 Java processor on the popular JStamp board. However, the aJ80 processor only contains an 8-bit memory interface, and suffers from this bottleneck. The SaJe system contains the aJ100 with 32-bit, 10ns SRAMs and is about 10% slower than JOP with its 15ns SRAMs.

The MicroBlaze system is a representation of a Java batch-compilation system for a RISC processor. MicroBlaze is configured with the same cache as JOP and clocked at the same frequency. JOP is about four times faster than this solution, thus showing that native execution of Java bytecodes is faster than batch-compiled Java on a similar system. However, the results of the MicroBlaze solution are preliminary, as the Java2C compiler is still under development.

The micro-benchmarks are intended to give insight into the implementation of the JVM. In Table V, we can see the execution time in clock cycles of various bytecodes. As almost all bytecodes manipulate the stack, it is not possible to measure the execution time for a single bytecode. As a minimum requirement, a second instruction is necessary to reverse the stack operation. For compiling versions of the JVM, these micro-benchmarks do not produce useful results. The compiler performs optimizations that make it impossible to measure execution times at this fine a granularity.

For JOP we can deduce that the WCET for simple bytecodes is also the average execution time. We can see that the combination of iload and iadd executes in two cycles, which means that each of these two operations is executed in a single cycle. The iinc bytecode is one of the few instructions that do not manipulate the stack and can be measured alone. As iinc is not implemented in hardware, we have a total of 11 cycles that are executed in microcode. It is fair to assume that this comprises too great an overhead for an instruction that is found in every iterative loop with an integer index. However, the decision to implement this instruction in microcode was derived from the observation that the dynamic instruction count for iinc is only about 2%.
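The way a cycle count is derived from such paired measurements can be illustrated as follows. The class name, method and the synthetic timing values are assumptions for the example; only the 100MHz clock matches the JOP configuration described earlier.

```java
// Sketch of the micro-benchmark arithmetic: time a loop containing the
// bytecode sequence under test (e.g. iload/iadd), time the same loop
// without it, subtract, and convert the difference to clock cycles.
public class MicroBench {
    static final double CLOCK_HZ = 100_000_000.0; // 100MHz, as for JOP

    // Converts the measured time difference (in nanoseconds) into
    // clock cycles per loop iteration.
    static double cyclesPerOp(long withNanos, long baselineNanos, long loopCount) {
        return (withNanos - baselineNanos) * CLOCK_HZ / 1e9 / loopCount;
    }

    public static void main(String[] args) {
        // Synthetic timings: 0.2s of extra time over 10^7 iterations at
        // 100MHz corresponds to 2 cycles - the iload/iadd result above.
        System.out.println(cyclesPerOp(300_000_000L, 100_000_000L, 10_000_000L));
    }
}
```

On a compiling JVM this subtraction breaks down, as noted above, because the optimizer may remove or reorder the sequence under test.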

The sequence for the branch benchmark (if_icmplt) contains the two load instructions that push the arguments onto the stack. The arguments are then consumed by the branch instruction. This benchmark verifies that a branch requires a constant four cycles on JOP, whether it is taken or not.

During the evaluation of the aJile system, unexpected behavior was observed. The aJ80 on the JStamp board is clocked at 7.3728MHz and the internal frequency can be set with a PLL. The aJ80 is rated for 80MHz and the maximum PLL factor that can be used is therefore ten. Running the benchmarks with different PLL settings gave some strange results. For example, with a PLL multiplier setting of ten, the aJ80 was about 12.8 times faster! Other PLL factors also resulted in a greater than linear speedup. The only explanation we could find was that the internal time used for the benchmarks depends on the PLL setting. A comparison with the wall clock time showed that the internal time of the aJ80 is 23% faster with a PLL factor of 1 and 2.4% faster with a factor of ten – a property we would not expect on a processor that is marketed for real-time systems. The SaJe board can also suffer from the problem described.

B. Execution Time Jitter

For real-time systems, the worst-case execution time (WCET) is of primary importance. We have measured the execution times of several iterations of the main function from the Kfl benchmark. Figure 5 shows the measurements, scaled to the minimum execution time.
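The scaling step can be sketched as below; the class name and the sample values are hypothetical, chosen only to illustrate how each iteration's execution time is normalized to the shortest one observed.

```java
import java.util.Arrays;

// Sketch of the jitter measurement: per-iteration execution times are
// divided by the minimum, so the minimum maps to 1.0 and the largest
// value gives the max/min ratio discussed in the text.
public class JitterMeasure {
    static double[] scaleToMin(long[] samples) {
        long min = Arrays.stream(samples).min().getAsLong();
        return Arrays.stream(samples)
                     .mapToDouble(s -> (double) s / min)
                     .toArray();
    }

    public static void main(String[] args) {
        long[] times = {100, 300, 200, 100}; // hypothetical measurements
        System.out.println(Arrays.toString(scaleToMin(times)));
        // prints [1.0, 3.0, 2.0, 1.0]
    }
}
```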

A period of four iterations can be seen. This period results from simulating the commands from the base station that are executed every fourth iteration. At iteration 10, a command to start the motor is issued. We see the resulting rise in execution time at iteration 12 to process this command. At iteration 54, the simulation triggers the end sensor and the motor is stopped.

The different execution times in the different modes of the application are inherent in the design of the simulation. However, the ratio between the longest and the shortest period is five for the JStamp, four for the gcj system and only three for JOP. Therefore, a system with an aJile processor needs to be 1.7 times faster than JOP in order to provide the same WCET for this measurement. At iteration 33, we can see a higher execution time for the JStamp system that is not seen on JOP. This variation at iteration 33 is not caused by the benchmark.

The execution time under gcj on the Linux system showed some very high peaks (up to ten times the minimum, not shown in the figures). This observation was to be expected, as the gcj/Linux system is not a real-time solution. The Sun JIT solution was also measured, but is omitted from the figure. As a result of the invocation of the compiler at some point during the simulation, the worst-case ratio between the maximum and minimum execution time was 1313 – showing that a JIT compiler is impractical for real-time applications.

It should be noted that execution time measurement is not a safe method for obtaining WCET estimates. However, in situations where no WCET analysis tool is available, it can give some insight into the WCET behavior of different systems.
