Implicit directory state updates


Abstract: This technology, "Implicit directory state updates," was created by R.G. Blankenship on 2016-02-26. A request is received over a link requesting a particular line in memory. A directory state record is identified in the memory, the directory state record identifying a directory state of the particular line. A request type is identified from the request. It is determined, based on the directory state of the particular line and the request type, that the directory state of the particular line is to be changed from a particular state to a new state. In response to receipt of the request, the directory state record is changed to reflect the new state. A copy of the particular line is sent in response to the request.
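The flow summarized in the abstract can be pictured with a short sketch. The following Python is illustrative only, not the patented implementation: the state names (INVALID, SHARED, ANY), the request types, and the transition table are assumptions chosen for the example.

```python
# Minimal sketch of an implicit directory state update: a read request is
# served and the directory record is updated as a side effect of the request.
from enum import Enum

class DirState(Enum):
    INVALID = "I"   # no remote copy tracked
    SHARED = "S"    # line may be cached in shared state
    ANY = "A"       # line may be cached exclusively/modified somewhere

class ReqType(Enum):
    READ_SHARED = 1  # requester wants a shared copy
    READ_OWN = 2     # requester wants ownership (read for write)

# (current directory state, request type) -> new directory state
TRANSITIONS = {
    (DirState.INVALID, ReqType.READ_SHARED): DirState.SHARED,
    (DirState.INVALID, ReqType.READ_OWN):    DirState.ANY,
    (DirState.SHARED,  ReqType.READ_SHARED): DirState.SHARED,
    (DirState.SHARED,  ReqType.READ_OWN):    DirState.ANY,
    (DirState.ANY,     ReqType.READ_SHARED): DirState.SHARED,
    (DirState.ANY,     ReqType.READ_OWN):    DirState.ANY,
}

def handle_read_request(memory, directory, addr, req_type):
    """Serve a read and implicitly update the directory state record."""
    current = directory.get(addr, DirState.INVALID)  # identify the record
    new_state = TRANSITIONS[(current, req_type)]     # from state + request type
    directory[addr] = new_state                      # update on receipt
    return memory[addr]                              # send a copy of the line

memory = {0x40: b"cache-line-data"}
directory = {}
data = handle_read_request(memory, directory, 0x40, ReqType.READ_OWN)
print(data, directory[0x40])  # b'cache-line-data' DirState.ANY
```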

1. An apparatus, comprising:

a buffer device;

a processor block to perform image operations and coupled to the buffer device through a memory access link;

a memory controller coupled to a memory module device comprising a far memory module considered local by the processor block and a near memory module serving as a cache for the far memory module;

an on-chip interconnect fabric to carry data information, the on-chip interconnect fabric configured to couple to an I/O interface and/or an I/O interconnect of an I/O device;

coherency logic to perform a memory coherency operation; and

a graphics domain associated with the processor block.

2. The apparatus of claim 1, wherein the buffer device is to store memory protocol packets transmitted from the memory module device to the processor block.

3. The apparatus of claim 2, wherein credit-based flow control is used to implement a virtual channel for the memory protocol packets.

4. The apparatus of any of claims 1-3, wherein the memory access link uses a buffered memory protocol that maintains coherency states including an exclusive state, a modified state, a shared state, and an invalid state.

5. The apparatus of claim 4, wherein the buffered memory protocol shares message classes with a different interconnect protocol.

6. The apparatus of claim 1, wherein the processor block is to be coupled to another processor block via an HPI compatible link.

7. The apparatus of claim 1, wherein the coherency logic comprises a directory to maintain coherency states comprising an exclusive state, a modified state, a shared state, and an invalid state.

8. An apparatus for communicating data, the apparatus comprising:

a memory controller to:

receive a read request over a link, wherein the read request requests a particular line in a memory;

identify a directory state record in the memory, wherein the directory state record identifies a directory state of the particular line;

identify a type of the read request from the read request;

determine, based on the read request, that the directory state of the particular line is to be changed from a particular state to a new state, wherein the change is to be determined according to the directory state of the particular line and the type of the read request; and

in response to receipt of the read request, change the directory state record to reflect the new state; and

a transmitter to send a copy of the particular line in response to the read request.

9. A software-based controller to implement one or more states of a state machine used by agents in a High Performance Interconnect (HPI), the controller configured to:

identify one or more faults or state exits that cause a state reset as an agent progresses through the state machine toward an operational link state during initialization or re-initialization of a link;

reset a counter for the link upon the link successfully entering a link state, the counter being maintained by agents on both sides of the link; and

suspend initialization of the link when a count of non-productive cycles meets or exceeds a defined threshold;

wherein the states include one or more of an event reset state, a timed reset state, a calibrate reset state, a polling state, a link width state, and a low power link state.

10. The controller of claim 9, wherein the controller is further configured to: store previous state results and configurations for the link and reuse the previous state results and configurations in subsequent initializations and configurations of the link.

Technical Field

The present disclosure relates generally to the field of computing architectures, and more particularly to a cache memory protocol.

Background

Advances in semiconductor processing and logic design have allowed an increase in the amount of logic that may be present on an integrated circuit device. As a corollary, computer system configurations have evolved from a single or multiple integrated circuits in a system to multiple cores, multiple hardware threads, and multiple logical processors present on each integrated circuit, as well as other interfaces integrated within such processors. A processor or integrated circuit typically includes a single physical processor die, where the processor die may include any number of cores, hardware threads, logical processors, interfaces, memories, controller hubs, and the like.

Smaller computing devices have grown in popularity due to their ability to pack more processing power into smaller packages. Smartphones, tablets, ultra-thin notebooks, and other user devices have grown exponentially. However, these smaller devices rely on servers both for data storage and for complex processing that exceeds their form factor. Consequently, demand in the high performance computing market (i.e., the server space) has also increased. For example, modern servers typically contain not only a single processor with multiple cores, but also multiple physical processors (also referred to as multiple sockets) to increase computing power. But as processing power grows along with the number of devices in a computing system, communication between the sockets and other devices becomes more critical.

In fact, interconnects have grown from the more traditional multi-drop buses that primarily handled electrical communications to full-blown interconnect architectures that facilitate fast communication. Unfortunately, as future processors are expected to consume data at even higher rates, a corresponding demand is placed on the capabilities of existing interconnect architectures.

Drawings

FIG. 1 illustrates a simplified block diagram of a system including a serial point-to-point interconnect to connect I/O devices in a computer system, according to one embodiment.

Figure 2 illustrates a simplified block diagram of a layered protocol stack in accordance with one embodiment.

FIG. 3 illustrates an embodiment of a transaction descriptor.

Fig. 4 illustrates an embodiment of a serial point-to-point link.

FIG. 5 illustrates an embodiment of a potential High Performance Interconnect (HPI) system configuration.

Fig. 6 illustrates an embodiment of a layered protocol stack associated with HPI.

FIG. 7 illustrates a representation of an example state machine.

Fig. 8 illustrates a representation of an example flit transmitted over an example 20-lane data link.

FIG. 9 illustrates a representation of an example flit being sent over an example 8 lane data link.

Fig. 10 illustrates a representation of an example multi-slot flit.

FIG. 11 illustrates a representation of an example flit being sent over an example 8 lane data link.

FIG. 12 illustrates a representation of an example multi-slot flit including a debug message.

FIG. 13 illustrates a representation of a system employing a buffer memory protocol.

FIG. 14 illustrates a representation of a buffer device supporting a buffered memory protocol.

FIGS. 15A-15C illustrate representations of example flits in a buffered memory protocol.

FIGS. 16A-16C illustrate example communications between a buffer device and a host device according to some implementations.

FIG. 17 illustrates a representation of an example embodiment of a cache line according to some implementations.

FIG. 18 illustrates an embodiment of a block diagram of a computing system including multiple processor sockets.

FIG. 19 illustrates another embodiment of a block diagram of a computing system including a multicore processor.

Like reference numbers and designations in the various drawings indicate like elements.

Detailed Description

In the following description, numerous specific details are set forth, such as the following examples: specific processor types and system configurations, specific hardware structures, specific architectural and microarchitectural details, specific register configurations, specific instruction types, specific system components, specific processor pipeline stages, specific interconnect layers, specific packet/transaction configurations, specific transaction names, specific protocol exchanges, specific link widths, specific implementations, and operations, etc. in order to provide a thorough understanding of the present invention. It may be evident, however, to one skilled in the art that these specific details need not be employed to practice the presently disclosed subject matter. In other instances, very detailed descriptions of well-known components or methods, such as specific and alternative processor architectures, specific logic circuits/code for described algorithms, specific firmware code, low-level interconnect operations, specific logic configurations, specific manufacturing techniques and materials, specific compiler implementations, specific expressions of algorithms in code, specific power down and gating techniques/logic, and other specific operational details of computer systems, have been avoided in order to avoid unnecessarily obscuring the present disclosure.

Although the following embodiments may be described with reference to energy conservation, energy efficiency, processing efficiency, and so on in particular integrated circuits, such as in computing platforms or microprocessors, other embodiments may be applied to other types of integrated circuits and logic devices. Similar techniques and teachings of the embodiments described herein may be applied to other types of circuits or semiconductor devices that may also benefit from such features. For example, the disclosed embodiments are not limited to server computer systems, desktop computer systems, laptop computers, or Ultrabook™ computers, but may also be used in other devices, such as handheld devices, smartphones, tablets, other thin notebooks, system-on-a-chip (SOC) devices, and embedded applications. Some examples of handheld devices include cellular telephones, Internet protocol devices, digital cameras, Personal Digital Assistants (PDAs), and handheld PCs. Here, similar techniques for high performance interconnects may be applied to increase performance (or even save power) in low power interconnects. Embedded applications typically include microcontrollers, Digital Signal Processors (DSPs), systems on a chip, network computers (NetPCs), set-top boxes, network hubs, Wide Area Network (WAN) switches, or any other system that can perform the functions and operations taught below. Furthermore, the apparatus, methods, and systems described herein are not limited to physical computing devices, but may also relate to software optimizations for energy conservation and efficiency. As may become readily apparent in the following description, embodiments of the methods, apparatus, and systems described herein (whether with reference to hardware, firmware, software, or a combination thereof) may be considered vital to a "green technology" future balanced against performance considerations.

As computing systems continue to advance, the components therein are becoming more complex. The interconnect architecture that couples and communicates between the components has also increased in complexity to ensure that bandwidth requirements are met for optimal component operation. Furthermore, different market segments demand different aspects of interconnect architectures to suit the respective markets. For example, servers require higher performance, while the mobile ecosystem is sometimes able to sacrifice overall performance for power savings. Yet the singular purpose of most fabrics is to provide the highest possible performance with maximum power saving. Moreover, a variety of different interconnects may potentially benefit from the subject matter described herein.

The Peripheral Component Interconnect Express (PCIe) interconnect fabric architecture and the QuickPath Interconnect (QPI) fabric architecture, among other examples, can potentially be improved according to one or more of the principles described herein. For example, a primary goal of PCIe is to enable components and devices from different vendors to inter-operate in an open architecture, spanning multiple market segments: clients (desktops and mobile), servers (standard and enterprise), and embedded and communication devices. PCI Express is a high performance, general purpose I/O interconnect defined for a wide variety of future computing and communication platforms. Some PCI attributes, such as its usage model, load/store architecture, and software interfaces, have been maintained through its revisions, whereas previous parallel bus implementations have been replaced by a highly scalable, fully serial interface. The more recent versions of PCI Express take advantage of advances in point-to-point interconnects, switch-based technology, and packetized protocols to deliver new levels of performance and features. Power management, quality of service (QoS), hot-plug/hot-swap support, data integrity, and error handling are among some of the advanced features supported by PCI Express. Although the primary discussion herein references a new High Performance Interconnect (HPI) architecture, aspects of the invention described herein may be applied to other interconnect architectures, such as a PCIe-compliant architecture, a QPI-compliant architecture, a MIPI-compliant architecture, a high performance architecture, or another known interconnect architecture.

Referring to fig. 1, an embodiment of a fabric comprised of point-to-point links interconnecting a set of components is illustrated. System 100 includes a processor 105 and a system memory 110 coupled to a controller hub 115. Processor 105 may include any processing element, such as a microprocessor, a main processor, an embedded processor, a co-processor, or other processor. Processor 105 is coupled to controller hub 115 through a Front Side Bus (FSB) 106. In one embodiment, the FSB 106 is a serial point-to-point interconnect as described below. In another embodiment, link 106 includes a serial, differential interconnect architecture that is compliant with different interconnect standards.

The system memory 110 includes any memory device, such as random access memory (RAM), non-volatile (NV) memory, or other memory accessible by devices in the system 100. System memory 110 is coupled to controller hub 115 through memory interface 116. Examples of a memory interface include a double data rate (DDR) memory interface, a dual-channel DDR memory interface, and a dynamic RAM (DRAM) memory interface.

In one embodiment, controller hub 115 may comprise a root hub, root complex, or root controller, such as in a PCIe interconnect hierarchy. Examples of controller hub 115 include a chipset, a Memory Controller Hub (MCH), a northbridge, an Interconnect Controller Hub (ICH), a southbridge, and a root controller/hub. The term chipset often refers to two physically separate controller hubs, e.g., a Memory Controller Hub (MCH) coupled to an Interconnect Controller Hub (ICH). Note that current systems often include an MCH integrated with the processor 105, and the controller 115 will communicate with I/O devices in a similar manner as described below. In some embodiments, peer-to-peer routing is optionally supported by the root complex 115.

Here, controller hub 115 is coupled to switch/bridge 120 through serial link 119. Input/output modules 117 and 121, which may also be referred to as interfaces/ports 117 and 121, may include/implement a layered protocol stack to provide communication between controller hub 115 and switch 120. In one embodiment, a plurality of devices can be coupled to the switch 120.

Switch/bridge 120 routes packets/messages from device 125 upstream (i.e., up the hierarchy toward the root complex) to controller hub 115, and downstream (i.e., down the hierarchy away from the root controller) from processor 105 or system memory 110 to device 125. In one embodiment, switch 120 is referred to as a logical assembly of multiple virtual PCI-to-PCI bridge devices. Device 125 includes any internal or external device or component to be coupled to an electronic system, such as I/O devices, Network Interface Controllers (NICs), expansion cards, audio processors, network processors, hard drives, storage devices, CD/DVD ROMs, monitors, printers, mice, keyboards, routers, portable storage devices, Firewire devices, Universal Serial Bus (USB) devices, scanners, and other input/output devices. Often, in PCIe parlance, such a device is referred to as an endpoint. Although not specifically shown, device 125 may include a bridge (e.g., a PCIe-to-PCI/PCI-X bridge) to support legacy or other versions of devices or interconnect fabrics supported by such devices.

Graphics accelerator 130 may also be coupled to controller hub 115 via serial link 132. In one embodiment, graphics accelerator 130 is coupled to an MCH, which is coupled to an ICH. Switch 120, and correspondingly I/O device 125, is then coupled to the ICH. I/O modules 131 and 118 will also implement a layered protocol stack to communicate between graphics accelerator 130 and controller hub 115. Similar to the MCH discussion above, a graphics controller or graphics accelerator 130 itself may be integrated into processor 105.

Turning to fig. 2, an embodiment of a layered protocol stack is illustrated. Layered protocol stack 200 may include any form of layered communication stack, such as a QPI stack, a PCIe stack, a next generation high performance computing interconnect (HPI) stack, or other layered stack. In one embodiment, protocol stack 200 may include a transaction layer 205, a link layer 210, and a physical layer 220. Interfaces such as interfaces 117, 118, 121, 122, 126, and 131 in fig. 1 may be represented as a communication protocol stack 200. A representation as a communication protocol stack may also be referred to as a module or interface implementing/including the protocol stack.

Packets may be used to communicate information between components. Packets may be formed in the transaction layer 205 and the data link layer 210 to convey information from the transmitting component to the receiving component. As the transmitted packets flow through the other layers, they are extended with additional information for processing the packets at those layers. The reverse process occurs at the receiving side and transforms the packet from its physical layer 220 representation to a data link layer 210 representation and ultimately (for transaction layer packets) to a form that can be processed by the transaction layer 205 of the receiving device.
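The layered assembly and disassembly just described can be sketched as follows. This toy Python example is not the PCIe/HPI wire format; the header bytes, the CRC choice, and the framing symbol are illustrative stand-ins.

```python
# Toy illustration of layered packet processing: transaction layer forms the
# packet, link layer adds a sequence id and error detection code, physical
# layer frames it; the receiver reverses each step.
import zlib

def transaction_assemble(payload: bytes) -> bytes:
    return b"HDR" + payload                    # transaction layer: header + payload

def link_assemble(tlp: bytes, seq: int) -> bytes:
    crc = zlib.crc32(tlp).to_bytes(4, "big")   # link layer: CRC over the TLP
    return seq.to_bytes(2, "big") + tlp + crc

def physical_frame(llp: bytes) -> bytes:
    return b"\x7e" + llp + b"\x7e"             # physical layer: framing symbols

def receive(wire: bytes) -> bytes:
    llp = wire[1:-1]                           # strip physical framing
    tlp, crc = llp[2:-4], llp[-4:]             # split off sequence id and CRC
    assert zlib.crc32(tlp).to_bytes(4, "big") == crc  # link-layer check
    return tlp[3:]                             # strip transaction header

wire = physical_frame(link_assemble(transaction_assemble(b"data"), seq=1))
assert receive(wire) == b"data"
```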

In one embodiment, the transaction layer 205 may provide an interface between a device's processing cores and the interconnect architecture, such as the data link layer 210 and the physical layer 220. In this regard, a primary responsibility of the transaction layer 205 may include the assembly and disassembly of packets (i.e., transaction layer packets, or TLPs). The transaction layer 205 may also manage credit-based flow control for TLPs. In some implementations, split transactions may be utilized, i.e., transactions with request and response separated by time, allowing a link to carry other traffic while the target device gathers data for the response, among other examples.

Credit-based flow control may be used to implement virtual channels and networks using the interconnect fabric. In one example, a device may advertise an initial amount of credits for each of the receive buffers in the transaction layer 205. An external device at the opposite end of the link, such as controller hub 115 in FIG. 1, may count the number of credits consumed by each TLP. A transaction may be transmitted if the transaction does not exceed the credit limit. Upon receipt of a response, an amount of credit is restored. One example of an advantage of such a credit scheme is that, provided the credit limit is not encountered, the latency of credit return does not affect performance, among other potential advantages.
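A minimal sketch of the credit scheme follows. The credit counts are illustrative; real advertisement and accounting are per buffer type and defined by the protocol.

```python
# Sketch of credit-based flow control: the sender tracks credits advertised
# by the receiver and only transmits while credits remain; a returned credit
# restores capacity.
class CreditedSender:
    def __init__(self, initial_credits):
        self.credits = initial_credits   # advertised by the receiver at init

    def try_send(self, packet, link):
        if self.credits == 0:            # credit limit reached: stall
            return False
        self.credits -= 1                # each TLP consumes a credit
        link.append(packet)
        return True

    def on_credit_return(self, n=1):
        self.credits += n                # receiver freed buffer space

link = []
tx = CreditedSender(initial_credits=2)
assert tx.try_send("TLP0", link) and tx.try_send("TLP1", link)
assert not tx.try_send("TLP2", link)     # out of credits
tx.on_credit_return()                    # response restores a credit
assert tx.try_send("TLP2", link)
```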

In one embodiment, the four transaction address spaces may include a configuration address space, a memory address space, an input/output address space, and a message address space. The memory space transaction includes one or more of a read request and a write request to transfer data to or from a memory mapped location. In one embodiment, memory space transactions can use two different address formats, for example, a short address format (such as a 32-bit address) or a long address format (such as a 64-bit address). Configuration space transactions may be used to access configuration spaces of various devices connected to the interconnect. Transactions to the configuration space may include read requests and write requests. Message space transactions (or simply, messages) may also be defined to support in-band communication between interconnect agents. Thus, in an example embodiment, the transaction layer 205 may assemble the packet header/payload 206.

Referring quickly to fig. 3, an example embodiment of a transaction-level packet descriptor is illustrated. In one embodiment, the transaction descriptor 300 may be a mechanism for carrying transaction information. In this regard, the transaction descriptor 300 supports the identification of transactions in the system. Other potential uses include tracking modifications to default transaction ordering and association of transactions with channels. For example, the transaction descriptor 300 may include a global identifier field 302, an attribute field 304, and a channel identifier field 306. In the illustrated example, the global identifier field 302 is depicted as including a local transaction identifier field 308 and a source identifier field 310. In one embodiment, the global transaction identifier 302 is unique to all outstanding requests.

According to one implementation, the local transaction identifier field 308 is a field generated by the requesting agent and may be unique to all outstanding requests that require completion for that requesting agent. Further, in this example, source identifier 310 uniquely identifies the requestor agent within the interconnect hierarchy. Thus, along with the source ID 310, the local transaction identifier 308 field provides global identification of transactions within the hierarchy domain.

The attributes field 304 specifies the characteristics and relationships of the transaction. In this regard, the attribute field 304 is potentially used to provide additional information that allows modification of the default handling of the transaction. In one embodiment, the attribute fields 304 include a priority field 312, a reserved field 314, an ordering field 316, and a no-snoop field 318. Here, the priority subfield 312 may be modified by the initiator to assign a priority to the transaction. The reserved attribute field 314 is left reserved for future or vendor-defined usage. Possible usage models using priority or security attributes may be implemented using the reserved attribute field.

In this example, the ordering attribute field 316 is used to supply optional information conveying the type of ordering that may modify the default ordering rules. According to one example implementation, an ordering attribute of "0" denotes that default ordering rules are to apply, whereas an ordering attribute of "1" denotes relaxed ordering, wherein writes may pass writes in the same direction and read completions may pass writes in the same direction. The no-snoop attribute field 318 is used to determine whether a transaction is snooped. As shown, the channel ID field 306 identifies the channel with which the transaction is associated.
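The descriptor fields of FIG. 3 can be modeled as a simple record. The field widths used in global_id() below are assumptions for illustration; the figure defines the actual layout.

```python
# Sketch of the transaction descriptor fields discussed above.
from dataclasses import dataclass

@dataclass
class TransactionDescriptor:
    local_txn_id: int  # unique per outstanding request of the requesting agent
    source_id: int     # uniquely identifies the requester in the hierarchy
    priority: int      # attribute: transaction priority
    ordering: int      # attribute: 0 = default rules, 1 = relaxed ordering
    no_snoop: int      # attribute: 1 = do not snoop caches
    channel_id: int    # virtual channel the transaction is associated with

    def global_id(self):
        # Source ID plus local transaction ID give global identification
        # within the hierarchy domain (8-bit local field assumed here).
        return (self.source_id << 8) | self.local_txn_id

d = TransactionDescriptor(local_txn_id=0x2A, source_id=0x05,
                          priority=0, ordering=1, no_snoop=0, channel_id=3)
print(hex(d.global_id()))  # 0x52a
```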

Returning to the discussion of fig. 2, the link layer 210 (also referred to as data link layer 210) may act as an intermediate stage between the transaction layer 205 and the physical layer 220. In one embodiment, a responsibility of the data link layer 210 is to provide a reliable mechanism for exchanging Transaction Layer Packets (TLPs) between two components on a link. One side of the data link layer 210 accepts TLPs assembled by the transaction layer 205, applies a packet sequence identifier 211 (i.e., an identification number or packet number), calculates and applies an error detection code (i.e., CRC 212), and submits the modified TLPs to the physical layer 220 for transmission across the physical layer to an external device.

In one example, physical layer 220 includes a logical sub-block 221 and an electrical sub-block 222 to physically transmit packets to an external device. Here, the logical sub-block 221 is responsible for the "digital" functions of the physical layer 220. In this regard, the logical sub-block may include a transmit section to prepare outgoing information for transmission by the electrical sub-block 222, and a receiver section to identify and prepare received information before delivering it to the link layer 210.

The electrical sub-block 222 includes a transmitter and a receiver. The transmitter is supplied with symbols by the logical sub-block 221, which the transmitter serializes and transmits to the external device. The receiver is supplied with serialized symbols from the external device and transforms the received signals into a bit stream. The bit stream is de-serialized and supplied to the logical sub-block 221. In one example embodiment, an 8b/10b transmission code is employed, where ten-bit symbols are transmitted/received. Here, special symbols are used to frame a packet with frames 223. In addition, in one example, the receiver also provides a symbol clock recovered from the incoming serial stream.

As described above, although the transaction layer 205, link layer 210, and physical layer 220 are discussed with reference to a particular embodiment of a protocol stack (such as a PCIe protocol stack), the layered protocol stack is not so limited. Indeed, any layered protocol may be included/implemented and employ the features discussed herein. As an example, a port/interface represented as a layered protocol may include: (1) a first layer, the transaction layer, to assemble the packets; a second layer, the link layer, to order the packets; and a third layer, i.e., a physical layer, to transmit the packet. As a specific example, a high performance interconnect layering protocol as described herein is utilized.

Referring next to FIG. 4, an example embodiment of a serial point-to-point architecture is illustrated. A serial point-to-point link may include any transmission path for transmitting serial data. In the illustrated embodiment, a link may include two low-voltage, differentially driven signal pairs: a transmit pair 406/411 and a receive pair 412/407. Accordingly, device 405 includes transmit logic 406 to transmit data to device 410, and receive logic 407 to receive data from device 410. In other words, some implementations of a link include two transmission paths, i.e., paths 416 and 417, and two receive paths, i.e., paths 418 and 419.

A transmission path refers to any path for transmitting data, such as a transmission line, a copper line, an optical line, a wireless communication channel, an infrared communication link, or other communication path. A connection between two devices, such as device 405 and device 410, is referred to as a link, such as link 415. A link may support one lane — each lane represents a set of differential signal pairs (one pair for transmission and one pair for reception). To scale bandwidth, a link may aggregate multiple lanes, denoted by xN, where N is any supported link width, such as 1, 2, 4, 8, 12, 16, 20, 24, 32, 64 or wider.

A differential pair may refer to two transmission paths, such as lines 416 and 417, to transmit differential signals. As an example, line 417 drives from a high logic level to a low logic level (i.e., a falling edge) when line 416 toggles (toggle) from a low voltage level to a high voltage level (i.e., a rising edge). Differential signals potentially exhibit better electrical characteristics, such as better signal integrity (i.e., cross-coupling), voltage overshoot/undershoot, ringing, and other example advantages. This allows for a better timing window, which enables faster transmission frequencies.

In one embodiment, a new High Performance Interconnect (HPI) is provided. The HPI may include a next generation cache coherent, link-based interconnect. As one example, HPI may be used in high performance computing platforms, such as workstations or servers, included in systems where PCIe or another interconnect protocol is typically used to connect processors, accelerators, I/O devices, and the like. However, the HPI is not limited thereto. Alternatively, HPI may be used in any of the systems or platforms described herein. Moreover, the various concepts developed may be applied to other interconnects and platforms, such as PCIe, MIPI, ethernet, USB, QPI, and the like.

To support multiple devices, in one example implementation, the HPI may be Instruction Set Architecture (ISA) agnostic (i.e., HPI may be implemented in multiple different devices). In another scenario, HPI may also be used to connect high performance I/O devices, not just processors or accelerators. For example, a high performance PCIe device may be coupled to HPI through an appropriate translation bridge (i.e., HPI to PCIe). Further, the HPI links may be utilized by many HPI-based devices (such as processors) in various ways (e.g., star, ring, mesh, etc.). FIG. 5 illustrates example implementations of a number of potential multi-socket configurations. The dual-socket configuration 505, as depicted, may include two HPI links; however, in other implementations, one HPI link may be utilized. For larger topologies, any configuration may be utilized, as long as identifiers (IDs) are assignable and there is some form of virtual path, among other additional or alternative features. As shown, in one example, the four-socket configuration 510 has an HPI link from each processor to another processor. But in the eight-socket implementation shown in configuration 515, not every socket is directly connected to every other socket via an HPI link. However, if a virtual path or channel exists between the processors, the configuration is supported. A range of supported processors includes 2-32 in the native domain. Higher numbers of processors may be reached through the use of multiple domains or other interconnects between node controllers, among other examples.

The HPI architecture includes a definition of a layered protocol architecture, including in some examples protocol layers (coherent, non-coherent, and optionally other memory-based protocols), a routing layer, a link layer, and a physical layer. Further, HPI may also include enhancements related to power managers (such as power control units (PCUs)), design for test and debug (DFT), fault handling, registers, security, and other examples. Fig. 6 illustrates an embodiment of an example HPI layered protocol stack. In some implementations, at least some of the layers illustrated in fig. 6 may be optional. Each layer deals with its own level of granularity or quantum of information (the protocol layer 620a, b with packets 630, the link layer 610a, b with flits 635, and the physical layer 605a, b with physical flits (phits) 640). Note that a packet, in some embodiments, may include partial flits, a single flit, or multiple flits, depending on the implementation.

As a first example, the width of physical flit 640 comprises a 1-to-1 mapping of link width to bit (e.g., a 20-bit link width comprises a 20-bit physical flit, etc.). The flits can have larger sizes, such as 184, 192, or 200 bits. Note that if physical flit 640 is 20 bits wide and flit 635 is 184 bits in size, then a fraction of physical flit 640 is used to transmit one flit 635 (e.g., 184-bit flit 635 is transmitted with 9.2 physical flits at 20 bits or 192-bit flits are transmitted with 9.6 physical flits at 20 bits, among other examples). Note that the width of the base link may vary at the physical layer. For example, the number of channels per direction may include 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, etc. In one embodiment, the link layers 610a, b can embed multiple different transactions in a single flit, and one or more headers (e.g., 1, 2, 3, 4) can be embedded within the flit. In one example, HPI partitions headers into corresponding slots to enable multiple messages in a flit to different nodes.
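The phit arithmetic in the example above generalizes directly; a one-line helper reproduces the 9.2 and 9.6 figures:

```python
# Arithmetic from the text: a flit of F bits sent over an N-lane link
# (N-bit phits) occupies F/N phit intervals.
def phits_per_flit(flit_bits: int, lanes: int) -> float:
    return flit_bits / lanes

print(phits_per_flit(184, 20))  # 9.2, as in the example above
print(phits_per_flit(192, 20))  # 9.6
print(phits_per_flit(192, 8))   # 24.0 on an 8-lane link
```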

In one embodiment, the physical layers 605a, b may be responsible for the fast transfer of information over the physical medium (electrical or optical, etc.). The physical link may be point-to-point between two link layer entities, such as layers 605a and 605b. The link layers 610a, b may abstract the physical layers 605a, b from the upper layers and provide the capability to reliably transfer data (as well as requests) and manage flow control between two directly connected entities. The link layer may also be responsible for virtualizing the physical channel into multiple virtual channels and message classes. The protocol layers 620a, b rely on the link layers 610a, b to map protocol messages into the appropriate message classes and virtual channels before handing them to the physical layers 605a, b for transfer across the physical links. The link layers 610a, b may support multiple messages, such as request, snoop, response, writeback, and non-coherent data, among other examples.

The physical layers 605a, b (or PHY) of HPI may be implemented above the electrical layer (i.e., the electrical conductors connecting two components) and below the link layers 610a, b, as illustrated in fig. 6. The physical layer and corresponding logic may reside on each agent and connect the link layers on two agents (A and B) separated from each other (e.g., on devices on either side of a link). The local and remote electrical layers are connected by the physical medium (e.g., wires, conductors, optics, etc.). In one embodiment, the physical layers 605a, b have two major phases: initialization and operation. During initialization, the connection is opaque to the link layer, and signaling may involve a combination of timed states and handshake events. During operation, the connection is transparent to the link layer and signaling is at speed, with all lanes operating together as a single link. During the operation phase, the physical layer transports flits from agent A to agent B and from agent B to agent A. The connection is also referred to as a link and abstracts some physical aspects, including the medium, width, and speed, from the link layers, while exchanging flits and the control/status of the current configuration (e.g., width) with the link layer. The initialization phase includes minor phases, such as polling and configuration. The operation phase also includes minor phases (e.g., link power management states).

In one embodiment, the link layers 610a, b may be implemented so as to provide reliable data transfer between two protocol or routing entities. The link layer may abstract the physical layers 605a, b from the protocol layers 620a, b, may be responsible for the flow control between two protocol agents (A, B), and may provide virtual channel services to the protocol layers (message classes) and routing layers (virtual networks). The interface between the protocol layers 620a, b and the link layers 610a, b may typically be at the packet level. In one embodiment, the smallest transfer unit at the link layer is referred to as a flit, which has a specified number of bits, such as 192 bits or some other denomination. The link layers 610a, b rely on the physical layers 605a, b to frame the physical layers' 605a, b units of transfer (phits) into the link layers' 610a, b units of transfer (flits). In addition, the link layers 610a, b may be logically divided into two parts: a transmitter and a receiver. A transmitter/receiver pair on one entity may be connected to a receiver/transmitter pair on another entity. Flow control is often performed on both a flit and a packet basis. Error detection and correction are also potentially performed on a flit-level basis.

In one embodiment, the routing layers 615a, b may provide a flexible and distributed method to route HPI transactions from a source to a destination. The scheme is flexible, since routing algorithms for multiple topologies may be specified through programmable routing tables at each router (the programming, in one embodiment, is performed by firmware, software, or a combination thereof). The routing functionality may be distributed; routing may be done through a series of routing steps, with each routing step being defined through a lookup of a table at the source, intermediate, or destination router. The lookup at the source may be used to inject an HPI packet into the HPI fabric. The lookup at an intermediate router may be used to route an HPI packet from an input port to an output port. The lookup at the destination port may be used to target the destination HPI protocol agent. Note that the routing layer, in some implementations, may be thin, since the routing tables, and hence the routing algorithms, are not specifically defined by specification. This allows flexibility and a variety of usage models, including flexible platform architectural topologies to be defined by the system implementation. The routing layers 615a, b rely on the link layers 610a, b to provide the use of up to three (or more) virtual networks (VNs). In one example, two deadlock-free VNs, VN0 and VN1, are used, with several message classes defined in each virtual network. A shared adaptive virtual network (VNA) may be defined in the link layer, but this adaptive network may not be exposed directly in routing concepts, since each message class and virtual network may have dedicated resources and guaranteed forward progress, among other features and examples.
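The per-hop table lookups described above can be sketched as follows. The topology, port names, and table contents here are invented for illustration; real tables are programmed by firmware or software.

```python
# Sketch of distributed table-based routing: each hop consults its own
# programmable table to pick an output port until the packet reaches the
# destination protocol agent.
ROUTING_TABLES = {
    "src":  {"dstB": "port1"},   # source lookup injects the packet
    "mid":  {"dstB": "port3"},   # intermediate: destination -> output port
    "dstB": {"dstB": "local"},   # destination targets the protocol agent
}
LINKS = {("src", "port1"): "mid", ("mid", "port3"): "dstB"}

def route(packet_dst, start):
    hop, path = start, [start]
    while ROUTING_TABLES[hop][packet_dst] != "local":
        port = ROUTING_TABLES[hop][packet_dst]  # per-router table lookup
        hop = LINKS[(hop, port)]                # one routing step
        path.append(hop)
    return path

print(route("dstB", "src"))  # ['src', 'mid', 'dstB']
```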

In one embodiment, the HPI may include a coherency protocol layer 620a, b to support agents caching lines of data from memory. An agent wishing to cache memory data may use the coherency protocol to read a line of data to load into its cache. An agent wishing to modify a line of data in its cache may use the coherency protocol to acquire ownership of the line before modifying the data. After modifying a line, the agent may follow protocol requirements to keep it in its cache until it either writes the line back to memory or includes the line in a response to an external request. Lastly, an agent may fulfill external requests to invalidate a line in its cache. The protocol ensures the coherency of the data by dictating the rules that all caching agents may follow. It also provides the means for agents without caches to coherently read and write memory data.

Two conditions may be enforced to support transactions utilizing the HPI coherency protocol. First, the protocol may maintain data consistency, on a per-address basis, among data in the agents' caches and between those data and the data in memory. Informally, data consistency may mean that each valid line of data in an agent's cache represents the most up-to-date value of the data, and that the data transmitted in a coherency protocol packet represents the most up-to-date value of the data at the time it was sent. When no valid copy of the data exists in the caches or in transmission, the protocol may ensure that the most up-to-date value of the data resides in memory. Second, the protocol may provide well-defined commit points for requests. Commit points for reads may indicate when the data is usable; for writes, they may indicate when the written data is globally observable and will be loaded by subsequent reads. The protocol may support these commit points for both cacheable and uncacheable (UC) requests in the coherent memory space.
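The first condition can be expressed as a small invariant check. The MESI-style state names below are an assumption for illustration; the disclosure's own coherency states are defined elsewhere in this text.

```python
# Sketch of the per-line consistency condition above: at most one agent may
# hold a line with write permission, and an exclusive owner excludes sharers.
def check_line_coherence(cache_states):
    """cache_states: list of per-agent states in {'M','E','S','I'}."""
    owners = [s for s in cache_states if s in ("M", "E")]
    sharers = [s for s in cache_states if s == "S"]
    if len(owners) > 1:
        return False   # two writers would break coherency
    if owners and sharers:
        return False   # exclusive owner excludes shared copies
    return True

assert check_line_coherence(["M", "I", "I"])
assert check_line_coherence(["S", "S", "I"])
assert not check_line_coherence(["M", "S", "I"])
```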

In some implementations, the HPI may utilize an embedded clock. The clock signal may be embedded in the data transmitted using the interconnect. With the clock signal embedded in the data, a different and dedicated clock channel may be omitted. This may be useful, for example, because it may allow more device pins to be dedicated to data transfer, particularly in systems where space for pins is at a premium.

A link may be established between two agents on either side of the interconnect. The agent sending the data may be the local agent and the agent receiving the data may be the remote agent. A state machine may be used by both agents to manage various aspects of the link. In one embodiment, the physical layer data path may transport the flits from the link layer to the electrical front end. In one implementation, the control path includes a state machine (also referred to as a link training state machine or the like). The state machine's actions and exits from states may depend on internal signals, timers, external signals, or other information. In practice, some states, such as a few initialization states, may have timers to provide a timeout value to exit a state. Note that in some embodiments, detection refers to detecting an event on both legs of a lane, though not necessarily simultaneously. However, in other embodiments, detection refers to detection of an event by the agent of reference. As one example, debounce refers to sustained assertion of a signal. In one embodiment, HPI supports operation in the event of non-functional lanes. Here, lanes may be dropped in a specific state.

States defined in the state machine may include reset states, initialization states, and operational states, as well as other classes and subcategories. In one example, some initialization states may have a secondary timer that is used to time out of the state (essentially aborting due to a failure to make progress in the state). The abort may include an update of a register, such as a status register. Some states may also have master timer(s) that are used to time master functions in the state. Other states may be defined such that an internal or external signal (such as a handshake protocol) drives a transition from that state to another state, among other examples.

The state machine may also support debugging through single stepping, freezing on initialization abort, and the use of testers. Here, state exits may be postponed/held until the debug software is ready. In some instances, the exit may be postponed/held until a secondary timeout. In one embodiment, the actions and exits may be based on the exchange of training sequences. In one embodiment, the link state machine runs in the local agent clock domain, and transitions from one state to the next coincide with transmitter training sequence boundaries. Status registers may be used to reflect the current state.

Fig. 7 illustrates a representation of at least a portion of a state machine used by agents in one example implementation of HPI. It should be appreciated that the states included in the state table of FIG. 7 are a non-exhaustive list of possible states. For example, some transitions are omitted to simplify the illustration. Also, some states may be combined, divided, or omitted, while other states may be added. A simplified code sketch of such a state machine follows the list below. Such states may include:

Event reset state: entered upon a warm or cold reset event. Default values are restored. Counters (e.g., sync counters) are initialized. May exit to another state, such as another reset state.

Timed reset state: a timed state for in-band reset. May drive a predefined Electrical Ordered Set (EOS) so that remote receivers are able to detect the EOS and also enter the timed reset. The receiver's lanes hold their electrical settings. The agent may exit to the calibrate reset state.

Calibrate reset state: calibration (e.g., receiver calibration state) or turning drivers off, without signaling on the lane. May be a timed state based on a predetermined amount of time in the state. May set the operational speed. May act as a wait state when a port is not enabled. A minimum dwell time may be included. Receiver conditioning or staggered turn-off may be design dependent. May exit to the receiver detect state after a timeout and/or completion of calibration.

Receiver detect state: detects the presence of a receiver on the lane(s). May look for receiver termination (e.g., receiver pull-down insertion). May exit to the calibrate reset state when a specified value is set or when another specified value is not set. May exit to the transmitter calibrate state if a receiver is detected or a timeout is reached.

Transmitter calibrate state: for transmitter calibration. May be a timed state allocated for transmitter calibration. May include signaling on the lane. May continuously drive an EOS, such as an electrical idle exit ordered set (EIEOS). May exit to a compliance state when calibration is complete or when the timer expires. May exit to the transmitter detect state if a counter has expired or a secondary timeout has occurred.

Transmitter detect state: qualifies valid signaling. May be a handshake state where an agent completes actions and exits to the next state based on remote agent signaling. The receiver may qualify valid signaling from the transmitter. In one embodiment, the receiver looks for a wake detect and, if debounced on one or more lanes, looks for it on the other lanes. The transmitter drives a detect signal. May exit to the polling state in response to completion of debounce and/or a timeout for all lanes, or if debounce is not complete on all lanes and there is a timeout. Here, one or more monitor lanes may be kept awake to debounce the wake signal. If debounced, the other lanes are potentially debounced as well. This can enable power savings in low power states.

Polling state: the receiver adapts, locks bits (e.g., by initializing a drift buffer), locks bytes (e.g., by identifying symbol boundaries), and locks training sequence boundaries (e.g., by locking to an Electrical Ordered Set (EOS) and/or a training sequence header). Lanes may then be deskewed. A handshake may also be completed to exit polling to one of several potential states. For example, an exit may be caused (by the agent) to any of the link width state, the compliance state, the loopback marker state, or the reset state. The handshake may include the completion of various defined receiver actions or conditions and the sending of an acknowledgement message (ACK) to trigger the exit to the state corresponding to the completed set of actions and the ACK. For polling deskew, lane-to-lane skew at the remote transmitter may be capped at a first length for the highest speed and at a second length for slow speed. Deskew may be performed in slow mode as well as in operational mode. The receiver may have a specific maximum for deskewing lane-to-lane skew, such as a skew interval of 8, 16, or 32. In some implementations, receiver actions may also include latency fixing. In one embodiment, receiver actions may be completed upon successful deskew of a valid lane map. In one example, a successful handshake may be achieved when a number of consecutive training sequence headers are received with acknowledgements and a number of training sequences with acknowledgements are transmitted after the receiver has completed its actions.

Compliance state: entered from the polling state. An agent may be made a compliance master or slave for validation purposes. The device (acting as the master) may send a compliance pattern to another device (the slave), and the slave may loop the pattern back after retiming it to its local clock (e.g., without undoing any polarity inversion or lane reversal). The compliance pattern may be used to characterize the operation of the analog front end on some subset of lanes when loopback is not functional or is not desired. For example, entering loopback may be pre-conditioned on successful byte lock, TS lock, deskew, latency testing, and other tests that rely on several pieces of digital logic functioning properly. With bit lock completed, compliance may be entered and used for jitter or noise investigation, debugging, link exploration, and other adjustments. In other words, compliance may serve as an alternative exit from polling if the results of polling do not allow direct entry into loopback. The compliance state may drive a supersequence with a transmitter from the master. The receiver looks for wake on a monitor lane, debounces the wake, drops bad lanes, adapts, and bit locks, etc. The slave transmitter may drive the compliance pattern until its receiver actions are complete. Then the loopback is retimed and non-deskewed. The slave receiver does similar monitor and debounce actions. The exit may be to a reset state (such as a timed reset) or to the loopback pattern state to begin testing, among other examples. In the case of an exit to the loopback pattern state, the master may be sent (e.g., by a software controller) to the loopback pattern state to try out more specialized patterns.

Agent loopback marker state: loopback marker is an agent state, but unlike other agent states, the master and slave actions and exits may be different. The loopback slave may undo any polarity inversion and/or lane reversal, but may not descramble or rescramble the looped-back bits. An acknowledgement exchange may not apply to the slave, since it is looping back. Since the slave may deskew before looping back on a symbol boundary, the master may not be forced to re-bytelock or re-deskew, but the master may re-lock the training sequence to avoid locking to some alias. Means to do this may include re-seeding of the LFSR, comparing the TS and/or EIEOS, or some combination of these. The end of the SDS marks the end of loopback setup and the start of pattern generation, checking, and counting.

Agent loopback pattern state (or blocking link state): entered from either the loopback marker state or the compliance state. In the loopback pattern state, instead of control patterns, the master transmitter may send additional specialized patterns to supplement those looped back in the compliance or loopback pattern state. The receiver may receive specialized patterns in loopback and check for errors in the received pattern. For transmitter adaptation, both agents may be masters. For a predetermined period, the transmitter may send a pattern, and the remote receiver may compare this pattern and determine a figure of merit or metric for the received pattern, recorded in a storage element such as a register. The comparison method and metric may be design dependent (e.g., BER with jitter injection). At the end of the period, both agents may exit to reset for the back channel to examine the metric and set up the next iteration of transmitter adaptation.

Link width state: the agent communicates the final lane map to the remote transmitter. The receiver receives and decodes the information. The receiver may record the configured lane map in a structure after checkpointing the previous lane map value in a second structure. The receiver may also respond with an acknowledgement ("ACK"). An in-band reset may be initiated. As one example, a first state initiates the in-band reset. In one embodiment, exit to a next state, such as the flit configuration state, is performed in response to the ACK. Further, a reset signal may also be generated if the frequency of wake detect signal occurrences drops below a specified value (e.g., once per number of unit intervals (UIs), such as 4K UI) prior to entering a low power state. The receiver may hold the current and previous lane maps. The transmitter may use different groups of lanes based on training sequences having different values. In some embodiments, the lane map may not modify some status registers.

Flit lock configuration state: entered by the transmitter, but the state is considered exited (i.e., a secondary timeout is moot) when both the transmitter and receiver have exited to a blocking link state or other link state. In one embodiment, the transmitter's exit to a link state includes the start of a data sequence (SDS) at a Training Sequence (TS) boundary after receiving an alignment signal. Here, the receiver exit may be based on receiving the SDS from the remote transmitter. This state may be a bridge from agent states to the link state. The receiver identifies the SDS. The receiver may exit to the blocking link state (BLS) (or a control window) if the SDS is received after a descrambler is initialized. If a timeout occurs, the exit may be to a reset state. The transmitter drives the lanes with a configuration signal. The transmitter exit may be to reset, BLS, or other states based on conditions or timeouts.

Transmitting link state: a link state. Flits are sent to a remote agent. May be entered from a blocking link state and return to a blocking link state on an event, such as a timeout. The transmitter transmits flits. The receiver receives flits. May also exit to a low power link state. In some implementations, the transmitting link state (TLS) may be referred to as the L0 state.

Blocking link state: a link state. The transmitter and receiver operate in a unified manner. May be a timed state during which link layer flits are held off while physical layer information is communicated to the remote agent. May exit to a low power link state (or another link state based on the design). In one embodiment, the blocking link state (BLS) occurs periodically. The period is referred to as the BLS interval; it may be timed and may differ between slow speed and operational speed. Note that the link layer may be periodically blocked from sending flits so that a physical layer control sequence of some length may be sent, such as during a transmitting link state or a partial width transmitting link state. In some implementations, the blocking link state (BLS) may be referred to as the L0 control (or L0c) state.

Partial width transmitting link state: a link state. Power may be saved by entering a partial width state. In one embodiment, asymmetric partial width refers to each direction of a bidirectional link having different widths, which may be supported in some designs. An example of an initiator, such as a transmitter, sending a partial width indication to enter the partial width transmitting link state is shown in the example of fig. 9. Here, a partial width indication is sent while transmitting on a link with a first width to transition the link to transmitting at a second, new width. A mismatch may result in a reset. Note that speeds may not be altered but widths may be. Therefore, flits are potentially sent at different widths. May be logically similar to the transmitting link state; yet, since there is a smaller width, it may take longer to transmit flits. May exit to other link states, such as a low power link state based on certain received and sent messages, or an exit from the partial width transmitting link state or a link blocking state based on other events. In one embodiment, the transmitter port may turn idle lanes off in a staggered manner, as shown in a timing diagram, to provide better signal integrity (i.e., noise mitigation). Here, non-retryable flits, such as null flits, may be utilized during periods where the link width is changing. The corresponding receiver may drop these null flits and turn idle lanes off in a staggered manner, as well as record the current and previous lane maps in one or more structures. Note that the state and associated status registers may remain unaltered. In some implementations, the partial width transmitting link state may be referred to as a partial L0 (or L0p) state.

Exit partial width transmitting link state: exiting the partial width state. In some implementations, a blocking link state may or may not be used. In one embodiment, the transmitter initiates the exit by sending partial width exit patterns on the idle lanes to train and deskew them. As one example, the exit pattern begins with EIEOS, which is detected and debounced to signal that the lane is ready to begin entry to a full transmitting link state, and may end with SDS or a Fast Training Sequence (FTS) on the idle lanes. Any failure during the exit sequence (e.g., a receiver action, such as deskew not completed before a timeout) stops flit transfer to the link layer and asserts a reset, which is handled by resetting the link when the next blocking link state occurs. The SDS may also initialize the scrambler/descrambler on the lanes to appropriate values.

Low power link state: a lower power state. In one embodiment, it is lower power than the partial width link state, since signaling in this embodiment is stopped on all lanes and in both directions. The transmitter may use the blocking link state to request the low power link state. Here, the receiver may decode the request and respond with an ACK or a NAK; otherwise a reset may be triggered. In some implementations, the low power link state may be referred to as the L1 state.

In some implementations, state transitions can be facilitated to allow states to be bypassed, for instance, when the state actions of those states (such as certain calibrations and configurations) have already been completed. Previous state results and configurations of a link may be stored and reused in subsequent initializations and configurations of the link. Rather than repeating such configurations and state actions, the corresponding states may be bypassed. Traditional systems implementing state bypasses, however, often implement complex designs and expensive validation escapes. Rather than using a traditional bypass, in one example, HPI may utilize short timers in certain states, such as where the state actions do not need to be repeated. This may potentially allow for more uniform and synchronized state machine transitions, among other potential advantages.

In one example, a software-based controller (e.g., via an external control point for the physical layer) may enable a short timer for one or more particular states. For instance, for a state whose actions have already been performed and stored, the state may be short-timed to facilitate a quick exit from that state to a next state. If, however, the previous state action failed, or cannot be applied within the short timer duration, a state exit may be performed. Further, the controller may disable the short timer, for example, when the state actions should be performed anew. A long, or default, timer may be set for each respective state. A state exit may occur if the configuration actions in that state cannot be completed within the long timer. The long timer may be set to a reasonable duration so as to allow the state actions to complete. The short timer, by contrast, may be considerably shorter, making it impossible, in some cases, to perform the state actions without reference back to previously performed state actions, among other examples.
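
To make the interplay between the short and default timers concrete, the following C sketch models one plausible timer-arming policy; the names, fields, and durations are illustrative assumptions, not details taken from the specification.

    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical per-state timer selection: a software-based controller may
     * enable a short timer for a state whose actions were previously performed
     * and stored; otherwise the long (default) timer bounds the state actions. */
    typedef struct {
        const char *name;
        unsigned long default_timeout_us; /* long timer: generous bound for state actions */
        unsigned long short_timeout_us;   /* short timer: quick exit, reuse stored results */
        bool short_timer_enabled;         /* set via an external control point */
        bool stored_results_valid;        /* previous state actions stored and applicable */
    } phy_state_t;

    /* Returns the timeout to arm on state entry. */
    unsigned long arm_state_timer(const phy_state_t *s) {
        if (s->short_timer_enabled && s->stored_results_valid)
            return s->short_timeout_us;   /* bypass-like fast exit */
        return s->default_timeout_us;     /* re-execute state actions under long timer */
    }

    int main(void) {
        phy_state_t cal = { "calibrate", 500000, 200, true, true };
        printf("%s: arming %lu us timer\n", cal.name, arm_state_timer(&cal));
        cal.stored_results_valid = false; /* e.g., controller forces re-calibration */
        printf("%s: arming %lu us timer\n", cal.name, arm_state_timer(&cal));
        return 0;
    }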

In some instances, during initialization (or re-initialization) of a link, as an agent progresses through the state machine toward an operational link state, one or more failures or state exits may occur that cause a state reset (e.g., a reset to a reset state or another state). In effect, initialization of the link may loop through one or more states without completing initialization and entering a link state. In one example, a count may be maintained of the number of unproductive loops through state transitions within the initialization of the link. For instance, a counter may be incremented each time initialization returns to the reset state without reaching a link state. Once the link successfully enters a link state, the counter for the link may be reset. Such counters may be maintained by agents on both sides of the link. Further, a threshold may be set, for instance, by a software-based controller utilizing one or more external control points. When the count of unproductive loops meets (or exceeds) the defined threshold, initialization of the link may be suspended (e.g., set and held at or before the reset state). In some implementations, to recommence initialization and release it from the suspended state, a software-based controller may trigger a restart or re-initialization of the link. In some instances, software-based tools may analyze the nature of the suspended initialization and perform diagnostics, set register values, and perform other operations so as to guard against further looping of the initialization. Indeed, in some implementations, the controller may set a higher counter threshold or even override the counter in connection with restarting the suspended link initialization, among other examples.
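
The counter behavior described above might be modeled as in the following C sketch; the structure, function names, and the example threshold are hypothetical, chosen only to illustrate the count, suspend, and restart flow.

    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical tracker for unproductive initialization loops. */
    typedef struct {
        unsigned count;      /* loops through reset without reaching a link state */
        unsigned threshold;  /* set, e.g., by a software-based controller */
        bool suspended;      /* initialization held at/before the reset state */
    } init_tracker_t;

    void on_return_to_reset(init_tracker_t *t) {
        if (t->suspended) return;
        if (++t->count >= t->threshold)
            t->suspended = true;          /* hold initialization for software diagnosis */
    }

    void on_link_state_reached(init_tracker_t *t) {
        t->count = 0;                     /* successful initialization resets the counter */
    }

    void software_restart(init_tracker_t *t, unsigned new_threshold) {
        t->threshold = new_threshold;     /* software may raise the threshold or override */
        t->count = 0;
        t->suspended = false;             /* release and re-attempt initialization */
    }

    int main(void) {
        init_tracker_t t = { 0, 3, false };
        for (int i = 0; i < 4; i++) on_return_to_reset(&t);
        printf("suspended=%d count=%u\n", t.suspended, t.count);
        software_restart(&t, 10);
        printf("suspended=%d threshold=%u\n", t.suspended, t.threshold);
        return 0;
    }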

In some implementations of HPI, supersequences may be defined, each supersequence corresponding to a respective state or to entry into/exit from a respective state. A supersequence may include a repeating sequence of data sets and symbols. In some instances, the sequence may repeat until the completion of a state or state transition, or until communication of a corresponding event, among other examples. In some instances, the repeating sequence of a supersequence may repeat according to a defined frequency, such as a defined number of Unit Intervals (UIs). A Unit Interval (UI) may correspond to the interval of time for transmitting a single bit on a lane of a link of the system. In some implementations, the repeating sequence may begin with an electrically ordered set (EOS). Accordingly, an instance of the EOS may be expected to repeat in accordance with the predefined frequency. Such ordered sets may be implemented as defined 16-byte codes, which may be represented in hexadecimal format, among other examples. In one example, the EOS of a supersequence may be an EIEOS. In one example, the EIEOS may resemble a low frequency clock signal (e.g., a predefined number of repeating FF00 or FFF000 hexadecimal symbols, etc.). A predefined set of data may follow the EOS, such as a predefined number of training sequences or other data. Such supersequences may be utilized in state transitions, including link state transitions, as well as initialization, among other examples.
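
As a rough illustration of screening for such an EOS, the following C sketch builds and matches a 16-byte FF00-style EIEOS pattern; the byte-level encoding shown is an assumption for illustration (the text also mentions FFF000 variants), not a normative definition.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Illustrative only: one plausible EIEOS encoding is a 16-byte code of
     * repeated FF00 symbols, resembling a low-frequency clock signal. */
    #define EOS_LEN 16

    void build_eieos(uint8_t out[EOS_LEN]) {
        for (int i = 0; i < EOS_LEN; i++)
            out[i] = (i % 2 == 0) ? 0xFF : 0x00;   /* FF00FF00... */
    }

    /* A receiver screening a lane might look for the EOS at the expected
     * repetition boundary (e.g., roughly every 1 KUI within a supersequence). */
    bool matches_eieos(const uint8_t *rx, size_t len) {
        uint8_t ref[EOS_LEN];
        if (len < EOS_LEN) return false;
        build_eieos(ref);
        return memcmp(rx, ref, EOS_LEN) == 0;
    }

    int main(void) {
        uint8_t lane[EOS_LEN];
        build_eieos(lane);
        printf("EIEOS detected: %d\n", matches_eieos(lane, sizeof lane));
        return 0;
    }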

In some implementations of interconnects (such as QPI), termination of a serial data link may be brought on and off, such as when the link is reset or initialized. This approach may introduce complexity and time into the initialization of the link. In some implementations of HPI, termination of the link may be maintained, including during a reset or re-initialization of the link. Further, HPI may permit hot-plugging of devices. When another device is introduced (by hot-plugging or otherwise), the voltage characteristics of the lanes on which the new remote agent is added will change. The local agent may sense these changes in lane voltage to detect the presence of the remote agent and prompt initialization of the link. State machine states and timers may be defined in the state machine to coordinate the detection, configuration, and initialization of the link without termination.

In one implementation, HPI may support re-initialization on an in-band reset without changing the termination values, through the screening of the lanes by the receiving agent for incoming signaling. The signaling may be used to identify good lanes. As an example, the lanes may be screened for any one of a set of predefined signals that are to be sent by a transmitter device so as to facilitate discovery and configuration of the link. In one example, a supersequence may be defined corresponding to one or more initialization or re-initialization tasks. The predefined sequence may include an electrical idle exit ordered set (EIEOS) followed by additional sequence data. In some instances, as each device on either side of a lane becomes active, the devices may begin sending a supersequence corresponding to a particular initialization state, etc. In one embodiment, two types of pin resets may be supported: a power-on (or "cold") reset and a warm reset. A reset initiated by software or originating on one agent (in the physical layer or elsewhere) may be communicated in-band to the other agent. However, due to the use of an embedded clock, an in-band reset may be handled by communication to the other agent using an ordered set, such as a specific electrical ordered set or EIOS.

The ordered set may be sent during initialization, and a PHY control sequence (or "blocking link state") may be sent after initialization. The blocking link state may block the link layer from sending flits. As another example, link layer traffic may be blocked to send a few NULL flits, which may be discarded at the receiver.

As introduced above, in one embodiment, initialization may be done initially at slow speed, followed by initialization at fast speed. Initialization at slow speed uses the default values for the registers and timers. Software then uses the slow speed link to set up the registers, timers, and electrical parameters, and clears the calibration semaphores to pave the way for fast-speed initialization. As one example, initialization may consist of states or tasks such as reset, detect, polling, compliance, and configuration, among potentially other states or tasks.

In one example, a link layer blocking control sequence (i.e., the blocking link state (BLS), or L0c, state) may include a timed state during which link layer flits are held off while the PHY information is communicated to the remote agent. Here, the transmitter and receiver may start a block control sequence timer. Upon expiration of the timers, the transmitter and receiver may exit the blocking state and take other actions, such as exiting to a reset, or exiting to a different link state (or other state), including states that allow flits to be sent across the link.

In one embodiment, link training may be provided and include the transmission of one or more of scrambled training sequences, ordered sets, and control sequences, such as in connection with a defined supersequence. A training sequence symbol may include one or more of: a header, reserved portions, a target latency, a pair number, a physical lane map code reference lane or lane group, and an initialization state. In one embodiment, the header may be sent with an ACK or NAK, among other examples. As an example, the training sequences may be sent as part of supersequences and may be scrambled.

In one embodiment, ordered sets and control sequences are not scrambled or staggered, and are transmitted identically, simultaneously, and completely on all lanes. Valid reception of an ordered set may include checking at least a portion of the ordered set (or the entire ordered set, in the case of a partial ordered set). Ordered sets may include an electrically ordered set (EOS), such as an electrical idle ordered set (EIOS) or an EIEOS. A supersequence may include a start of data sequence (SDS) or a Fast Training Sequence (FTS). Such sets and control supersequences may be predefined, and may have any pattern or hexadecimal representation and any length. For example, ordered sets and supersequences may be 8 bytes, 16 bytes, 32 bytes, or the like in length. As an example, the FTS may additionally be used for fast bit lock during the exit from the partial width transmitting link state. Note that the FTS definition may be per lane and may utilize a rotated version of the FTS.

In one embodiment, a supersequence may include the insertion of an EOS (such as an EIEOS) into a training sequence stream. In one implementation, when signaling starts, the lanes power up in a staggered manner. However, this may result in the initial supersequences appearing truncated at the receiver on some lanes. The supersequence may, however, be repeated at short intervals, e.g., approximately one thousand unit intervals (or ~1 KUI). The training supersequences may additionally be used for one or more of: deskew, configuration, and communicating initialization targets, lane maps, etc. The EIEOS may be used for one or more of: transitioning a lane from an inactive to an active state, screening for good lanes, and identifying symbol and TS boundaries, among other examples.

In one embodiment, the clock may be embedded in the data, so there are no separate clock lanes. Flits sent over the lanes may be scrambled to facilitate clock recovery. As one example, a receiver clock recovery unit may deliver a sampling clock to the receiver (i.e., the receiver recovers the clock from the data and uses it to sample the incoming data). Receivers, in some implementations, continuously adapt to the incoming bit stream. By embedding the clock, pin count (pinout) can potentially be reduced. However, embedding the clock in the in-band data may alter the manner in which an in-band reset is approached. In one embodiment, a blocking link state (BLS) may be utilized after initialization. Also, electrically ordered set supersequences may be used during initialization to facilitate resets (e.g., as described above), among other considerations. The embedded clock may be common between the devices on a link, and the common operational clock may be set during calibration and configuration of the link. For instance, HPI links may reference a common clock with drift buffers. Such an implementation may realize lower latency than elastic buffers used in non-common reference clocks, among other potential advantages. Further, the reference clock distribution segments may be matched to within specified limits.

In some implementations, HPI may support flits with a width that is, in some cases, not a multiple of the nominal lane width (e.g., using a 192-bit flit width and 20 lanes as a purely illustrative example). Indeed, in implementations permitting partial width transmitting states, the number of lanes over which flits are transmitted can fluctuate, even during the life of the link. For example, in some instances the flit width may be a multiple of the number of active lanes at one instant, but not a multiple of the number of active lanes at another instant (e.g., as the link changes state and lane width). In instances where the flit width is not a multiple of the number of active lanes (e.g., the example of a 192-bit flit width on 20 lanes), in some embodiments consecutive flits may be configured to be transmitted so as to overlap on the lanes, thereby preserving bandwidth (e.g., transmitting five consecutive 192-bit flits overlapped on the 20 lanes).

FIG. 8 illustrates a representation of the transmission of consecutive flits overlapped on multiple lanes. For example, FIG. 8 shows a representation of five overlapping 192-bit flits sent over a 20-lane link (the lanes represented by columns 0-19). Each cell of FIG. 8 represents a respective "nibble", or grouping of four bits (e.g., bits 4n+3:4n), included in a flit sent over a 4UI span. For instance, a 192-bit flit can be divided into 48 four-bit nibbles. In one example, nibble 0 includes bits 0-3, nibble 1 includes bits 4-7, and so on. The bits in the nibbles may be sent so as to overlap, or be interleaved (e.g., "swizzled"), such that higher-priority fields of the flit are presented earlier, error detection properties (e.g., CRC) are retained, among other considerations. Indeed, the swizzling scheme may also provide that some nibbles (and their corresponding bits) are sent out of order (e.g., as in the examples of FIGS. 8 and 9). In some implementations, the swizzling scheme may depend on the architecture of the link layer and the format of the flits used in the link layer.

The bits (or nibbles) of a flit whose length is not a multiple of the active lanes may be swizzled, such as according to the example of FIG. 8. For instance, nibbles 1, 3, 5, 7, 9, 12, 14, 17, 19, 22, 24, 27, 29, 32, 34, 37, 39, 42, 44, and 47 may be sent during the first 4UI. Nibbles 0, 2, 4, 6, 8, 11, 13, 16, 18, 21, 23, 26, 28, 31, 33, 36, 38, 41, 43, and 46 may be sent during the next 4UI. In UIs 8-11, only eight nibbles remain of the first flit. These final nibbles (i.e., 10, 15, 20, 25, 30, 35, 40, 45) of the first flit can be sent simultaneously with the first nibbles (i.e., nibbles 2, 4, 7, 9, 12, 16, 20, 25, 30, 35, 40, 45) of the second flit, such that the first and second flits overlap or are swizzled. Using such a technique, in this example, five complete flits can be sent in 48UI, with each flit sent over a fractional 9.6UI period.
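
The bandwidth arithmetic behind this overlap can be checked with a short C program; it simply reproduces the 48 UI and 9.6 UI-per-flit figures above, along with the 48-nibble decomposition, and does not model the swizzle order itself.

    #include <stdio.h>

    /* Verifies the example above: with a 192-bit flit on 20 lanes, flits
     * overlap so that five flits complete in 48 UI, an average of 9.6 UI
     * per flit. Purely illustrative arithmetic. */
    int main(void) {
        const int flit_bits = 192;
        const int lanes = 20;
        const int flits = 5;

        int total_bits = flit_bits * flits;             /* 960 bits */
        int total_ui = total_bits / lanes;              /* 48 UI: exact, since 960 % 20 == 0 */
        double ui_per_flit = (double)flit_bits / lanes; /* 9.6 UI: fractional, hence the overlap */

        printf("%d flits of %d bits on %d lanes: %d UI total, %.1f UI/flit\n",
               flits, flit_bits, lanes, total_ui, ui_per_flit);

        /* Nibble bookkeeping: a 192-bit flit divides into 48 four-bit nibbles,
         * where nibble n carries bits 4n+3:4n. */
        int nibbles = flit_bits / 4;
        printf("nibbles per flit: %d (nibble 0 = bits 3:0, nibble 47 = bits 191:188)\n",
               nibbles);
        return 0;
    }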

In some instances, the swizzling may yield periodic "clean" flit boundaries. For example, in the example of FIG. 8, the boundary at the start of the first flit (the top row of the first flit) may also be referred to as a clean flit boundary, since all lanes transmit the first nibble from the same flit. Agent link layer logic may be configured to identify the swizzling of the lanes and may reconstruct the flit from the swizzled bits. Additionally, physical layer logic may include functionality for identifying when and how to swizzle a stream of flit data based on the number of lanes in use at the moment. Indeed, in transitioning from one link width state to another, agents may configure themselves to identify how the swizzling of the data stream is to be employed.

As described above, links can transition between lane widths, in some instances operating at an original, or full, width and later transitioning to (and from) a partial width utilizing fewer lanes. In some instances, the defined width of a flit may be divisible by the number of lanes. For instance, the example diagram of FIG. 9 illustrates an example wherein the 192-bit flit of the previous examples is transmitted over an 8-lane link. As represented in FIG. 9, the 4-bit nibbles of the 192-bit flit may be evenly distributed and transmitted over the 8 lanes (i.e., since 192 is a multiple of 8). Indeed, a single flit may be sent over 24 UI when operating at an 8-lane partial width. Further, each flit boundary may be clean in the example of FIG. 9. While clean flit boundaries may simplify state transitions, determinism, and other features, allowing for swizzling and occasionally jagged flit boundaries may allow for the minimization of wasted bandwidth on a link.

Additionally, while the example of FIG. 9 shows lanes 0-7 as the lanes remaining active in a partial width state, any set of 8 lanes could potentially be used. Note also that the examples above are for illustrative purposes only. Flits can potentially be defined to have any width. Links may likewise potentially have any link width. Further, the swizzling scheme of a system may be flexibly constructed according to the formats and fields of the flits and the preferred lane widths in the system, among other considerations and examples.

The operation of the HPI PHY logical layer may be independent of the underlying transmission medium, provided the latency does not cause latency fixing errors or timeouts at the link layer, among other considerations.

Link layer

The link layer may abstract the physical layer from the protocol layer, handle flow control between two protocol agents, and provide virtual channel services to the protocol layer (message classes) and the routing layer (virtual networks). In some implementations, the link layer may deal with a fixed quantum of information, termed a flit. In one example, a flit may be defined to be 192 bits in length. However, any range of bits, such as 81-256 (or more), may be utilized in different variations. A large flit size, such as 192 bits, may include format, cyclic redundancy check (CRC), error correction code (ECC), and other features. For instance, a larger flit length may also permit the CRC field to be expanded (e.g., to 16 bits) to handle larger flit payloads. The number of phits or unit intervals (UIs) (e.g., a UI being the time taken to transfer a single bit or phit, etc.) to transfer a single flit may vary with link width. For example, a 20-lane, or bit, link width may transfer a single 192-bit flit in 9.6 UI, while an 8-lane link width transfers the same flit in 24 UI, among other potential examples. Link layer crediting and protocol packetizing may also be on a flit basis.

FIG. 10 illustrates a representation 1000 of a generalized flit for an 8-lane link width. Each column of the representation 1000 may symbolize a link lane and each row a respective UI. In some implementations, a single flit may be subdivided into two or more slots. Distinct messages or link layer headers may be included in each slot, allowing multiple distinct, and in some cases independent, messages corresponding to potentially different transactions to be sent in a single flit. Further, the multiple messages included in the slots of a single flit may also be destined for different destination nodes, among other examples. For instance, the example of FIG. 10 illustrates a flit format with three slots. The shaded portions may represent the portion of the flit included in a respective slot.

In the example of FIG. 10, three slots are provided: slots 0, 1, and 2. Slot 0 may be provided 72 bits of flit space, of which 22 bits are dedicated to message header fields and 50 bits to message payload space. Slot 1 may be provided 70 bits of flit space, of which 20 bits are dedicated to message header fields and 50 bits to message payload space. The difference in message header field space can be optimized to provide that certain message types will be designated for inclusion in slot 0 (e.g., where more message header encoding is utilized). A third slot, slot 2, may be provided that occupies substantially less space than slots 0 and 1, in this case utilizing 18 bits of flit space. Slot 2 may be optimized to handle those messages, such as acknowledges, credit returns, and the like, that do not utilize larger message payloads. Additionally, a floating payload field may be provided that allows an additional 11 bits to be alternatively applied to supplement the payload field of either slot 0 or slot 1.

Continuing with the specific example of FIG. 10, other fields may be global to the flit (i.e., apply across the flit and not to a particular slot). For instance, a header bit may be provided together with a 4-bit flit control field that may be used to designate such information as the virtual network of the flit and identify how the flit is to be encoded, among other examples. Additionally, error control functionality may be provided, such as through a 16-bit cyclic CRC field, among other potential examples.

A flit format may be defined so as to optimize throughput of messages on the link layer. Some traditional protocols have utilized unslotted, smaller flits. For instance, in QPI an 80-bit flit was utilized. While the flit rate of a larger flit (e.g., a 192-bit flit) may be lower, message or packet throughput may be increased by optimizing the use of the flit data. For instance, in some implementations of QPI, the entire 80-bit flit space was utilized regardless of the message size or type. By subdividing a larger flit into slots of predetermined lengths and fields, the 192-bit flit length may be optimized, realizing higher efficiency even in instances when one or more of the available slots are sometimes unused. Indeed, link layer traffic may be assumed to include many different types of messages and traffic, including messages and packets with varying header lengths and fields. The respective lengths and organization of the slots defined in a flit may be defined so as to correspond with the statistical or expected frequency of various messages and the needs of those messages. For instance, two larger slots may be defined for every small slot, to accommodate an expected statistical frequency of messaging using these larger message types and header lengths, among other examples. Further, flexibility may also be provided to further accommodate varied traffic, such as through the floating payload field, as in the example of FIG. 10. In some instances, the flit format may be fixed, including the bits dedicated to particular slots in the flit.

In the example of FIG. 10, a "Hdr" field may be provided for the flit generally and represents a header indication for the flit. In some instances, the Hdr field may indicate whether the flit is a header flit or a data flit. In data flits, the flit may still remain slotted, but the use of certain fields may be omitted or replaced with payload data. In some cases, data fields may include an opcode and payload data. In the case of header flits, a variety of header fields may be provided. In the example of FIG. 10, "Oc" fields may be provided for each slot, the Oc field representing an opcode. Similarly, one or more slots may have a corresponding "msg" field representing a message type of the corresponding packet to be included in the slot (in cases where the slot is designed to handle such packet types, etc.). "DNID" fields may represent a destination node ID, "TID" fields a transaction or tracker ID, and "RHTID" fields either a requester node ID or a home tracker ID, among other potential fields. Further, one or more slots may be provided with payload fields. Additionally, a CRC field may be included within a flit to provide a CRC value for the flit, among other examples.
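
The bit budget described for this flit format can be summarized in a small C sketch; the constants follow the sizes given above, while the treatment of the remaining unaccounted bits as other flit-global fields (e.g., the header indication bit) is an assumption, since the text does not enumerate the full 192-bit layout.

    #include <stdio.h>

    /* Bit budget of the three-slot, 192-bit flit as described in the text. */
    enum {
        FLIT_BITS      = 192,
        SLOT0_HDR_BITS = 22,
        SLOT0_PAY_BITS = 50,
        SLOT1_HDR_BITS = 20,
        SLOT1_PAY_BITS = 50,
        SLOT2_BITS     = 18,
        FLOAT_PAY_BITS = 11,   /* floating payload, usable by slot 0 or slot 1 */
        FLIT_CTRL_BITS = 4,    /* flit control field */
        CRC_BITS       = 16    /* cyclic CRC field */
    };

    int main(void) {
        int accounted = SLOT0_HDR_BITS + SLOT0_PAY_BITS +
                        SLOT1_HDR_BITS + SLOT1_PAY_BITS +
                        SLOT2_BITS + FLOAT_PAY_BITS +
                        FLIT_CTRL_BITS + CRC_BITS;
        printf("accounted: %d bits, remaining flit-global bits: %d\n",
               accounted, FLIT_BITS - accounted);
        /* The floating field supplements exactly one of slot 0 or slot 1: */
        printf("slot 0 payload with float: %d bits\n", SLOT0_PAY_BITS + FLOAT_PAY_BITS);
        printf("slot 1 payload with float: %d bits\n", SLOT1_PAY_BITS + FLOAT_PAY_BITS);
        return 0;
    }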

In some implementations, link width may vary during the life of the link. For instance, the physical layer may transition between link width states, such as to and from a full or original lane width and a different or partial lane width. For example, in some implementations, a link may be initialized to transfer data over 20 lanes. Later, the link may transition to a partial width transmitting state where only 8 lanes are actively used, among many other potential examples. Such lane width transitions may be utilized, for instance, in connection with power management tasks governed by one or more power control units (PCUs), among other examples.

As noted above, link width may influence flit throughput. FIG. 11 is a representation of an example 192-bit flit sent over an 8-lane link, resulting in a throughput of the flit at 24 UI. Further, as shown in the example of FIG. 11, the bits of the flit may be sent out of order in some instances, for example, to send more time-sensitive fields earlier in the transfer (e.g., flit type fields (e.g., data or header flit), opcodes, etc.), or to preserve or facilitate particular error detection or other functionality embodied in the flit, among other examples. For instance, in the example of FIG. 11, bits 191, 167, 143, 119, 95, 71, 47, and 23 are sent in parallel on lanes L7 through L0 during the first UI of the transfer (i.e., UI0), while bits 168, 144, 120, 96, 72, 48, 24, and 0 are sent during the 24th (or final) UI of the flit transfer (i.e., UI23). It should be appreciated that other ordering schemes, flit lengths, lane widths, etc. may be utilized in other implementations and examples.
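
The orderings cited above are consistent with each lane carrying a contiguous, most-significant-bit-first 24-bit span of the flit; the following C sketch encodes that inferred mapping (an illustration, not a normative definition) and reproduces the stated UI0 and UI23 bit positions.

    #include <stdio.h>

    /* Inferred 8-lane bit ordering: lane 7 holds bits 191..168, lane 6 holds
     * bits 167..144, and so on, each span sent MSB first. */
    #define LANES 8
    #define FLIT_BITS 192
    #define UIS (FLIT_BITS / LANES)          /* 24 UI per flit on 8 lanes */

    void map_bit(int bit, int *lane, int *ui) {
        *lane = bit / UIS;                   /* which 24-bit span */
        *ui   = (UIS - 1) - (bit % UIS);     /* span is sent MSB first */
    }

    int main(void) {
        int probe[] = { 191, 167, 143, 119, 95, 71, 47, 23, 168, 0 };
        for (unsigned i = 0; i < sizeof probe / sizeof probe[0]; i++) {
            int lane, ui;
            map_bit(probe[i], &lane, &ui);
            printf("bit %3d -> lane %d, UI %2d\n", probe[i], lane, ui);
        }
        return 0;
    }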

In some instances, the length of a flit may be a multiple of the number of active lanes. In such instances, the flit can be transmitted evenly across all active lanes, and the transfer of the flit can end substantially simultaneously at a clean (i.e., non-overlapping) boundary. For example, as shown in the representation of FIG. 9, the bits of a flit may be considered to be transmitted in consecutive groupings of 4 bits, or "nibbles". In that example, a 192-bit flit is transferred over an 8-lane link. As 192 is a multiple of 8, the entire flit can be cleanly transferred over the 8-lane link in 24 UI. In other instances, the flit width may not be a multiple of the number of active lanes. For instance, FIG. 8 shows a representation of an example 192-bit flit transferred over 20 lanes. As 192 does not divide evenly by 20, transfer of a full flit would require a non-integer number of intervals (e.g., 9.6 UI). In such cases, rather than wasting "extra" lanes not utilized during the 10th UI of the transfer, a second, overlapping flit can be transferred with the final bits of the preceding flit. Such overlapping, or swizzling, of the flits can result in jagged flit boundaries and flit bits sent out of order, in some implementations. The pattern utilized for the transfer can be configured to allow more time-sensitive fields of the flit to be transferred earlier in the flit, preserve error detection and correction, among other considerations. Logic can be provided in one or both of the physical and link layers to transfer flit bits according to such patterns and to dynamically change between the patterns based on the current link width. Further logic can be provided to re-order and re-construct flits from such swizzled or ordered bit streams, among other examples.

In some implementations, flits may be characterized as header flits (e.g., bearing packet header data) or data flits (e.g., bearing packet payload data). Returning to FIG. 10, a flit format may be defined that includes three (3) distinct slots (e.g., 0, 1, and 2), permitting up to three headers to be transferred in a single flit (e.g., one header in each slot). The sizes of the three slots, the split of each between message header and payload space, and the 11-bit floating payload field usable by slot 0 or slot 1 are as described above in connection with FIG. 10.

In some implementations, by allowing a field to float between the two slots, extra bits can be provided for certain messages as needed, while still staying within a predefined flit length (e.g., 192 bits) and maximizing the utilization of the bandwidth. Turning to the example of FIG. 19, two instances 1905, 1910 of an example 192-bit flit are shown on an 8-lane data link. In one instance, a flit (e.g., 1905) may include three slots: slots 0, 1, and 2. Each of slots 0 and 1 may include 50-bit payload fields. The floating field may be provided to alternatively extend the payload field of either slot 0 or slot 1 by the field length (e.g., 11 bits) of the floating field. The use of a floating field can further extend the efficiency gains provided through the defined, multi-slot flit format. The sizing of the slots within the flit, and the types of messages that can be placed in each slot, can potentially provide increased bandwidth even with a reduced flit rate.

In the particular example of FIG. 10, the messages that may use slots 1 and 2 may be optimized, reducing the number of bits to be set aside to encode these slots' opcodes. When a header with more bits than slot 0 can provide enters the link layer, slotting algorithms may be provided to allow it to take over slot 1 payload bits for the additional space. Special control (e.g., LLCTRL) flits may also be provided that consume all three slots' worth of bits for their needs. Slotting algorithms may also exist to allow individual slots to be utilized while other slots carry no information, for cases where the link is partially busy.

In one embodiment, the link layer may additionally define special control flits that may be used, for instance, for debug messages and other uses. In one example, LLCTRL-DEBUG flits may be sent by the link layer when an enable-debug control field is set. When this bit is not set, LLCTRL-DEBUG flits may not be transmitted on the link. Debug packets may be important to expose the internal states of devices connected by HPI that are otherwise inaccessible. The contents of debug packets may also be implementation specific. Contents could include such things as branch information (source and target IPs), time stamps, indications of internal event triggers, and so on. The exposed data may be used, for instance, by monitoring devices such as logic analyzers for post-processing and failure analysis. An example flit encoding of a debug message type is illustrated in FIG. 12.

In some implementations, the principles and protocols of a general purpose I/O interconnect (GPIO) architecture, such as HPI, may be utilized to implement a buffered memory interface and corresponding protocol. For instance, the physical and link layer definitions outlined above may also be implemented in a buffered memory protocol. Indeed, logic of the physical and link layers used to support the GPIO protocol may be reused at interfaces supporting the buffered memory protocol. The buffered memory protocol may also share message classes, such as request, response, and write-back message classes, among other examples. While opcode values within buffered memory protocol messages may be interpreted differently than in HPI (or another GPIO protocol), the same general packet formats may be utilized in both the buffered memory protocol and the GPIO interconnect upon which it is built. Indeed, additional packet and flit fields may be unique to both the buffered memory protocol and the GPIO interconnect protocol.

In some implementations, the buffered memory protocol, utilizing the physical and link layers of HPI, may be a transactional interface and protocol. This can permit data to be returned out of order to a series of received requests (e.g., read requests). In some implementations, the buffered memory interface may be used to connect buffer chips to a CPU. In one implementation, while some protocols send DRAM commands, such as activate, read, write, and refresh, across a buffered memory bus, in some implementations of the buffered memory protocol discussed herein a read or write command may simply be sent with an address. The buffer chip may then decode the specific rank and bank, and break the request down into DRAM commands. Speculative and demand reads may also be supported.

Turning to FIG. 13, a simplified block diagram 1300 is shown illustrating an example topology of a computing system including CPU devices 1305, 1310 interconnected by an HPI-compliant link. Each CPU 1305, 1310 may be similarly coupled to one or more respective buffer devices 1315a-1 using corresponding buffered memory protocol links ("MemLink"). As noted above, in some implementations, the buffered memory protocol interconnect may be based on the GPIO protocol, in that the physical and link layers of the buffered memory protocol are based on the same physical and link layer definitions as the GPIO protocol (e.g., HPI). Although not illustrated in FIG. 13, the CPUs 1305, 1310 may be further connected to one or more downstream devices using the GPIO protocol.

As further shown in the example of FIG. 13, the buffer devices 1315a-1 may be connected to memory devices, such as dual in-line memory module (DIMM) devices. The memory corresponding to each buffer device may be considered local to the CPU (e.g., 1305, 1310) to which the buffer device is connected. However, other devices (including the other CPU) may access the memory by other sockets using GPIO protocol-compliant links. In some implementations, a port running the buffered memory protocol may only support the commands for communicating with memory and only support the buffered memory protocol (i.e., not the GPIO protocol and the buffered memory protocol). Additionally, in some implementations, the GPIO interconnect protocol may support routing and indicate such information (e.g., in its packets) as the requesting and destination node identifiers. The buffered memory protocol, on the other hand, may be a point-to-point interface that does not utilize routing. Consequently, some fields used in the GPIO protocol may be dispensed with in packets sent using the buffered memory interfaces. Instead, fields may be designated for use in carrying address decode information from the host to the buffer, among other examples.

In further implementations, the buffer devices 1315a-1 may support a two-level memory topology, with some amount of fast memory (e.g., DRAM) serving as a cache for a larger, slower memory (e.g., non-volatile memory). In one such implementation, one or more of the buffer devices 1315a-1 may use DDR as near, fast memory and transactional DDR DIMMs as the larger "far" memory, among other examples. Transactional DIMMs may utilize protocols (e.g., DDR-Transactional (DDR-T)) to communicate, using transactional commands, with a volatile memory single in-line memory module (SIMM).

Turning to FIG. 14, a simplified block diagram 1400 is presented showing a more detailed representation of a buffer chip 1315 connected to near and far memory modules (e.g., 1405, 1410). As noted in the example above, the buffer chip 1315 may be communicatively coupled to a CPU device or other processing device 1305 through a buffered memory protocol interconnect link 1415 utilizing a defined buffered memory protocol. The buffer chip 1315 may include logic to perform additional functions. For instance, an implementation of the buffer chip 1315 may include a memory controller 1420, a scheduler 1425, flow control logic 1430, error detection logic 1435, and directory state management logic 1436. The logic may be implemented using hardware circuitry, firmware, and/or software.

In some implementations, the memory controller 1420 may include logic to translate requests into the memory-specific protocol (e.g., DDR4) of its memory devices. The memory controller 1420 may abstract the details of these memory-specific protocols from the host controller logic 1440 of the CPU 1305 (or another host), which communicates with the buffer chip 1315 using the link 1415. The scheduler 1425 may include logic to reorder and arbitrate the responses to requests, in an attempt to achieve higher performance. The buffer chip 1315 may additionally offer features such as flow control and error detection. In some implementations, the flow control logic 1430 may be embodied in the scheduler 1425 or may otherwise interoperate with the scheduler 1425 to ensure greater efficiency and higher performance of the buffer chip 1315. The error detection logic 1435 may include logic to support an error correction code (ECC) detection scheme, as well as logic to initiate the correction or handling of detected errors. The directory state management logic 1436 may maintain a directory state for each line in memory. In response to some transactions, the directory state of a corresponding line may change. The directory state management logic 1436 may update records identifying the current directory state of each line in memory. In some cases, the directory state management logic 1436 may receive data from a home agent (e.g., at the host 1305) indicating a change to the directory state of a line of memory. In other instances, the directory state management logic 1436 may include logic to update the directory state automatically based on the nature of the corresponding request (i.e., without the host indicating the change). This can preserve bandwidth, among other example advantages, in that bandwidth that would otherwise be used to indicate at least a portion of the directory state changes can be saved where the directory state management logic 1436 allows that portion of the directory state changes to be identified directly at the memory controller (e.g., 1420 of the buffer chip 1315).

In some implementations, the CPU 1305 may include host controller logic 1440 that can translate CPU (or host) addresses to memory addresses, including permuting across channels and sockets, among other functionality. The host 1305 may include multiple processing cores 1445a-d that allow multiple processes to execute in parallel on the host 1305. Additionally, the host may make use of one or more caches to cache previously accessed lines from memory (e.g., 1405, 1410) so that the host 1305 can re-access these lines without re-requesting the data from the buffer chip 1315. This may help address the higher latency introduced through the buffer chip 1315 in buffered memory architectures. The caches may include local and intermediate caches, as well as last level caches (LLC) 1450a-d. Caches, including the LLC, may be shared among multiple cores 1445a-d. Indeed, a variety of cache architectures may be utilized with hosts employing one or multiple processing cores.

As noted above, the link layer of the buffered memory protocol may be based on that of a GPIO protocol. For instance, the 192-bit, 3-slot flit of an HPI protocol (e.g., described and shown in FIG. 10) may be utilized in the buffered memory protocol. A packet in the buffered memory protocol may embody a full command. A packet may be implemented as one or more flits. FIGS. 15A-15C can represent implementations of flits in the buffered memory protocol. In some implementations, read data can be sent back from the buffer device to the host via a combination of headered and headerless memory data (MemData) packets. A full cache line can be transferred in three headerless flits (rather than five, as with headered flits). Headerless packets can thereby provide for higher throughput on what may be the most constrained portion of the interface. To fit into three flits, the headerless packet removes a number of fields and makes assumptions about their values. The assumptions made may be true of most, but not all, read returns; if any assumption does not hold, the headered packet, which contains all fields, is to be used. In some implementations, a headered packet may begin a sequence of read returns to allow early identification of a tracker identifier (e.g., HTID, RTID, etc.) for use in mapping a read return to a read request in a transactional buffered memory protocol.

In some implementations, headerless packets may contain the information outlined in Table 1:

table 1: header-less packet field

Field(s) Qty Description of the invention
Data of 512 64 byte cache line data
Directory 2 Two directory bits storing data that is typically part of an ECC bit
HTID 11 A transaction ID is requested. Packet includes data for early HTIDNext one isThe HTID of the data packet is read.
CRC 48 16-bit CRC per flit
HIB 3 The header indicates the bit. One for each microchip. This bit is always 0 for each flit of a header-less packet.

Further, FIGS. 15A-15C illustrate a sequence of three flits (e.g., flit 0 (at FIG. 15A), flit 1 (at FIG. 15B), and flit 2 (at FIG. 15C)) that can be used to transfer an entire cache line in a read return. Table 2 provides a summary of an example of the three flits utilized in a headerless packet:

table 2: header-less packet

Micro-tablet Content providing method and apparatus
Data flit 0 The first 32 bytes of data from the cache line
Data microchip 1 The remaining portion of the first 32 bytes plus the first portion of the second 32 bytes
Data microchip 2 The remaining part of the second 32 bytes

The HIB bit may be set to indicate whether the packet is headered or headerless. The HIB may be set to indicate a headerless packet only for data flits following a headered flit. This allows certain other packets to be interleaved in the middle of a packet. When the interleaved packet is complete, the data portions of the original headered packet can be sent, with HIB=0, in the flits. This technique can also be used to delay a data packet by inserting null flits. In some implementations, the start of a headerless packet is signaled by an HIB bit of 0 when a header would normally have been required. All other packets, including null flits, may contain a header with the HIB bit set.
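
A highly simplified model of this HIB convention is sketched below in C; the fixed packet lengths (a five-flit headered MemData and a three-flit headerless return) follow the discussion around Tables 2 and 3, while the stream contents and the framing loop are invented for illustration and ignore interleaving.

    #include <stdbool.h>
    #include <stdio.h>

    /* Each flit carries a header-indication bit (HIB). Data flits of a packet
     * have HIB = 0; a flit with HIB = 0 arriving where a header flit was
     * expected marks the start of a headerless packet. */
    typedef struct { bool hib; int data_flits_if_header; } flit_t;

    int main(void) {
        /* A headered MemData (1 header + 4 data flits), then a headerless
         * MemData read return (3 data flits). */
        flit_t stream[] = {
            { true, 4 },                                             /* header flit */
            { false, 0 }, { false, 0 }, { false, 0 }, { false, 0 },  /* its data    */
            { false, 0 }, { false, 0 }, { false, 0 },                /* headerless  */
        };
        int owed = 0; /* data flits still owed to the current packet */
        for (unsigned i = 0; i < sizeof stream / sizeof stream[0]; i++) {
            if (owed > 0) {
                owed--;
                printf("flit %u: data flit of current packet\n", i);
            } else if (stream[i].hib) {
                owed = stream[i].data_flits_if_header;
                printf("flit %u: header flit, expecting %d data flits\n", i, owed);
            } else {
                owed = 2; /* headerless MemData: this flit plus two more */
                printf("flit %u: HIB=0 where header expected -> headerless packet\n", i);
            }
        }
        return 0;
    }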

As introduced above, headered packets may be a total of five flits in length and may include one header flit and 4 data flits. To reduce idle latency, the headered packet may send data flit 1 twice. The first time it is sent with only the data from the first 32 bytes and is referred to as flit 1a. This allows the flit to be sent before the second 32 bytes of data are available from the memory device. The result is that the critical 32-byte chunk of data arrives at the host earlier in the idle case. Flit 1 can then be sent a second time, this time complete. It will contain a repeat of the first 32-byte data portion plus the second 32-byte portion. Table 3 describes the five flits of a headered packet:

table 3: packet with header

Micro-tablet Content providing method and apparatus
Header Packet header
Data flit 0 The first 32 bytes of data from the cache line
Data flit 1a Only the remainder of the first 32 bytes. The second 32 byte section is reserved. Other flit payload bits (RTID [6:3 ]]) Same as data flit 1
Data microchip 1 The microchip 1 is completed. The first 32 byte portion is retransmitted. Also includes a second 32-byte portion
Data microchip 2 The remaining portion of the second 32 bytes.

In some implementations, a headered MemData packet may not be cancelled and is to be followed by the data flits. The header is not sent until it is known that the data can also be sent, so the full cache line need not have accumulated in the buffer before the header is sent.

In traditional systems, home agents at the various hosts may be tasked with explicitly updating the directory state of a corresponding line held in memory following each read. In particular, updating the directory state may involve a write-back of the full cache line plus the directory state. Such implementations can result in considerable bandwidth being devoted to updating directory state, particularly in buffered memory architectures where the home agent and memory controller are not tightly coupled. A buffered memory link may be used to interconnect the home agent and the memory controller. Consequently, this buffered memory link may be used by the memory controller both to receive host requests and to receive write-backs whose sole purpose, in some cases, may simply be to designate a read-based change to the directory state of a given line. Further, where the directory state is indicated within the cache line itself, a directory state change can involve the transmission of a full copy of the cache line from the home agent to the memory controller over the buffered memory link, resulting in a high bandwidth cost for such updates.

These and other issues can be addressed through an improved architecture that utilizes new command types to explicitly or implicitly direct the memory controller to update the directory state, where possible, without direction from the home agent or other logic outside the memory controller. Accordingly, significant write bandwidth can be saved on the interface between the home agent and the memory controller. This interface may be a buffered memory link interface. Other architectures may also make use of such a model, such as implementations of memory controllers on an internal mesh fabric of the same system on chip (SoC) as the home agent, among other examples.

FIGS. 16A-16C are diagrams 1600a-c illustrating example read transactions involving a host 1305 and a buffer chip 1315 over a buffered memory interconnect. While this example specifically utilizes read transactions, it should be appreciated that the principles illustrated in the examples of FIGS. 16A-16C can also be applied to other types of requests, including memory invalidation requests. Returning to the illustrations of FIGS. 16A-16C, a memory read 1610 may be sent from the host 1305 to the buffer chip 1315 over the buffered memory interconnect. The memory read, or read request, 1610 may originate from the local host 1305 or from another remote host connected to the local host 1305 (e.g., by an HPI link, among other examples). A memory controller of the buffer chip 1315 may receive the read request 1610 and determine a corresponding memory read operation 1615 to perform to retrieve the corresponding line from memory 1605 in response to the request.

The read request 1610 may be a request according to a buffered memory protocol. The buffered memory protocol read request 1610 may correspond to a read request 1650 of a general purpose input/output (GPIO) interconnect protocol. The GPIO protocol may support several different request types, including read and memory invalidation requests. In one example, the GPIO protocol request types may include:

RdCur: Requests an uncacheable "snapshot" copy of a particular cache line

RdCode: Requests a shared (or "S") copy of a cache line

RdData: Requests an exclusive (or "E") or S copy of a cache line

RdDataMig: Requests a modified (or "M"), E, or S copy of a cache line

RdInvOwn: Requests an M or E copy of a cache line

RdInv: Requests an E copy of a cache line from the home agent; any modified copy is committed to memory before the data is received

InvItoE: Requests ownership of a cache line in the E state, without a data response

InvXtoI: Invalidates the cache line in all caches, including the requesting agent's cache

InvItoM: Requests ownership of a cache line without a data response, with the intent of writing back the modified line shortly afterward

NonSnpRd: Requests an uncacheable "snapshot" copy of a cache line from memory

As shown in the above examples, a set of coherence states may be maintained for the GPIO protocol, including shared ("S"), exclusive ("E"), modified ("M"), and invalid ("I") states, among other potential examples.

The buffered memory protocol may itself define a variety of different request types. The buffered memory protocol may also maintain a directory state for each cache line, including any ("A") (e.g., any of the M, E, S, or I coherence states), shared ("S") (e.g., either the S or I coherence state), exclusive ("E"), invalid ("I") (e.g., guaranteed to be in the I coherence state), or unknown, among other examples. In one example, the read request types of the buffered memory protocol may include the following (a short transition sketch follows the list):

MemRd: Normal memory read. Directory information is left unmodified

MemRdXtoI: Memory read with a directory result of I. If the directory indicates anything other than I, the line is written back with the directory set to I

MemInvXtoI: Memory invalidate with a directory result of I. No read data is returned. The memory controller reads the cache line and, if the directory is not already set to I, rewrites the line with the directory set to I

MemRdXtoS: Memory read with a directory result of S. If the directory indicates anything other than S, the line is written back with the directory set to S

MemRdData: Memory read, rewriting (if necessary) with the following directory results: I goes to A; S is unchanged; A is unchanged. If a change away from A is required, the host must make that change explicitly

MemRdXtoA: Memory read with a directory result of A. If the directory indicates anything other than A, the line is written back with the directory set to A

MemInvXtoA: Memory invalidate with a directory result of A. No read data is returned. The memory controller reads the cache line and, if the directory is not already set to A, rewrites the line with the directory set to A.
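
Taken together, these command semantics suggest the kind of implicit directory update a memory controller might apply. The following C sketch models the transitions described above, including skipping the rewrite when the directory is already in the target state; it is an illustrative model with hypothetical names, not the controller's actual logic.

    #include <stdio.h>

    typedef enum { DIR_I, DIR_S, DIR_A } dir_t;
    typedef enum {
        MEM_RD,          /* directory unmodified */
        MEM_RD_X_TO_I,   /* directory result I   */
        MEM_INV_X_TO_I,  /* result I, no data returned */
        MEM_RD_X_TO_S,   /* directory result S   */
        MEM_RD_DATA,     /* I -> A; S and A unchanged  */
        MEM_RD_X_TO_A,   /* directory result A   */
        MEM_INV_X_TO_A   /* result A, no data returned */
    } mem_cmd_t;

    static const char *dir_name(dir_t d) {
        return d == DIR_I ? "I" : d == DIR_S ? "S" : "A";
    }

    /* Returns the new directory state; *rewrite is set if the controller must
     * write the line back with updated directory bits. */
    dir_t apply_cmd(mem_cmd_t cmd, dir_t cur, int *rewrite) {
        dir_t next = cur;
        switch (cmd) {
        case MEM_RD:                                        break;
        case MEM_RD_X_TO_I: case MEM_INV_X_TO_I: next = DIR_I; break;
        case MEM_RD_X_TO_S:                      next = DIR_S; break;
        case MEM_RD_DATA:   if (cur == DIR_I)    next = DIR_A; break;
        case MEM_RD_X_TO_A: case MEM_INV_X_TO_A: next = DIR_A; break;
        }
        *rewrite = (next != cur); /* skip the write when already in the target state */
        return next;
    }

    int main(void) {
        int rw;
        dir_t n = apply_cmd(MEM_RD_X_TO_S, DIR_A, &rw);
        printf("MemRdXtoS on A -> %s (rewrite=%d)\n", dir_name(n), rw);
        n = apply_cmd(MEM_RD_DATA, DIR_I, &rw);
        printf("MemRdData  on I -> %s (rewrite=%d)\n", dir_name(n), rw);
        n = apply_cmd(MEM_RD_DATA, DIR_S, &rw);
        printf("MemRdData  on S -> %s (rewrite=%d)\n", dir_name(n), rw);
        return 0;
    }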

A buffered memory protocol read request type may correspond to a GPIO protocol read request type. The buffer or memory controller directory may, in some cases, cover only a subset of the caching agents, and the home logic may therefore be responsible for issuing the correct memory command depending on whether the requesting caching agent is covered by the directory. In one implementation, caching agents may be grouped into local and remote caching agents (such as in the examples of Tables 4 and 5), although alternative groupings may be used in other implementations. In the examples of Tables 4 and 5, the "local" requestors are agents not covered by the directory, and the "remote" caching agents are covered by the directory. Tables 4 and 5 set out associations between GPIO protocol read request types and buffered memory (MEM) protocol request types. Table 4 covers read requests from "local" hosts or caching agents. Table 5 covers read requests originating from "remote" hosts or caching agents.

Table 4: directory state transition for read requests to a local caching agent

Table 5: directory state transition for read requests of remote caching agents

As shown in FIGS. 16A-16C, in some implementations a cache line 1630 may include a field or bits (e.g., 1635) encoded to identify the directory state of the cache line 1630. In response to a read request, a copy of the cache line 1630 may be returned to the requesting host (e.g., 1305) with the encoded directory state bits 1635 included. Turning to the example of FIG. 16B, in some instances a read transaction may result in a change to the directory state of the corresponding cache line. Indeed, Tables 4 and 5 set out the state transitions that may take place based on the type of the read request and the directory state of the requested cache line at the time of the read request. Further, in some implementations, the directory state transition may additionally depend on whether the read request originated from a local or a remote caching agent. For instance, an implementation may use the directory state primarily to track remote agents (e.g., whether a remote agent has a shared copy, an exclusive copy, etc.). In some implementations, the home agent may only be able to know either the agents covered by the directory (e.g., remote agents) or those not covered by the directory (e.g., local agents), resulting in these determinations being made only for one or the other of the remote or local agents. The directory bits may reflect the directory state in remote sockets and may be used to determine whether remote snoops should be performed. In some implementations, the home agent may hold its own copy of the cache line, and a read request from the home agent may typically cause the directory state to transition to the invalid ("I"), or "clean", state.

As an example, in the illustration of FIG. 16B, the read request 1610 may result in a copy 1630 of the line being retrieved (at 1620) by the memory controller of the buffer chip 1315 and forwarded (at 1640) to the host 1305 (over the buffered memory link). In this example, the directory state of the cache line 1630 may be "A" when the request 1610 is received. The directory state may be encoded in directory state bits 1635. The host 1305 may change the directory state upon receiving the cache line 1630. In some cases, the change may be based on the results of snoops performed by the home agent of the host. For instance, when the request is an "RdCur" read request, the host 1305 may snoop remote caching agents in connection with the request 1610. In the example of FIG. 16B, based on the results of the snoop, the host determines that the directory state should be updated to "S" (as reflected in bits 1635a). To allow this directory update to be conveyed to the buffered memory, the host may write back (at 1650) the cache line 1630 with the updated directory state bits 1635a. The memory controller of the buffer chip may then write (at 1655) the updated cache line, with the updated directory state bits 1635a, to memory.

In some cases, the memory controller of the buffer chip 1315 may infer a change to the directory state without the change being communicated back to the buffer chip (e.g., through a write-back (e.g., 1650)). For instance, as illustrated in Tables 4 and 5, some combinations of read request type and current directory state will always result in the same directory change (or no change). In Tables 4 and 5, the "New directory state" columns show the directory state that results from a read request of a particular type when the cache line has a particular directory state (and depending on whether the request originated from a local or remote agent). As an example, as shown in Table 5, when a read request is from a remote agent and is a "RdData" request while the directory state is in the "I" state, a change of the directory state to the "A" state can be reliably predicted, among other examples including those illustrated in Tables 4 and 5.

In Tables 4 and 5, where the "New directory state" is indicated as "-", it can be predicted that no change to the directory state will result from the read request. Where the "New directory state" is indicated as "*", a change is possible but not predictable. For instance, any directory state change may depend on the results of snoops performed by the requesting host. By way of example, in Table 5, when a read request is from a remote agent and is a "RdData" request while the directory state is in the "A" state, snoops may be performed by the home agent, resulting in potentially unpredictable changes to the directory state of the cache line. Accordingly, in the cases indicated by a "*", the directory state may sometimes only be updated by the memory controller through a write-back or other explicit communication from the home agent. Where the new directory state is predictable, the memory controller can update the directory state directly, without the intervention of the home agent.

For illustration, FIG. 16C shows an example of a read request 1610 being received while a cache line is in the "A" state (as recorded in directory state bits 1635). In this example, based on the type of read request 1610, the cache line being in the "A" state, and the source of the request, the memory controller may determine that the home agent will update the directory state to "I", as shown. The memory controller may identify the type and source of the read request from the fields of the request and derive from this information how the directory state is to be changed. The memory controller may preemptively change the directory state bits of the cache line copy in memory (at 1665) by writing a new copy of the cache line with the new directory state bits (1635b) to memory (at 1645), reflecting the change the home agent will make (at 1660) to the directory state. Accordingly, the memory controller of the buffer chip 1315 may be equipped with logic to determine whether a directory state change can be predicted based on a particular read request type, from a particular caching agent, when a cache line is in a particular directory state. The memory controller may likewise determine when it should instead wait and rely on communications from the home agent to identify unpredictable changes to the directory state. Similarly, the home agent may include logic for determining when updated directory state bits should be written back (or otherwise communicated) to facilitate updating the directory state of cache lines recorded in memory, or when the home agent may forego such communication based on an understanding that a separate memory controller will perform the update directly. Both the memory controller and the home agent may include, or have access to, a data structure with machine-readable data embodying the possible directory state changes, such as the information described in example Tables 4 and 5, among other examples.
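
The read path described above might be organized as in the following sketch, which builds on predict_dir_state() from the previous example. The cache_line_t layout and the mem_read_line(), mem_write_line(), and link_send_line() hooks are hypothetical stand-ins for the buffer chip's internals, not the document's actual interfaces.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical view of a stored line: 64 bytes of payload plus a metadata
 * byte whose low two bits hold Dir[1:0] (per Table 6 below). req_type_t,
 * dir_state_t, and predict_dir_state() come from the previous sketch; the
 * memory and link hooks are assumed internals, declared but not defined. */
typedef struct { uint8_t data[64]; uint8_t meta; } cache_line_t;

extern cache_line_t mem_read_line(uint64_t addr);
extern void mem_write_line(uint64_t addr, const cache_line_t *line);
extern void link_send_line(const cache_line_t *line);

void handle_read(req_type_t req, bool remote_agent, uint64_t addr)
{
    cache_line_t line = mem_read_line(addr);
    dir_state_t cur = (dir_state_t)(line.meta & 0x3);  /* Dir[1:0] */
    dir_state_t next;

    /* Return the copy to the host, directory state bits included. */
    link_send_line(&line);

    if (predict_dir_state(req, cur, remote_agent, &next)) {
        /* Predictable transition (as in FIG. 16C): update the stored
         * directory bits directly, with no home agent involvement. */
        line.meta = (uint8_t)((line.meta & ~0x3u) | (uint8_t)next);
        mem_write_line(addr, &line);
    }
    /* Otherwise (as in FIG. 16B): do nothing now; the home agent resolves
     * the state after snooping and conveys it via an explicit writeback,
     * which arrives as an ordinary write and refreshes the stored bits. */
}
```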

FIG. 17 is a simplified representation 1700 of a cache line 1630, according to some examples. In this particular example, the cache line may include payload data 1705 as well as error correction code (ECC) data (e.g., in ECC field 1710). In this example, directory state bits 1635 may be included in the ECC bits 1710 as stored in DRAM, or pushed out of the ECC bits 1710. For instance, in one example, the directory state record may be two directory bits stored within a 64-byte cache line. The directory state bits 1635 may be set by the host device at write time, updated directly by the memory controller in some read transactions (e.g., as in the example of FIG. 16C), or updated based on a communication by the host in other read transactions (e.g., as in the example of FIG. 16B). In one example, a two-bit directory state record may be implemented according to the following example of Table 6:

Table 6: Directory state encoding

Dir[1]=0, Dir[0]=0: State "I" (invalid). The cache line is not cached in the remote socket cache; the local socket cache may hold it in I, S, or F. (Remote socket cache: uncached. Local socket cache: any state.)

Dir[1]=0, Dir[0]=1: Not used.

Dir[1]=1, Dir[0]=1: State "S" (shared). The cache line is cached in the S or F state, or not cached at all. (Remote socket cache: I, S, or F. Local socket cache: I, S, or F.)

Dir[1]=1, Dir[0]=0: State "A" (any). The cache line may be in any cache state (M, E, F, S, or I) in the remote socket, and in any state (M, E, F, S, or I) in the local socket. (Remote socket cache: any state. Local socket cache: any state.)

Other implementations may support additional or alternative directory states. In addition, other implementations may use additional directory state bits. Further, while tables 4 and 5 list various read request types and predicted directory state changes resulting from these read requests, other implementations may support alternative sets of read request types and predicted impacts on directory states based on read request types, directory states used, and the source of the read request, as well as other alternatives and examples.
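
As a concrete (hypothetical) illustration of the two-bit record, the helpers below pack and unpack the Table 6 Dir[1:0] codes in a per-line metadata byte, such as spare bits of the ECC field; the exact placement of the bits within the ECC field is implementation-specific.

```c
#include <assert.h>
#include <stdint.h>

/* dir_state_t (DIR_I = 0x0, DIR_A = 0x2, DIR_S = 0x3) is the enum from the
 * prediction sketch above; its values are exactly the Dir[1:0] codes of
 * Table 6. Encoding 0x1 (Dir[1]=0, Dir[0]=1) is listed as "not used". */

/* Extract Dir[1:0] from a line's metadata byte (e.g., spare ECC bits). */
static inline dir_state_t decode_dir(uint8_t meta)
{
    return (dir_state_t)(meta & 0x3);
}

/* Replace Dir[1:0] in the metadata byte, leaving the other bits untouched. */
static inline uint8_t encode_dir(uint8_t meta, dir_state_t s)
{
    assert(s != 0x1);  /* reserved encoding in Table 6 */
    return (uint8_t)((meta & ~0x3u) | (uint8_t)s);
}
```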

The HPI and associated cache access protocols may be incorporated into any variety of computing devices and systems, including mainframes, server systems, personal computers, mobile computers (such as tablets, smartphones, personal digital assistants, etc.), smart appliances, gaming or entertainment consoles, and set-top boxes, among other examples. For example, referring to FIG. 18, a block diagram of a second system 1800 is shown in accordance with an embodiment of the present invention. As shown in FIG. 18, multiprocessor system 1800 is a point-to-point interconnect system and includes a first processor 1870 and a second processor 1880 coupled via a point-to-point interconnect 1850. Each of processors 1870 and 1880 may be some version of a processor. In one embodiment, 1852 and 1854 are part of a serial, point-to-point coherent interconnect fabric, such as a high-performance architecture. As a result, the invention may be implemented within a QPI architecture.

Although shown with only two processors 1870, 1880, it is to be understood that the scope of the present invention is not limited in this regard. In other embodiments, one or more additional processors may be present in a given processor. Processors 1870 and 1880 are shown including integrated memory controller units 1872 and 1882, respectively. Processor 1870 also includes as part of its bus controller units point-to-point (P-P) interfaces 1876 and 1878; similarly, second processor 1880 includes P-P interfaces 1886 and 1888. Processors 1870, 1880 may exchange information via a point-to-point (P-P) interface 1850 using P-P interface circuits 1878, 1888. As shown in fig. 18, IMCs 1872 and 1882 couple the processors to respective memories, namely a memory 1832 and a memory 1834, which memory 1832 and memory 1834 may be portions of main memory locally attached to the respective processors.

Processors 1870, 1880 each exchange information with a chipset 1890 via individual P-P interfaces 1852, 1854 using point-to-point interface circuits 1876, 1894, 1886, 1898. Chipset 1890 also exchanges information with a high-performance graphics circuit 1838 via an interface circuit 1892 along a high-performance graphics interconnect 1839.

A shared cache (not shown) may be included in either processor or outside of both processors, yet connected with the processors via a P-P interconnect, such that local cache information of either or both processors may be stored in the shared cache if a processor is placed into a low power mode.

Chipset 1890 may be coupled to a first bus 1816 via an interface 1896. In one embodiment, first bus 1816 may be a Peripheral Component Interconnect (PCI) bus, or a bus such as a PCI Express bus or another third-generation I/O interconnect bus, although the scope of the present invention is not so limited.

As shown in FIG. 18, various I/O devices 1814 are coupled to first bus 1816, along with a bus bridge 1818 which couples first bus 1816 to a second bus 1820. In one embodiment, second bus 1820 includes a Low Pin Count (LPC) bus. Various devices are coupled to second bus 1820 including, for example, a keyboard and/or mouse 1822, communication devices 1827, and a storage unit 1828 such as a disk drive or other mass storage device which, in one embodiment, often includes instructions/code and data 1830. Further, an audio I/O 1824 is shown coupled to second bus 1820. Note that other architectures are possible, where the included components and interconnect architectures vary. For example, instead of the point-to-point architecture of FIG. 18, a system may implement a multi-drop bus or other such architecture.

Referring now to FIG. 19, a block diagram of an embodiment of a multicore processor is shown. As shown in the embodiment of FIG. 19, processor 1900 includes multiple domains. Specifically, a core domain 1930 includes a plurality of cores 1930A-1930N, a graphics domain 1960 includes one or more graphics engines having media engines 1965, and a system agent domain 1910 is also present.

In various embodiments, system agent domain 1910 handles power control events and power management, such that the units of domains 1930 and 1960 (e.g., cores and/or graphics engines) are independently controllable to dynamically operate at an appropriate power mode/level (e.g., active, turbo, sleep, hibernate, deep sleep, or other Advanced Configuration and Power Interface (ACPI)-like states) in light of the activity (or inactivity) occurring in a given unit. Each of domains 1930 and 1960 may operate at a different voltage and/or power, and furthermore the individual units within a domain each potentially operate at an independent frequency and voltage. Note that while only three domains are shown, it is to be understood that the scope of the present invention is not limited in this regard and additional domains may be present in other embodiments.

As shown, each core 1930 includes a low-level cache in addition to various execution units and additional processing elements. Here, the various cores are coupled to each other and to a shared cache memory, which is formed from a plurality of units or slices 1940A-1940N of a Last Level Cache (LLC); these LLCs often include memory and cache controller functionality and are shared between the cores and potentially also between the graphics engines.

As seen, ring interconnect 1950 couples the cores together and provides interconnection between the core domain 1930, graphics domain 1960, and system agent circuitry 1910 via a plurality of ring stops 1952A-1952N, each at a coupling between a core and an LLC slice. As seen in FIG. 19, interconnect 1950 is used to carry various information, including address information, data information, acknowledgement information, and snoop/invalidate information. Although a ring interconnect is illustrated, any known on-die interconnect or fabric may be utilized. As an illustrative example, some of the fabrics discussed above (e.g., another on-die interconnect, an Intel On-chip System Fabric (IOSF), an Advanced Microcontroller Bus Architecture (AMBA) interconnect, a multi-dimensional mesh fabric, or another known interconnect architecture) may be utilized in a similar fashion.

As further depicted, system agent domain 1910 includes a display engine 1912 which is to provide control of, and an interface to, an associated display. System agent domain 1910 may include other elements, such as: an integrated memory controller 1920 that provides an interface to a system memory (e.g., a DRAM implemented with multiple DIMMs); and coherency logic 1922 to perform memory coherency operations. Multiple interfaces may be present to enable interconnection between the processor and other circuitry. For example, in one embodiment at least one Direct Media Interface (DMI) 1916 interface is provided, as well as one or more PCIe interfaces 1914. The display engine and these interfaces typically couple to memory via a PCIe bridge 1918. Still further, one or more other interfaces may be provided to enable communications between other agents (such as additional processors or other circuitry).

While the present invention has been described with respect to a limited number of embodiments, those skilled in the art will appreciate numerous modifications and variations therefrom. It is intended that the appended claims cover all such modifications and variations as fall within the true spirit and scope of the present invention.

A design may go through various stages, from creation to simulation to fabrication. Data representing a design may represent the design in a number of ways. First, as is useful in simulations, the hardware may be represented using a Hardware Description Language (HDL) or another functional description language. Additionally, a circuit-level model with logic and/or transistor gates may be produced at some stages of the design process. Furthermore, most designs, at some stage, reach a level of data representing the physical placement of various devices in the hardware model. In the case where conventional semiconductor fabrication techniques are used, the data representing the hardware model may be the data specifying the presence or absence of various features on different mask layers for masks used to produce the integrated circuit. In some implementations, such data may be stored in a database file format such as Graphic Data System II (GDS II), Open Artwork System Interchange Standard (OASIS), or a similar format.

In some implementations, software-based hardware models, as well as HDL and other functional description language objects, can include Register Transfer Language (RTL) files, among other examples. Such objects may be machine-parsable, such that a design tool can accept an HDL object (or model), parse the HDL object for attributes of the described hardware, and determine a physical circuit and/or on-chip layout from the object. The output of the design tool may be used to manufacture the physical device. For instance, a design tool can determine configurations of various hardware and/or firmware elements from the HDL object, such as bus widths, registers (including sizes and types), memory blocks, physical link paths, and fabric topologies, among other attributes that would be implemented in order to realize the system modeled in the HDL object. Design tools can include tools for determining the topology and fabric configurations of a system on chip (SoC) and other hardware devices. In some instances, the HDL object can be used as the basis for developing models and design files that can be used by manufacturing equipment to manufacture the described hardware. Indeed, an HDL object itself can be provided as an input to manufacturing system software to cause the manufacture of the described hardware.

In any representation of the design, the data may be stored in any form of a machine-readable medium. A memory or a magnetic or optical storage, such as a disc, may be the machine-readable medium to store information transmitted via optical or electrical waves modulated or otherwise generated to transmit such information. When an electrical carrier wave indicating or carrying the code or design is transmitted, to the extent that copying, buffering, or re-transmission of the electrical signal is performed, a new copy is made. Thus, a communication provider or a network provider may store, at least temporarily, on a tangible, machine-readable medium, an article, such as information encoded into a carrier wave, embodying techniques of embodiments of the present disclosure.

A module as used herein refers to any combination of hardware, software, and/or firmware. As an example, a module includes hardware, such as a microcontroller, associated with a non-transitory medium to store code adapted to be executed by the microcontroller. Therefore, reference to a module, in one embodiment, refers to the hardware which is specifically configured to recognize and/or execute the code to be held on the non-transitory medium. Furthermore, in another embodiment, use of a module refers to the non-transitory medium including the code, which is specifically adapted to be executed by the microcontroller to perform predetermined operations. And as can be inferred, in yet another embodiment, the term module (in this example) may refer to the combination of the microcontroller and the non-transitory medium. Often, module boundaries that are illustrated as separate commonly vary and potentially overlap. For example, a first and a second module may share hardware, software, firmware, or a combination thereof, while potentially retaining some independent hardware, software, or firmware. In one embodiment, use of the term logic includes hardware such as transistors or registers, or other hardware such as programmable logic devices.

In one embodiment, use of the phrase "configured to" refers to arranging, putting together, manufacturing, offering to sell, importing, and/or designing an apparatus, hardware, logic, or element to perform a designated or determined task. In this example, an apparatus, or element thereof, that is not operating is still "configured to" perform a designated task if it is designed, coupled, and/or interconnected to perform said designated task. As a purely illustrative example, a logic gate may provide a 0 or a 1 during operation. But a logic gate "configured to" provide an enable signal to a clock does not include every potential logic gate that may provide a 1 or a 0. Instead, the logic gate is one coupled in some manner such that, during operation, the 1 or 0 output is to enable the clock. Note once again that use of the term "configured to" does not require operation, but instead focuses on the latent state of an apparatus, hardware, and/or element, where the latent state is one in which the apparatus, hardware, and/or element is designed to perform a particular task when the apparatus, hardware, and/or element is operating.

Furthermore, in one embodiment, use of the phrases "to," "capable of," and/or "operable to" refers to some apparatus, logic, hardware, and/or element designed in such a way as to enable use of the apparatus, logic, hardware, and/or element in a specified manner. Note, as above, that use of "to," "capable of," or "operable to," in one embodiment, refers to the latent state of an apparatus, logic, hardware, and/or element, where the apparatus, logic, hardware, and/or element is not operating but is designed in such a manner as to enable use of the apparatus in a specified manner.

A value, as used herein, includes any known representation of a number, a state, a logical state, or a binary logical state. Often, the use of logic levels, logic values, or logical values is also referred to as 1's and 0's, which simply represents binary logic states. For example, a 1 refers to a high logic level and a 0 refers to a low logic level. In one embodiment, a storage cell, such as a transistor or flash cell, may be capable of holding a single logical value or multiple logical values. However, other representations of values in computer systems have been used. For example, the decimal number ten may also be represented as the binary value 1010 and as the hexadecimal letter A. Therefore, a value includes any representation of information capable of being held in a computer system.

Further, a state may be represented by a value or a portion of a value. For example, a first value, such as a logic 1, may represent a default or initial state, while a second value, such as a logic 0, may represent a non-default state. Further, in one embodiment, the terms reset and set refer to default and updated values or states, respectively. For example, the default value potentially includes a high logical value, i.e., reset, while the updated value potentially includes a low logical value, i.e., set. Note that any combination of values may be used to represent any number of states.

The embodiments of methods, hardware, software, firmware, or code set forth above may be implemented via instructions or code stored on a machine-accessible, machine-readable, computer-accessible, or computer-readable medium which are executable by a processing element. A non-transitory machine-accessible/readable medium includes any mechanism that provides (i.e., stores and/or transmits) information in a form readable by a machine, such as a computer or electronic system. For example, a non-transitory machine-accessible medium includes random-access memory (RAM), such as static RAM (SRAM) or dynamic RAM (DRAM); ROM; magnetic or optical storage media; flash memory devices; electrical storage devices; optical storage devices; acoustical storage devices; other forms of storage devices for holding information received from transitory (propagated) signals (e.g., carrier waves, infrared signals, digital signals); etc., which are to be distinguished from the non-transitory media that may receive information therefrom.

Instructions used to program logic to perform embodiments of the invention may be stored within a memory in the system, such as DRAM, cache, flash memory, or other storage. Furthermore, the instructions can be distributed via a network or by way of other computer-readable media. Thus a machine-readable medium may include any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computer), including, but not limited to, floppy diskettes, optical disks, compact disc read-only memories (CD-ROMs), magneto-optical disks, read-only memories (ROMs), random access memories (RAMs), erasable programmable read-only memories (EPROMs), electrically erasable programmable read-only memories (EEPROMs), magnetic or optical cards, flash memory, or tangible machine-readable storage used in the transmission of information over the Internet via electrical, optical, acoustical, or other forms of propagated signals (e.g., carrier waves, infrared signals, digital signals, etc.). Accordingly, a computer-readable medium includes any type of tangible machine-readable medium suitable for storing or transmitting electronic instructions or information in a form readable by a machine (e.g., a computer).

The following examples pertain to embodiments in accordance with this specification. One or more embodiments may provide an apparatus, a system, a machine-readable storage, a machine-readable medium, hardware- and/or software-based logic, and a method to: receive a request over a link, the request requesting a particular line in a memory; identify a directory state record in the memory, the directory state record identifying a directory state of the particular line; identify a type of the request; determine, based on the directory state of the particular line and the type of the request, that the directory state of the particular line is to be changed from a particular state to a new state; and, in response to receipt of the request, change the directory state record to reflect the new state. A copy of the particular line may be sent in response to the request.

In one example, the method is performed, at least in part, by a memory controller.

In one example, the request is to be received from a host device separate from the memory controller.

In one example, the request is to be received from the host device over a buffered memory access link.

In one example, the memory controller is provided on a buffer chip separate from the host device.

In one example, the request originates from a particular caching agent, and the change is to be determined according to a type of the particular caching agent.

In one example, the type of the particular caching agent is one of a local caching agent or a remote caching agent.

In one example, the request includes one of a read request and a memory invalidation request.

In one example, the directory state record includes two or more bits, the bits being included in the line and reflecting the particular state.

In one example, the memory controller writes the new state directly to the particular line in the memory.

In one example, a home agent updates the directory state to the new state, and the memory controller updates the directory state to the new state independently of the home agent.

In one example, the bits are included in error correction code bits of the particular line.

In one example, the memory controller determines that a directory state change associated with another request is unpredictable.

In one example, the memory controller is further to receive a communication from the home agent indicating a directory state change to a line corresponding to the other request, and to perform a write to the line corresponding to the other request, based on the communication, to update the directory state of that line.

In one example, the communication includes a writeback of the line corresponding to the other request, wherein the written-back line indicates the directory state of the line corresponding to the other request.

In one example, the directory state of the line corresponding to the other request includes a directory state record of the line corresponding to the other request.

In one example, the directory state is one of a set of directory states, and the set of directory states includes a shared ("S") state, an any ("A") state, and an invalid ("I") state.

One or more embodiments may provide apparatuses, systems, machine-readable storage, machine-readable media, methods, and hardware- and/or software-based logic (e.g., implementing caching agent logic) to: determine a memory request corresponding to a particular line of memory, the request having a particular request type of a plurality of request types; send the request to a buffer chip over a buffered memory link; receive a copy of the particular line of memory in response to the request; determine, from a directory state record included in the particular line, that the particular line is in a particular directory state; determine a change to the directory state; and determine whether to communicate the change to the buffer chip based on the particular directory state and the particular request type of the request.

In one example, if the change can be predicted from a particular directory state and a particular request type of request, the change is not communicated to the buffer chip.

In one example, if the change is not predictable by the buffer chip based on the particular directory state and the particular request type of the request, the change is communicated to the buffer chip.
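
These two examples suggest a simple host-side policy, sketched below in C: consult the same prediction table the buffer chip uses, and only issue the explicit communication when the transition is unpredictable. predict_dir_state(), decode_dir(), encode_dir(), and the related types are the assumed helpers from the earlier sketches, and send_writeback() is a hypothetical hook for the explicit path (e.g., writeback 1650 of FIG. 16B).

```c
#include <stdbool.h>
#include <stdint.h>

/* Continues the earlier sketches (req_type_t, dir_state_t, cache_line_t,
 * predict_dir_state(), decode_dir(), encode_dir()). */
extern void send_writeback(uint64_t addr, const cache_line_t *line);

void finish_read(req_type_t req, bool remote_agent, uint64_t addr,
                 cache_line_t *line, dir_state_t resolved)
{
    dir_state_t old = decode_dir(line->meta);
    dir_state_t predicted;

    if (predict_dir_state(req, old, remote_agent, &predicted)) {
        /* The transition is predictable, so the buffer chip's memory
         * controller updates the stored record itself: skip the writeback. */
        return;
    }
    /* Unpredictable change (e.g., dependent on snoop results): convey the
     * resolved state back to the buffer chip in a writeback. */
    line->meta = encode_dir(line->meta, resolved);
    send_writeback(addr, line);
}
```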

One or more embodiments may provide a system comprising: a buffer chip; a memory accessed through the buffer chip; and a processor block coupled to the buffer chip through a memory access link. The buffer chip includes a memory controller to: receive, from the processor block over the memory access link, a request associated with a particular line in the memory; identify a directory state record in the memory, the directory state record identifying a directory state of the particular line; identify a type of the request from the request; determine, based on the directory state of the particular line and the type of the request, that the directory state of the particular line is to be changed from a particular state to a new state; and, in response to receipt of the request, change the directory state record to reflect the new state.

In one example, the processor block interfaces with one or more other devices through a general-purpose input/output (GPIO) interconnect link, the memory access link is different from the GPIO interconnect link, and a physical layer of the memory access link is based on a physical layer of the GPIO interconnect link.

Reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, the appearances of the phrases "in one embodiment" or "in an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.

In the foregoing specification, a detailed description has been given with reference to specific exemplary embodiments. It will, however, be evident that various modifications and changes may be made thereto without departing from the broader spirit and scope of the invention as set forth in the appended claims. The specification and drawings are, accordingly, to be regarded in an illustrative sense rather than a restrictive sense. Moreover, the foregoing uses of embodiments and other exemplary language do not necessarily refer to the same embodiment or the same example, but may refer to different and distinct embodiments and potentially the same embodiment.
