Daytripper: Difference between revisions
Revision as of 01:11, 17 March 2010
My CS8803DC project, daytripper, analyzes and rewrites binaries to better take advantage of Intel's Loop Stream Detector.
Loop Stream Detector
- Introduced in Conroe, improved in Nehalem
- Located after instruction fetch in Conroe, and after decode in Nehalem
- Caches 18 fetched instructions in Conroe, and 28 decoded μops in Nehalem
- LSD.UOPS: performance counter providing the number of μops delivered by the LSD (introduced on Core i7)
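The capacities above suggest a simple first-pass test a binary rewriter might apply to a candidate loop. What follows is only a sketch under stated assumptions: the function name `fits_in_lsd` and the table `LSD_CAPACITY` are hypothetical and not part of daytripper, and the only facts taken from this page are the two limits (18 fetched instructions on Conroe, 28 decoded μops on Nehalem). Real hardware may well impose further conditions that are not listed here, so a `True` result means "not ruled out by size", nothing more.

```python
# Hypothetical helper: size-only check against the LSD capacities listed above.
# Conroe's LSD holds fetched instructions; Nehalem's holds decoded uops.
LSD_CAPACITY = {
    "conroe": ("instructions", 18),
    "nehalem": ("uops", 28),
}


def fits_in_lsd(arch, instructions, uops):
    """Return True if a loop body of the given size is within the LSD capacity.

    arch         -- "conroe" or "nehalem" (the only entries modelled here)
    instructions -- number of x86 instructions in the loop body
    uops         -- number of decoded uops in the loop body
    """
    unit, limit = LSD_CAPACITY[arch]
    size = instructions if unit == "instructions" else uops
    return size <= limit


if __name__ == "__main__":
    # A 20-instruction loop that decodes to 24 uops: too big for Conroe's
    # instruction-based limit, but within Nehalem's uop-based limit.
    for arch in ("conroe", "nehalem"):
        print(arch, fits_in_lsd(arch, instructions=20, uops=24))
```

The two microarchitectures are keyed on different units because the LSD sits at different pipeline stages, so the same loop can qualify on one and not the other.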
David Kanter had some excellent insight: <quote>One of the most interesting things to note about Nehalem is that the LSD is conceptually very similar to a trace cache. The goal of the trace cache was to store decoded uops in dynamic program order, instead of the static compiler ordered x86 instructions stored in the instruction cache, thereby removing the decoder and branch predictor from the critical path and enabling multiple basic blocks to be fetched at once. The problem with the trace cache in the P4 was that it was extremely fragile; when the trace cache missed, it would decode instructions one by one. The hit rate for a normal instruction cache is well above 90%. The trace cache hit rate was extraordinarily low by those standards, rarely exceeding 80% and easily getting as low as 50-60%. In other words, 40-50% of the time, the P4 was behaving exactly like a single issue microprocessor, rather than taking full advantage of its execution resources. The LSD buffer achieves almost all the same goals as a trace cache, and when it doesn’t work (i.e. the loop is too big) there are no extremely painful downsides as there were with the P4's trace cache.</quote>
See Also
- Real World Technologies article, "Inside Nehalem: Intel's Future Processor and System" (http://realworldtech.com/page.cfm?ArticleID=RWT040208182719). 2008-04-02.