Verification Guild
A Community of Verification Professionals

Verification Guild: Forums


Techniques: Know-how in creating effective and speedy TBs

 
SAHO (Senior) | Joined: Oct 16, 2004 | Posts: 24
Posted: Tue Nov 09, 2004 1:00 pm | Post subject: Techniques: Know-how in creating effective and speedy TBs

Dear verification experts:

Can you suggest a way to get around simulation speed bottlenecks?

My current situation is that our purely VHDL system-level testbenches run very slowly: each (simple) testcase takes hours to simulate. The system-level testbench incorporates external IP testbenches and the components around the DUT (an FPGA), such as SDRAM memories and I2C. To simplify the testbench, (over-)simplified models are used without sacrificing functionality.

My verification platform is Windows-based, using a 3.2 GHz P4 with 2 GB of DDR RAM. The HDL simulator is ModelSim PE.

As our product is for network communication applications, it is essential that the testbench is capable of:

1. injecting a large amount of traffic into the DUT and checking that the traffic has NOT been modified (here the packet type is not a major concern);
2. using a variety of packet types.

From the simulation point of view, I am seeing that a significant amount of (simulation) time is required to transfer these vectors into the DUT. Say, injecting thousands of (legal Ethernet) packets costs > 1 hour of simulation time. Yes, this is simply to inject a large volume of traffic into the DUT. Test vectors are file based (VHDL textio); they get parsed and transformed into streams of packets.

The ModelSim code profiler suggests that there is not much to be optimised, as the items it flags (3 of them) cost only 5% of the simulator's time. One thing I did observe from waveform recording is that the FPGA vendor's PLL simulation package accounts for >60% of the entire file size.

The current system-level testbench uses a generic testbench structure, and specific values are passed down to each specific test using Tcl commands.

There are 2 areas in which I am seeking opinions:
1. simulation runtime
2. testbench creation

Has anyone evaluated design performance using HDL code? I am thinking along the lines of a C-based implementation as a way of benchmarking performance.


To get around the bottlenecks, what is the most pressing issue?

1. coding of testbenches?
2. verification methodology?
3. HDL simulator?
4. component modelling?

Please share your proven methodology and your advice.

SAHO

asif (Senior) | Joined: Oct 20, 2004 | Posts: 18
Posted: Wed Nov 10, 2004 7:55 am

Quote:

Say, to inject thousands of packets (of legal Ethernet) costs > 1 hour of simulation time. Yes, this is simply to inject large volume of traffic into the DUT. Test vectors are (VHDL textio) file based and get parsed and transformed into streams of packets.


Well, for networking applications that require a huge chunk of data to be generated and compared (i.e. data integrity is critical), I would prefer to use a linked-list-based approach.

On the transmit/ingress side, the linked list generates the packets at runtime. When a packet is received on the egress/receive side, it is compared; if a match is found (i.e. data integrity is good), the node is deleted from the linked list.

At any point in time, the verification environment stores only the packets transmitted to the DUT that have failed or are yet to be checked.

By using a linked list, i.e. dynamic memory allocation, one not only uses less memory but also boosts simulation speed and processes data faster.

However, the linked list would need to be written in C and hence requires a PLI/FLI interface.

-Asif

vhdlcohen (Industry Expert) | Joined: Jan 05, 2004 | Posts: 1240 | Location: Los Angeles, CA
Posted: Wed Nov 10, 2004 9:42 am

Quote:
However, the linked list would need to be written in C and hence requires a PLI/FLI interface.

Not in VHDL, which is what SAHO is using.
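
A minimal sketch of how such a linked-list scoreboard might look in plain VHDL, using access types (VHDL's dynamic allocation) instead of C. The package name, the packet_t record and the procedure names are all hypothetical, and the sketch assumes packets leave the DUT in the order they were sent:

```vhdl
library ieee;
use ieee.std_logic_1164.all;

package scoreboard_pkg is
  -- Hypothetical packet summary: a sequence number plus a 32-bit check word.
  type packet_t is record
    seq   : natural;
    check : std_logic_vector(31 downto 0);
  end record;

  type node_t;                          -- incomplete declaration, completed below
  type node_ptr is access node_t;
  type node_t is record
    pkt : packet_t;
    nxt : node_ptr;
  end record;

  -- Append a transmitted packet to the tail of the list.
  procedure push (variable head, tail : inout node_ptr;
                  constant sent       : in    packet_t);

  -- Compare a received packet with the oldest pending one; free the node on a match.
  procedure check_and_pop (variable head, tail : inout node_ptr;
                           constant received   : in    packet_t;
                           variable ok         : out   boolean);
end package scoreboard_pkg;

package body scoreboard_pkg is
  procedure push (variable head, tail : inout node_ptr;
                  constant sent       : in    packet_t) is
    variable n : node_ptr;
  begin
    n := new node_t'(pkt => sent, nxt => null);
    if head = null then
      head := n;                        -- list was empty
    else
      tail.nxt := n;                    -- link after the current tail
    end if;
    tail := n;
  end procedure push;

  procedure check_and_pop (variable head, tail : inout node_ptr;
                           constant received   : in    packet_t;
                           variable ok         : out   boolean) is
    variable old : node_ptr;
  begin
    if head = null then
      ok := false;                      -- nothing pending: unexpected packet
      return;
    end if;
    ok := (head.pkt = received);
    if head.pkt = received then
      old  := head;
      head := head.nxt;
      if head = null then
        tail := null;
      end if;
      deallocate(old);                  -- matched packets no longer use memory
    end if;
  end procedure check_and_pop;
end package body scoreboard_pkg;
```

In a testbench the head and tail pointers would be variables (access types cannot be signals), so the simplest arrangement is to call push and check_and_pop from a single process. Mismatched packets are left in the list, so only failed or not-yet-checked packets occupy memory.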
Ben
_________________
Ben Cohen http://www.systemverilog.us/
* SystemVerilog Assertions Handbook, 3rd Edition, 2013
* A Pragmatic Approach to VMM Adoption
* Using PSL/SUGAR ... 2nd Edition
* Real Chip Design and Verification
* Cmpt Design by Example
* VHDL books

SAHO (Senior) | Joined: Oct 16, 2004 | Posts: 24
Posted: Wed Nov 10, 2004 10:06 am

Ben:

What are your tips? Would you be my teacher of the day?

SAHO

Asif: With ModelSIM PE, there is a limitation on PLI/FLI usage. Anyway, thank you for your thoughts.

SAHO

Janick (Site Admin) | Joined: Nov 29, 2003 | Posts: 1382 | Location: Ottawa, ON Canada
Posted: Wed Nov 10, 2004 10:19 am

Quote:
Well, for networking applications that require a huge chunk of data to be generated and compared (i.e. data integrity is critical), I would prefer to use a linked-list-based approach.


It is not clear, based on the information in the original post, that total memory consumption is the culprit here. There are other potential suspects such as:

- Is there a high-frequency clock that is not really needed?

Look REAL HARD at the PLL model. If it models the VCO, filters and clock divider/multiplier, it may be your culprit. It should be modeled behaviorally.

Any clock dividers? Are the base clock frequencies used at all? Can you shut down the clock to some portion of the design (especially the high-rate ones)?

Do you have any other analog blocks that could be modelled using a small integration step and differential equations instead of behaviorally?

- File I/O

Are waveforms dumped for the entire design at all times? Avoid dumping waveforms as much as possible. How efficient is the packet reader? Can't you use on-the-fly packet generation as suggested earlier?

- Model activity

RTL models are pretty inefficient. Are equivalent behavioral models available? Does your simulator have a cycle-based simulation option?

Is your RTL modeled using separate combinatorial and sequential processes? You can reduce the number of processes by combining the combinatorial part with the sequential part IF there are no combinatorial outputs (i.e. what a cycle-based simulator would do).

- Model size

Do you have a large RAM memory model? Does it use a dynamic model? Do you declare large arrays (e.g. in the scoreboard)? If you are only interested in packet integrity, can it be checked by embedding a CRC in the payload instead of keeping a copy of the original packet in a scoreboard?

- Machine

How big is your machine? Does it have enough cache memory?
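
As an illustration of the "Model activity" point above, here is a minimal sketch of a block written as a single clocked process rather than a combinatorial next-state process plus a register process. The counter itself, its entity name and its port names are hypothetical; the only thing that matters is the structure:

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity counter_merged is
  port (
    clk   : in  std_logic;
    rst   : in  std_logic;
    en    : in  std_logic;
    count : out unsigned(7 downto 0)
  );
end entity counter_merged;

architecture rtl of counter_merged is
  signal count_q : unsigned(7 downto 0) := (others => '0');
begin
  -- A two-process style would compute a next_count signal in a combinatorial
  -- process (woken on every change of en or count_q) and register it in a
  -- second, clocked process. Here the next-state logic is folded into the
  -- clocked process, so only one process wakes up, and only on clock edges.
  registers : process (clk)
  begin
    if rising_edge(clk) then
      if rst = '1' then
        count_q <= (others => '0');
      elsif en = '1' then
        count_q <= count_q + 1;
      end if;
    end if;
  end process registers;

  -- The output is driven straight from the register: no combinatorial outputs.
  count <= count_q;
end architecture rtl;
```

This only applies where every output comes from a register; a block with genuinely combinatorial outputs still needs that logic outside the clocked process.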

cabriggs (Senior) | Joined: Jan 12, 2004 | Posts: 96 | Location: Massachusetts
Posted: Wed Nov 10, 2004 11:06 am | Post subject: Re: Techniques: Know-how in creating effective and speedy TB

SAHO wrote:

From the simulation point of view, I am seeing that a significant amount of (simulation) time is required to complete the process of transferring these vectors into the DUT. Say, to inject thousands of packets (of legal Ethernet) costs > 1 hour of simulation time. Yes, this is simply to inject large volume of traffic into the DUT. Test vectors are (VHDL textio) file based and get parsed and transformed into streams of packets.


Also bear in mind that software-based simulation isn't the fastest thing in the world. If your DUT is big and complex, you may find the best you can do with the fastest CPU in your office is 1000 packets per hour. Four or five years ago I worked on a ~550K-gate Ethernet switching ASIC and I think we came close to 1000 packets/hour, but my current chip is a ~2M-gate WiFi ASIC that doesn't even come close to that, and I'm using much faster simulation servers these days.

If what you need is speed at all costs, you should look into the many fine acceleration and emulation products out there.

-cb

vhdlcohen (Industry Expert) | Joined: Jan 05, 2004 | Posts: 1240 | Location: Los Angeles, CA
Posted: Wed Nov 10, 2004 11:08 am

Quote:
Test vectors are (VHDL textio) file based and get parsed and transformed into streams of packets.

How about using binary files instead of text files? This may give some improvement. In my Coding Style book I demonstrate how a simple VHDL program can be used to convert text files to binary files. This eliminates all the parsing during your simulations and saves time, particularly on regression tests, since the parsing is done only once.
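
A minimal sketch of such a converter, assuming a text file of whitespace-separated decimal integers. This is not the program from the book; the entity name, generic names and default file names are hypothetical:

```vhdl
use std.textio.all;

entity txt2bin is
  generic (
    TXT_NAME : string := "vectors.txt";   -- hypothetical input file name
    BIN_NAME : string := "vectors.bin"    -- hypothetical output file name
  );
end entity txt2bin;

architecture convert of txt2bin is
  type int_file is file of integer;
begin
  process
    file txt_f     : text     open read_mode  is TXT_NAME;
    file bin_f     : int_file open write_mode is BIN_NAME;
    variable l     : line;
    variable value : integer;
    variable good  : boolean;
  begin
    while not endfile(txt_f) loop
      readline(txt_f, l);
      -- Parse every integer on the line; stop at the first failed read.
      loop
        read(l, value, good);
        exit when not good;
        write(bin_f, value);              -- implicit write for "file of integer"
      end loop;
    end loop;
    file_close(txt_f);
    file_close(bin_f);
    wait;                                 -- run once, then suspend forever
  end process;
end architecture convert;
```

The testbench would then declare the same file type and fetch values with read(bin_f, value), with no text parsing at all. The on-disk layout of a "file of integer" is simulator-specific, so the binary file should be written and read with the same tool.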
Ben Cohen

jmcneal (Senior) | Joined: Jan 12, 2004 | Posts: 34 | Location: Hillsboro, Oregon
Posted: Wed Nov 10, 2004 11:36 am

SAHO

To address some things that haven't come up yet:

VHDL vs Verilog. You probably can't switch in mid-project, but Verilog usually simulates much faster than VHDL. Several years ago my project team abandoned VHDL for exactly this reason. At the time (~5 years ago) we were using Modelsim VHDL. We got a 10x speedup going to Modelsim Verilog. We got an additional 7x going to NCSim. (If you are changing languages, changing simulators is no big step.) Our sims went from hours to minutes. These differences may not be as large now; I haven't looked at VHDL since then.

I'm only trying to address sim speed issues here. I'm not interested in starting a VHDL v Verilog flame war. There are many +s for both, and lots of reasons not to switch.

Second, have you tried adding memory to your machine? Another 2G of RAM should be a cheap, easy test.

Does anyone have any numbers on Linux vs Windows specifically for run times on Verilog/VHDL sims?

jeff

SAHO (Senior) | Joined: Oct 16, 2004 | Posts: 24
Posted: Wed Nov 10, 2004 11:57 am | Post subject: ALTERA PLL

Hello folks:

Thank you so much for the contributions. It will take me some time to read and digest your proven methods and approaches.

Just thought that I would update other readers on this matter. For our design, we are using the PLL VHDL package from Altera.

By simply disabling these component instantiations in the FPGA design's clock network, and replacing their functionality with purely behavioural testbench-style clocks like the one below

clock <= not clock;

I am observing a huge improvement in simulation time. With the over-simplified clock, the test takes ~3 min (wall-clock time) and the full regression run completes within 1 hour. By comparison, instantiating the PLL components in the design has a huge negative impact on simulation runtime: rather than a 3-min run, a single test takes more than an hour. Don't even talk about the regression run; my guess is that it will take AGES. AMAZING!

On a closer observation,

Code:
TestName           simulation time to complete    wall clock time
=================================================================
test 1             253668597 ps                   3 mins
test 2 with PLL    > 5 ms                         > 60 min (still running at the time I write)


where test 1 = no FPGA vendor PLL instantiated, using simple clock
test 2 = with FPGA vendor's PLL instantiated

Can anyone give me a reason why the simulation times for the two tests (same test, with a different clock scheme ONLY) are different? Any clues?

With the PLL instantiated, I have noticed that the simulator has to perform many delta cycles (mainly in the FPGA vendor packages). Can this contribute to the difference in the time taken to complete the test? Mind you, the end results are IDENTICAL; it is just that one takes more time than the other.


I have learned a lesson!

SAHO

Janick (Site Admin) | Joined: Nov 29, 2003 | Posts: 1382 | Location: Ottawa, ON Canada
Posted: Wed Nov 10, 2004 12:29 pm | Post subject: Re: ALTERA PLL

SAHO wrote:
For our design, we are using PLL VHDL package from Altera vendor.

By simply disabling these component instantiations in the FPGA Design's clock network, and replacing their functionalities with purely behavioural Testbench style clocks like below

clock <= not clock;

I am observing huge improvement in terms of simulation time.
(...)
Can anyone give me reason as to why the simulation times for both test (same test, with different clock scheme ONLY) are different?


The simulation can only go as fast as the smallest interval between two events. If you have a fast clock (e.g. 2ns period), the simulation will be ticking along at every 1ns because there is an event every 1ns.

Looks like the PLL model is structurally modelled. Because PLLs are analog components, they are difficult to model structurally in an event-driven simulator. There is a VHDL book out there with a PLL model that uses analog models for the RC filter components with a 1fs integration step! Nice academic exercise - but totally useless in real life. I suspect the delta cycles you are seeing are used to settle values on inout ports representing analog voltage values.
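
To put rough numbers on this (the 1 ms of simulated time is an assumed figure chosen for illustration, not one taken from this thread): a 2 ns clock produces an event every 1 ns, while a 1 fs integration step forces an evaluation every femtosecond.

\[
N_{\text{clock}} = \frac{1\,\text{ms}}{1\,\text{ns}} = 10^{6}
\qquad
N_{\text{1 fs}} = \frac{1\,\text{ms}}{1\,\text{fs}} = 10^{12}
\qquad
\frac{N_{\text{1 fs}}}{N_{\text{clock}}} = 10^{6}
\]

So under these assumptions the analog-style model has about a million times more time points to visit than the fastest clock alone would require.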

What is it used for? Clock multiplication? Clock recovery? Clock smoothing? Model that PLL behaviorally. See Sample 5-11 in the 2nd edition of my book for an example of a behavioral model of a clock multiplier PLL.
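
For readers without the book to hand, the following is one possible sketch of a behavioral clock multiplier. It is not Sample 5-11. It assumes a steady reference clock, models no jitter, phase offset or lock time, and the entity, generic and port names are hypothetical:

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity pll_mult_bhv is
  generic (
    MULT : positive := 4                -- output frequency = MULT x reference
  );
  port (
    ref_clk : in  std_logic;
    out_clk : out std_logic := '0';
    locked  : out std_logic := '0'
  );
end entity pll_mult_bhv;

architecture behavioral of pll_mult_bhv is
begin
  multiply : process
    variable last_edge : time    := 0 ns;
    variable seen_edge : boolean := false;
    variable half      : time;
  begin
    wait until rising_edge(ref_clk);
    if seen_edge then
      -- Measure the last reference period and derive the output half-period.
      half := (now - last_edge) / (2 * MULT);
      -- Schedule all MULT output pulses for the coming reference period at once.
      -- Transport delay keeps the earlier transactions in the queue.
      for i in 0 to MULT - 1 loop
        out_clk <= transport '1' after (2 * i) * half;
        out_clk <= transport '0' after (2 * i + 1) * half;
      end loop;
      locked <= '1';
    end if;
    last_edge := now;
    seen_edge := true;
  end process multiply;
end architecture behavioral;
```

The process wakes once per reference edge and produces exactly 2 x MULT output events per reference period, with no analog settling and no delta-cycle loops. The first reference period is spent measuring, so the output starts one period late.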

SAHO (Senior) | Joined: Oct 16, 2004 | Posts: 24
Posted: Wed Nov 10, 2004 1:05 pm

Salute JANICK:

Quote:
The simulation can only go as fast as the smallest interval between two events. If you have a fast clock (e.g. 2ns period), the simulation will be ticking along at every 1ns because there is an event every 1ns.

Looks like the PLL model is structurally modelled. Because PLLs are analog components, they are difficult to model structurally in an event-driven simulator. There is a VHDL book out there with a PLL model that uses analog models for the RC filter components with a 1fs integration step! Nice academic exercise - but totally useless in real life. I suspect the delta cycles you are seeing are used to settle values on inout ports representing analog voltage values.

What is it used for? Clock multiplication? Clock recovery? Clock smoothing? Model that PLL behaviorally. See Sample 5-11 in the 2nd edition of my book for an example of a behavioral model of a clock multiplier PLL.


The PLL is mainly used for both clock buffering and clock multiplication within the design, hence removing the need to fit external clocks for the memory components.

I am not sure how Altera technical support coded the PLL model. For your information, it is defined in the Altera simulation packages

Code:
lpm/220pack.vhd
lpm/220model.vhd

altera_mf_components/altera_mf_components.vhd
altera_mf_components/altera_mf.vhd


I SHOULD really have pointed out that these are for simulation purposes ONLY. We had initially validated (in testbenches and in the laboratory) that these PLLs can be re-configured to cater for different Ethernet link speed configurations.

NOTE: It is FUNNY that after re-configuring the PLL several times (which involves loading the reconfig cache with data, then streaming it across to the PLL), the PLL eventually lost lock. Even worse, Altera's response to this observation was NO SUPPORT: it is as designed and not to be used for reconfiguration, yet their application note indicates that it is possible. How contradictory!


Quote:

Model that PLL behaviorally. See Sample 5-11 in the 2nd edition of my book for an example of a behavioral model of a clock multiplier PLL.


What we did was replace the PLL with clocks whose periods are set by VHDL generic constants.
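
A minimal sketch of that kind of generic-driven clock source. The entity name, the generic name and the 40 ns default are hypothetical. Note that the assignment needs both an initial value and an "after" delay; a bare "clock <= not clock;" would either stay at 'U' or loop in zero time:

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity clk_gen is
  generic (
    CLK_PERIOD : time := 40 ns          -- hypothetical default (25 MHz)
  );
  port (
    clk : out std_logic
  );
end entity clk_gen;

architecture behavioral of clk_gen is
  signal clk_i : std_logic := '0';      -- must start at '0' or '1', not 'U'
begin
  -- One event per half-period, nothing else.
  clk_i <= not clk_i after CLK_PERIOD / 2;
  clk   <= clk_i;

  -- Report the period actually in use once, at the start of simulation.
  assert false
    report "clk_gen: CLK_PERIOD = " & time'image(CLK_PERIOD)
    severity note;
end architecture behavioral;
```

The start-up note is a cheap way to confirm in the transcript that the period set from the testbench generic map actually reached the clock generator.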

SAHO

SAHO (Senior) | Joined: Oct 16, 2004 | Posts: 24
Posted: Wed Nov 10, 2004 2:59 pm

Quote:
On a closer observation,

Code:
TestName           simulation time to complete    wall clock time
=================================================================
test 1             253668597 ps                   3 mins
test 2 with PLL    > 5 ms                         > 60 min (still running at the time i write)


where test 1 = no FPGA vendor PLL instantiated, using simple clock
test 2 = with FPGA vendor's PLL instantiated



I am getting tired and my mind is a mess! :)

On a closer observation,

Code:
TestName           simulation time to complete    wall clock time
=================================================================
test 1             2.6 ms                         30 min
test 2 with PLL    2.6 ms                         45 min


where test 1 = no FPGA vendor PLL instantiated, using simple clock
test 2 = with FPGA vendor's PLL instantiated.

Nonetheless, there is an improvement of 1.5X (45/30). This is a BETTER comparison, as the simulated time to complete a run must be the SAME, while the actual wall-clock time CAN differ.

I realised that the generic was not passed down, so test 1 had a clock frequency 1000X higher, whereas test 2 is an Ethernet-based system with a link speed of 10 Mbps. Doh!


Performance enhancement goes on.

SAHO

All times are GMT - 5 Hours
Verification Guild © 2006 Janick Bergeron