Author Topic: failed simulations for large number of samples  (Read 21089 times)

azing

  • Newbie
  • *
  • Posts: 33
    • View Profile
failed simulations for large number of samples
« on: May 09, 2021, 05:10:31 PM »
Hi,

When I specify a large number of samples in a forward propagation problem using LHS, my simulation fails. I have tried running this both on my laptop and on DesignSafe:

- On the laptop, I get the error "Dakota has stopped working". If I choose the debug option, it says "an unhandled exception occurred in [11756] Dakota.exe".

- On DesignSafe, the job status turns to "failed" and it says "APPS_USER_APP_FAILURE Failure indicated by Slurm status TIMEOUT with user application return code: 0:0".

* I have previously run this model for 32768 time steps with only one random variable and one sample using WE-UQ without any errors and got results comparable to deterministic OpenSees simulation results.
* I have also run the model as a test with only two time steps and 26 random variables and 30 samples successfully.

So I don't think the issue is in the way the model and WE-UQ parameters are set up. It seems to be due to the large number of samples, but I don't know why or how to resolve it.

Thank you,
Azin 

azing

Re: failed simulations for large number of samples
« Reply #1 on: May 10, 2021, 03:15:30 AM »
I tried running the simulation on DesignSafe again. This time the simulation completed :) but did not give me any output. The individual work directories were not even generated.

I tried a second time. This time, 32 out of 1000 work directories are created, but the recorders inside them are empty. The DakotaTab file is accordingly generated without any data (it contains only headers).

I'm trying another simulation with fewer samples (500) to see if I can get results.

Could this be an issue related to the archiving stage? 

fmk

  • Administrator
  • Full Member
  • *****
  • Posts: 233
    • View Profile
Re: failed simulations for large number of samples
« Reply #2 on: May 10, 2021, 06:52:44 AM »
the error at DesignSafe is due to the fact that you have not specified enough time to allow the computations to finish .. give more nodes and total wall time (the cores at DesignSafe are probably much slower than your own computer so adjust accordingly .. you might have to time a small #samples, say 32, to see how long the simulations take) .. the envelope recorders only write data to file at the end, when they are being destroyed .. so the file can exist, but if it has no data then the OpenSees script has failed to finish correctly.

the dakotaTab is only filled in at the end of a successful dakota run .. look instead at the dakota.out file to see if any samples actually finished.

azing

Re: failed simulations for large number of samples
« Reply #3 on: May 10, 2021, 07:30:28 AM »
Thank you so much for your quick response. I will try a smaller sample size and a higher number of nodes.

How can I tell from the Dakota.out file whether samples have finished? I have attached the Dakota.out file for one of my simulations here. It says that each evaluation has been added to the queue and then assigned to a specific peer, but no information is given about the completion of the processes. Does this mean that none of the processes have completed?

fmk

Re: failed simulations for large number of samples
« Reply #4 on: May 10, 2021, 06:09:39 PM »
yes none have completed .. you should see lines like:

Evaluation 5 has completed
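
A quick way to check progress is to count those lines. The following is a minimal Python sketch, assuming the log uses exactly the "Evaluation N has completed" wording quoted above; the function name and the sample text are made up for illustration and are not real Dakota output.

```python
import re

# Pattern assumes the wording quoted above: "Evaluation <N> has completed".
COMPLETED = re.compile(r"^\s*Evaluation\s+(\d+)\s+has completed", re.MULTILINE)

def completed_evaluations(log_text):
    """Return the sorted, de-duplicated evaluation IDs reported as completed."""
    return sorted({int(n) for n in COMPLETED.findall(log_text)})

# Illustrative text only (not real Dakota output).
sample_log = (
    "some header line\n"
    "Evaluation 2 has completed\n"
    "some unrelated line\n"
    "Evaluation 1 has completed\n"
)

done = completed_evaluations(sample_log)
print(f"{len(done)} evaluations completed: {done}")
```

For a real run, the text would come from the log itself, e.g. `completed_evaluations(open("dakota.out").read())`. An empty result means no sample finished before the job ended.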

azing

Re: failed simulations for large number of samples
« Reply #5 on: May 11, 2021, 03:04:05 AM »
Hi,

I tried a simulation with 100 samples, 100 nodes, 32 processors, and a max time of 40 hours. After almost 10 hours in the queue, the analysis suddenly changed status to finished without giving me any output. I checked the status every now and then while it was queued and never saw any status other than queued. It was in the queue all day, and then the status suddenly changed to finished.

If I use a larger number of nodes or processors, or a higher max time, the analysis fails immediately. The above numbers were the highest I could choose.


fmk

Re: failed simulations for large number of samples
« Reply #6 on: May 11, 2021, 05:11:45 PM »
the system possibly kicked you off for your choice of nodes and processors .. you are running on stampede's KNL nodes .. each node has 64 cores/processors (https://portal.tacc.utexas.edu/user-guides/stampede2) .. so you want more processors than nodes, and unless this is a memory/disk intensive job, i.e. a large FE model with lots of file writes, you want as few nodes as possible (reduces your wait time) .. how big is the FE model of yours and how long does it take on your local machine to run?
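
The trade-off between nodes and wall time can be roughed out before submitting. The sketch below is only an estimate under stated assumptions: one sample runs on one core at a time, memory and disk are not the bottleneck, and the `slowdown` and `safety` factors are hypothetical knobs that a timed test run would calibrate. The 64 cores per node is the figure given above.

```python
import math

def estimate_wall_hours(n_samples, hours_per_sample, nodes,
                        cores_per_node=64, slowdown=1.0, safety=1.5):
    """Rough wall-time estimate, assuming one sample per core at a time.

    slowdown: how much slower a cluster core is than the machine that was
              timed (hypothetical; measure it with a small test run).
    safety:   extra margin so the scheduler does not cut the job off.
    """
    cores = nodes * cores_per_node
    waves = math.ceil(n_samples / cores)  # batches of concurrent samples
    return waves * hours_per_sample * slowdown * safety

# 100 samples on 2 nodes: 128 cores, so every sample fits in one wave.
print(estimate_wall_hours(100, hours_per_sample=1.0, nodes=2))  # 1.5
```

Under these assumptions, adding nodes beyond the point where all samples fit in one wave does not shorten the run; it only lengthens the queue wait.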


azing

Re: failed simulations for large number of samples
« Reply #7 on: May 11, 2021, 05:59:20 PM »
It takes about 3 hours to run on my PC, I think. It is a 20-story frame (2D) with panel zones and concentrated plasticity elements, so it has lots of nodes and elements. I have modified the recorder and processor files as well. Could this be the problem? I separated the node recorders for each story, so the program should create lots of output files. I have attached my model here.


fmk

Re: failed simulations for large number of samples
« Reply #8 on: May 11, 2021, 06:55:42 PM »

1) if the model goes nonlinear, one of the main reasons it is slow is that you are using initial stiffness iterations.

2) you create 104 output files per run, when you probably could get away with 10 .. 104 is not an awful lot compared to some of the scripts I have seen, but it is more than needed and it is a good habit to open as few files as possible .. the record for most files saved from the recorder commands is around 12k.
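
One way to cut the file count is to list many nodes in a single recorder instead of opening one file per story. The sketch below builds such an OpenSees Tcl `recorder EnvelopeNode` command as a string; the helper name, file name, and node tags are hypothetical, and any post-processing script would have to be adjusted to read the extra columns from the merged file.

```python
def envelope_node_recorder(file_name, node_tags, dofs, response="disp"):
    """Build one OpenSees Tcl EnvelopeNode recorder command covering many nodes."""
    nodes = " ".join(str(tag) for tag in node_tags)
    dof_list = " ".join(str(dof) for dof in dofs)
    return (f"recorder EnvelopeNode -file {file_name} "
            f"-node {nodes} -dof {dof_list} {response}")

# Hypothetical tags: one node per floor of a 20-story frame.
floor_nodes = [100 * story for story in range(1, 21)]
print(envelope_node_recorder("disp.out", floor_nodes, dofs=[1]))
```

With this layout, twenty per-story displacement files collapse into one file with one group of columns per node.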

azing

Re: failed simulations for large number of samples
« Reply #9 on: May 12, 2021, 11:08:18 PM »
Thank you for your response.

1) The frame remains linear elastic when subjected to the loads I'm currently using...

2) I should certainly change the recorders to a cleaner, more compact format. How can I go from 104 to 10? I have 48 recorder files. The rest of the files in the local directories are the main .json file and other input files.

- I tried running a simulation with the default stick model in WE-UQ at DesignSafe, but that didn't go through either. I got the same problem as with my 2D frame model: the analysis either fails or finishes with no output. I don't know what I'm doing wrong when running at DesignSafe. I was able to run the 2D frame locally with 1000 samples, but I can't handle more samples locally.

fmk

Re: failed simulations for large number of samples
« Reply #10 on: May 13, 2021, 06:44:40 AM »
post an event file for me to have a look at.