Author Topic: Parallel execution on a Windows HPC  (Read 33764 times)

Sang-ri

  • Administrator
  • Jr. Member
  • Posts: 70
Re: Parallel execution on a Windows HPC
« Reply #15 on: October 26, 2023, 12:06:17 AM »
Hi,

I apologize for the delayed update. We have finally released an updated version of quoFEM (v3.4.0).

To run the analysis with all 128 cores, please place the attached config.json file in the same directory as the quoFEM executable. Then you can start the quoFEM application as usual.

When quoFEM detects the configuration file, it automatically overwrites the evaluation_concurrency value in dakota.in with 128 (64*2). Currently, the multiplier can only be an integer.

Please let us know if you have any trouble or questions.

Thank you,
Sang-ri

rsam1993

  • Newbie
  • Posts: 21
Re: Parallel execution on a Windows HPC
« Reply #16 on: October 29, 2023, 03:30:43 PM »
Sang-ri,

I really appreciate it. I will start using this new update soon and let you know in case there is any problem, which I doubt.

rsam1993

  • Newbie
  • Posts: 21
Re: Parallel execution on a Windows HPC
« Reply #17 on: October 29, 2023, 08:04:05 PM »
Dear Sang-ri,

I have updated the quoFEM application on our HPC and placed the config.json file in the same directory as the quoFEM executable. Now, when I run it with 128 samples, all 128 cores are used and the OpenSees analyses appear to complete successfully, but I get this error at the end and quoFEM does not give me any results!

Error Running Dakota: Too many processes (128) in wait_setupCurrent limit on processes = 64

And here is the error message that I get from dakota.err file:

Too many processes (128) in wait_setup
Current limit on processes = 64



I am not sure where the problem is, because it should work. Please let me know what you need me to share with you to find the reason for this error.
« Last Edit: October 29, 2023, 08:07:16 PM by rsam1993 »

Sang-ri

  • Administrator
  • Jr. Member
  • Posts: 70
Re: Parallel execution on a Windows HPC
« Reply #18 on: October 30, 2023, 11:45:13 PM »
Hi,

Thank you so much for following up, and I'm sorry for the inconvenience! Your feedback is greatly appreciated because we cannot test this feature ourselves without a machine with more than 64 cores.

Can you please check whether the "dakotaTab.out" file was created in the local working directory (C:\Users\rsamtaslimi\Documents\quoFEM\LocalWorkDir\tmp.SimCenter) and whether it contains the expected sample evaluation results?

If it does, it would be great if you could share the "dakota.err" file from the same folder with us. Currently, quoFEM raises an error whenever dakota.err is non-empty, so in that case we can simply add an exception for this message to fix the problem.

If "dakotaTab.out" has not been created properly, please share files "dakota.err", "dakota.in", "dakota.out", and "log.txt", if those exist in the local working directory, to help us figure out the source of error.
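
For reference, the checks described above could be scripted roughly as follows. This is only a sketch: the function names triage and has_results are made up for illustration, and it assumes that dakotaTab.out contains one header line followed by one row per sample.

```python
from pathlib import Path

# File names taken from this thread; everything else here is illustrative.
FILES = ("dakotaTab.out", "dakota.err", "dakota.in", "dakota.out", "log.txt")


def triage(work_dir):
    """Report whether each file is missing, empty, or non-empty."""
    work_dir = Path(work_dir)
    report = {}
    for name in FILES:
        path = work_dir / name
        if not path.is_file():
            report[name] = "missing"
        elif path.stat().st_size == 0:
            report[name] = "empty"
        else:
            report[name] = "non-empty"
    return report


def has_results(work_dir):
    """True if dakotaTab.out has at least one row besides the header.

    Assumption: the first non-blank line is a header row.
    """
    tab = Path(work_dir) / "dakotaTab.out"
    if not tab.is_file():
        return False
    rows = [ln for ln in tab.read_text().splitlines() if ln.strip()]
    return len(rows) > 1
```

Calling triage on the local working directory returns a dictionary such as {"dakota.err": "non-empty", ...}, which is enough to tell which of the two cases above applies.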

Thanks again!
Sang-ri

rsam1993

  • Newbie
  • Posts: 21
Re: Parallel execution on a Windows HPC
« Reply #19 on: October 31, 2023, 01:40:34 AM »
Thank you for your prompt response,

Yes, dakotaTab.out is created, but the results are not in it. It also seems to me that the dakota.out file is incomplete. I am attaching all the files you mentioned so you can check them and see where this issue comes from.


« Last Edit: October 31, 2023, 01:43:47 AM by rsam1993 »

Sang-ri

  • Administrator
  • Jr. Member
  • Posts: 70
Re: Parallel execution on a Windows HPC
« Reply #20 on: October 31, 2023, 11:01:13 PM »
Hi,

Thanks for sharing the files. We are still struggling to identify the issue. As far as I understand, the automated process of overwriting the evaluation_concurrency value does exactly the same thing as the manual procedure we tried earlier, quoted below.

Hi,

Thank you for the info. We think this number should be 128 instead of 64. While we work on a solution, can you try the following workaround and let us know whether it brings CPU utilization to 100%?

1. Modify the number after "asynchronous evaluation_concurrency" in dakota.in from 64 to 128
2. Remove all files and folders in the local working directory except for "dakota.in" and "templatedir"
3. Find the path of the Dakota executable from the preference window of quoFEM. Let us denote this {dakota path}
4. Open the command prompt, cd into the folder where dakota.in is located, and type "{dakota path} dakota.in" (without the quotation marks)

It will run the forward propagation analysis, and the results will be shown in dakotaTab.out.
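
Steps 1, 2, and 4 above could also be scripted. The following is a rough sketch rather than a tested quoFEM tool: the helper names are hypothetical, the regular expression assumes the concurrency appears as an integer right after "evaluation_concurrency" (with or without "="), and the Dakota path still has to be looked up by hand as in step 3.

```python
import re
import shutil
import subprocess
from pathlib import Path


def set_evaluation_concurrency(text, value):
    """Step 1: replace the integer that follows 'evaluation_concurrency'."""
    pattern = r"(evaluation_concurrency\s*=?\s*)\d+"
    new_text, count = re.subn(pattern, rf"\g<1>{value}", text)
    if count == 0:
        raise ValueError("evaluation_concurrency not found in input")
    return new_text


def clean_work_dir(work_dir, keep=("dakota.in", "templatedir")):
    """Step 2: delete everything in work_dir except the entries in keep."""
    for entry in Path(work_dir).iterdir():
        if entry.name in keep:
            continue
        if entry.is_dir():
            shutil.rmtree(entry)
        else:
            entry.unlink()


def run_dakota(dakota_path, work_dir):
    """Step 4: run '{dakota path} dakota.in' inside work_dir."""
    result = subprocess.run([str(dakota_path), "dakota.in"], cwd=work_dir)
    return result.returncode


# Example usage (paths are placeholders to be filled in by the user):
# work_dir = Path(r"<local working directory>")
# dakota_in = work_dir / "dakota.in"
# dakota_in.write_text(set_evaluation_concurrency(dakota_in.read_text(), 128))
# clean_work_dir(work_dir)
# run_dakota(r"<dakota path>", work_dir)
```

Note that clean_work_dir deletes files permanently, so it is worth backing up the working directory before trying it.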

Thank you,
Sang-ri

Regarding this, is it possible that when we tried the manual procedure, the dakotaTab.out file was not created properly even though CPU utilization reached 100%? Sorry, I should have asked this earlier.

If unsure, please just let me know. We will continue investigating the issue on our side.

Thanks,
Sang-ri


rsam1993

  • Newbie
  • Posts: 21
Re: Parallel execution on a Windows HPC
« Reply #21 on: October 31, 2023, 11:34:58 PM »
Honestly I do not remember if the dakotaTab.out was properly created when we did everything manually. Let me try it again and keep you posted.

rsam1993

  • Newbie
  • Posts: 21
Re: Parallel execution on a Windows HPC
« Reply #22 on: November 01, 2023, 02:32:35 AM »
I have bad news,

I ran Dakota from the command prompt with 128 samples, as you showed me before. During the analysis the CPU is 100% occupied, but I get an empty dakotaTab.out at the end and the same error in the terminal (please see the attached screenshot). However, there is no dakota.err file in the directory.

Sang-ri

  • Administrator
  • Jr. Member
  • *****
  • Posts: 70
    • View Profile
Re: Parallel execution on a Windows HPC
« Reply #23 on: November 12, 2023, 01:21:35 AM »
Thank you so much for revisiting the previous tests! The test was immensely helpful as it clarified that the limitation comes from the UQ engine rather than the interface.

I apologize for the late reply - I was out of the office last week. In the meantime, our team tried to find a workaround, but unfortunately we could not find an immediate solution, especially since we cannot reproduce the error. Also, the source of the issue lies in an internal function of the Dakota program, which is somewhat beyond SimCenter's development focus. It seems that we cannot provide a solution at this point.

However, we will keep you posted in case of further updates.

Thanks again,
Sang-ri

rsam1993

  • Newbie
  • Posts: 21
Re: Parallel execution on a Windows HPC
« Reply #24 on: November 12, 2023, 01:26:18 AM »
Hi,

Thanks for the update. I believe I should still be able to use the interface to run on all 128 cores, so the sampling and the OpenSees analyses will be done. Then, using those results, I should be able to calculate the mean, standard deviation, and other outputs separately.
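
Something like the following is what I have in mind for the post-processing. It is just a sketch under my own assumptions: the per-sample responses have already been collected into a whitespace-delimited table with a header row, and the helper names and the column name "disp" are made up for illustration.

```python
import statistics


def read_column(table_text, column):
    """Extract one named column from a whitespace-delimited table.

    Assumption: the first non-blank line is a header row of column names.
    """
    lines = [ln.split() for ln in table_text.splitlines() if ln.strip()]
    header, rows = lines[0], lines[1:]
    idx = header.index(column)
    return [float(row[idx]) for row in rows]


def summarize(values):
    """Basic sample statistics for one response quantity."""
    values = list(values)
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),  # sample standard deviation (n - 1)
        "min": min(values),
        "max": max(values),
    }


# Example usage with a made-up table:
# table = "id disp\n1 2.0\n2 4.0\n3 6.0\n"
# print(summarize(read_column(table, "disp")))
```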

But, hopefully, this issue can be fixed at some point.