Accurate Performance Evaluation, Modeling and Prediction of a Molecular Simulation coded with Message Passing Middleware
Michela Taufer and Thomas Stricker
Laboratory for Computer Systems, Swiss Institute of Technology (ETH), CH-8092 Zuerich, Switzerland
taufer@inf.ethz.ch, tomstr@acm.org
Abstract
In distributed and vectorized computing there is a large number of highly different supercomputing platforms an application could run on. Nevertheless, most traditional parallel codes are ill equipped to collect data about their resource usage or their behavior at run time, and the corresponding data are rarely published. Only few computational scientists explore the interactions of their target platforms with their applications systematically. As an improvement over the current state of the art, we propose an integrated approach to performance evaluation, modeling and prediction for different platforms. Our approach relies on a combination of analytical modeling and systematic experimentation with full application runs, application kernels and some benchmarks. We outline our methodology of performance assessment with Opal, an example code in molecular biology, developed at our institution to run in parallel on Cray J90 "Classic" vector SMPs. Besides a detailed assessment of the performance achieved on these J90s, the primary goal of our study was to find a more suitable and potentially more cost effective hardware platform for the application, in particular to check the suitability of this application for slow CoPs, SMP CoPs and fast CoPs, three flavors of Clusters of PCs built with off-the-shelf microprocessors at our computer systems laboratory. The performance assessment based on our model is much easier than porting and parallelizing the application for a new target machine, and so we could also obtain and include performance estimates for a T3E-900, a high end MPP system. The predicted execution times and speedup figures indicate that a well designed cluster of PCs achieves similar if not better performance than the J90 vector processors currently used and that the computational efficiency compares favorably to the T3E-900 for that particular application code.
1 Introduction
A large variety of different parallel computing platforms makes it quite difficult to pick the best suited and most cost effective parallel computer as a target platform for a new application code to run on. The question of the best platform for an application should be addressed early in the design process, as soon as the characteristics of the application code become known during early prototyping. Typically the scientists in the application field are only interested in the results of the computation itself and rarely in how fast and how efficiently they were computed. Therefore most traditional parallel codes are ill equipped to collect data about their resource usage or their behavior at run time, and the corresponding data are rarely published. There are a few exceptions to this rule, like [14], a paper devoted entirely to the characteristics of a large FEM application.
Most machines in high performance computing, and even today's PCs, have good hardware instrumentation to collect all the necessary data, but most system vendors don't promote direct access to it. Instead they provide high-level performance tuning and advisory tools with little information about their resolution, their accuracy and their theory of operation. Furthermore those tools often interfere with the tasking support of the parallelization and communication tools (i.e. the middleware for parallelization). Other common problems include the client/server paradigm with full overlap of computation and communication and many latency hiding mechanisms, which make accurate and detailed performance measurements almost impossible.
As an improvement over the current state of the art, we propose and demonstrate an integrated approach to application design, parallelization and performance analysis using a combination of analytical modeling and measurements. We studied this integration with Opal [13], an example code in molecular biology, developed at our institution to run on our four Cray J90 "Classic" vector SMPs, with 8-16 processors each. Besides a detailed assessment of the performance achieved on the J90s, the primary goal of our study was to find the most suitable and most cost effective hardware platform for the application, in particular to check its suitability for fast, slow and SMP CoPs, three flavors of Clusters of Pentium PCs built by our computer architecture group.
In Chapter 2 we develop and present an analytical complexity model to predict the execution time for Opal simulations with different input parameters. We calibrate the model with a systematic experimental design and show what we learned from a detailed analysis of computation and communication performance. In Chapter 3 we discuss the integration of performance monitoring into middleware packages and the problem of accurate accounting despite overlapped communication and computation. In Chapter 4 we use the model together with some architectural key data to predict the efficiency of Opal on alternative platforms including Clusters of PCs.
2 Case study: Instrumenting a molecular biology code

2.1 A brief description of Opal
Opal is a software package to perform the simulation of the molecular dynamics of proteins and nucleic acids in vacuum or in water through energy minimization. Opal uses classical mechanics, i.e., the Newtonian equations of motion, to compute the trajectories $\mathbf{r}_i(t)$ of $n$ atoms as a function of time $t$. Newton's second law expresses the acceleration as:
$$m_i \, \frac{d^2\mathbf{r}_i(t)}{dt^2} = -\frac{\partial}{\partial \mathbf{r}_i(t)}\, V\bigl(\mathbf{r}_1(t),\ldots,\mathbf{r}_n(t)\bigr)$$
A typical function $V$ has the form:
$$
\begin{aligned}
V(\mathbf{r}_1,\ldots,\mathbf{r}_n) ={} & \sum_{\text{all bonds}} \tfrac{1}{2} K_b\,(b-b_0)^2
+ \sum_{\text{all bond angles}} \tfrac{1}{2} K_\Theta\,(\theta-\theta_0)^2 \\
& + \sum_{\text{improper dihedrals}} \tfrac{1}{2} K_\xi\,(\xi-\xi_0)^2
+ \sum_{\text{dihedrals}} K_\varphi\,\bigl(1+\cos(n\varphi-\delta)\bigr) \\
& + \sum_{\text{all pairs } (i,j)} \left( \frac{C_{12,ij}}{r_{ij}^{12}} - \frac{C_{6,ij}}{r_{ij}^{6}} + \frac{q_i\,q_j}{4\pi\varepsilon_0\varepsilon_r\, r_{ij}} \right)
\end{aligned}
$$
The first term models the covalent bond-stretching interaction along bond $b$. The value of $b_0$ denotes the minimum-energy bond length, and the force constant $K_b$ depends on the particular type of bond. The second term represents the bond-angle bending (three-body) interaction. The (four-body) dihedral-angle interactions consist of two terms: a harmonic term for dihedral angles $\xi$ that are not allowed to make transitions, e.g., dihedral angles within aromatic rings or dihedral angles that maintain chirality, and a sinusoidal term for the other dihedral angles $\varphi$, which may make 360° turns. The last term captures the non-bonded interactions over all pairs of atoms. It is composed of the van der Waals and the Coulomb interactions between atoms $i$ and $j$ with charges $q_i$ and $q_j$ at a distance $r_{ij}$.
A first serial version of Opal, Opal-2.6, was developed at the Institute of Molecular Biology and Biophysics at ETH Zürich [12]. It was written in standard FORTRAN-77 and optimized for vector supercomputers through a few vectorizable loops. In the serial code of Opal-2.6 a single processor runs the whole computation. Opal-2.6 spends most of the computing time during a simulation evaluating the non-bonded interactions over all pairs of atoms of the molecular system (the last term of the atomic interaction function $V$). Fortunately, these calculations also offer a high degree of parallelism in addition to the vectorizable inner loops.
The parallel version of Opal
The parallel version of Opal [17, 2] distributes its work among multiple processors in a client-server setting: multiple servers share the computation of the van der Waals and Coulomb forces while one client computes the few remaining interactions and coordinates the work. The computation repeats for every time step.
For a molecular complex of $n$ atoms, the number of non-bonded interactions between atoms which must be evaluated is of the order of $n^2$. In the new version of Opal, this sequential complexity of the molecular energy evaluation is reduced by neglecting many of the non-bonded interactions in the molecular energy computation: only the pairs of atoms whose distance is less than a cut-off parameter are taken into account. At first, the data describing the non-bonded interaction parameters between the solute-solute, solute-solvent and solvent-solvent atom pairs are replicated on all the servers. This global information, whose volume depends on the problem size and does not scale with the number of processors, allows each server to achieve a large independence. With its data, each server runs its tasks of the simulation requesting no further parameters at each step from the client other than the atom coordinates.
A simulation proceeds by repeating the same computation tasks continuously. At the end of each step the information about the total energy, volume, pressure and temperature of the molecular complex is displayed. In the first stage of each simulation step, which we call the update phase, each server selects a distinct subset of the atom pairs, checks the distance between the atoms of each pair and adds the pair to its own list of all active pairs when the atoms are not beyond the given distance cut-off. In the second stage of the simulation step, the servers compute partial non-bonded energies (van der Waals energy and Coulomb energy) using the list of all active pairs. At the end of this step each server sends its partial results to the client, which gathers them and sums up the total molecular energy of the molecular complex as well as its volume, pressure and temperature.
The data in each list are updated periodically. The interval between successive updates can be selected by the user through the setting of an Opal parameter called update. The value of the update parameter expresses the number of iteration steps after which the lists of all active pairs are updated.
The distribution of the atom pairs for the evaluation of the energies due to the non-bonded interactions is done using a pseudo-random strategy. Randomization should help to balance the workload among the servers and to avoid duplication of work.
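The update phase can be pictured with a short sketch. The following C fragment is a minimal illustration only, not the actual Opal source (which is FORTRAN-77); the data layout and the hash-based pseudo-random assignment of pairs to servers are assumptions made for illustration, since the real distribution scheme is not published at this level of detail.

```c
#include <stddef.h>

/* Hypothetical data layout: one record per mass center. */
typedef struct { double x, y, z; } center_t;

/* Deterministic mixing of a pair index, standing in for Opal's
   pseudo-random distribution strategy. */
static unsigned mix(unsigned i, unsigned j) {
    unsigned h = i * 2654435761u ^ j * 40503u;
    return h ^ (h >> 16);
}

/* Update phase on one server: scan all pairs, keep those pairs that are
   assigned to this server and whose distance is below the cut-off radius.
   Returns the number of active pairs written into list[]. */
size_t update_pair_list(const center_t *c, size_t n,
                        double cutoff, int server, int nservers,
                        size_t (*list)[2], size_t max_pairs) {
    size_t count = 0;
    double cut2 = cutoff * cutoff;
    for (size_t i = 0; i < n; i++) {
        for (size_t j = i + 1; j < n; j++) {
            if (mix((unsigned)i, (unsigned)j) % nservers != (unsigned)server)
                continue;               /* pair belongs to another server */
            double dx = c[i].x - c[j].x;
            double dy = c[i].y - c[j].y;
            double dz = c[i].z - c[j].z;
            if (dx*dx + dy*dy + dz*dz <= cut2 && count < max_pairs) {
                list[count][0] = i;     /* pair is active: remember it */
                list[count][1] = j;
                count++;
            }
        }
    }
    return count;
}
```

Note that every server still scans all $O(n^2)$ pairs during an update; only the energy evaluation over the resulting list shrinks with the cut-off. This asymmetry is exactly what the time complexity model in Section 2.2 captures.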
Moreover, with a slight change of the molecular simulation model, i.e., the use of water molecules in the solvent as single units centered in the oxygen atoms instead of three individual atoms, we accomplished:

– a reduced workload of the servers,
– a reduction in size of the lists (memory usage),
– an increase in accuracy of the molecular energy calculations with small cut-off radii.
Alternative parallelizations for molecular simulations

The parallelization of the non-bonded pairwise energy computation through the distribution of the mass centers amongst several processors, as used for Opal, is not the only parallelization approach. There are three main approaches to such a parallelization: the replicated-data (RD) method used for Opal, in which the mass centers (i.e. atoms) are distributed among the processors; the geometric- or space-decomposition (SD) method, in which each processor considers the mass centers in its sub-domain during the simulation; and the force-decomposition (FD) method, in which the force matrix $F_{ij}$ ($F_{ij}$ is the force on mass center $i$ due to mass center $j$) is partitioned by blocks among the processors [15, 1].
Comparable packages in molecular biology

With its parallelization, the parallel version of Opal has become similar to Amber [19]. Both codes allow the user to carry out energy minimization and molecular dynamics using the same analytical function (see 2.1). Moreover, both codes use molecular component interaction lists, which need periodic updates, and allow the user to evaluate the interactions of each atom with the rest of the molecular complex within a cut-off distance. For the parallel version of Amber, a generalized MPI interface [7] is used for message-passing, while the parallel version of Opal relies on the PVM interface and the Sciddle RPC middleware package [4, 18]. Still, both codes are explicitly parallel and well suited for distributed memory machines with a message passing API.
2.2 A time complexity model for Opal
During the design and parallelization of Opal we derived an analytical time complexity model that captures all essential parameters of the real application. The predicted outcome of the model is the execution time of Opal in seconds, written as a sum of several partial result variables which are computed separately and also measured separately during validation:
$$t_{Opal} = t^{tot}_{comp} + t^{tot}_{comm} + t^{tot}_{sync} \qquad (1)$$

$t^{tot}_{comp}$, the total parallel computation time, is the computation time spent by the servers servicing the requests for computation. The servers run two routines as parallel work: the update routine that updates the lists of atom pairs, and the energy evaluation routine that evaluates the partial energies of the non-bonded interactions (van der Waals energy and Coulomb energy).

$$t^{tot}_{comp} = t_{update} + t_{nbint} \qquad (2)$$
The computation time of the update routine always grows quadratically with the problem size, because each time the servers update their own lists all the pairs of atoms must be checked. At the same time, the update time decreases linearly with the increase of the time interval between two list updates.
$$t_{update} = \frac{s}{u}\,\frac{a_2\, n^2\,(1+2\gamma)}{2\,p} \qquad (3)$$

where:

– $s$ is the number of simulation steps.
– $p$ is the number of servers on which the Opal application runs.
– $u$ is the frequency of the list updates (update parameter).
– $n$ represents the number of mass centers (atoms and water molecules) of the whole molecular complex.
– $\gamma$ (gamma) is the ratio of the number of water molecules to the total number of mass centers.
– $a_2$ represents the computation time spent to generate a pair of atoms and calculate the distance between them.
On the other hand, the time for the energy evaluation routine is subject to the effects of the cut-off parameter: the dimension of the lists over whose pairs the partial energies are evaluated increases drastically with the increase of the cut-off distance. The energy evaluation time grows quadratically up to the number of atoms within the cut-off radius and linearly beyond that.
$$t_{nbint} = \frac{s\,a_3\,n^2}{2\,p} \;\;\text{when } n \le \tilde{n}, \qquad
t_{nbint} = \frac{s\,a_3\,n\,\tilde{n}}{2\,p} \;\;\text{when } n > \tilde{n} \qquad (4)$$

where $\tilde{n}$, the number of mass centers within the cut-off radius, depends on the density of the molecular complex volume as well as the cut-off parameter. For our simulations (see 2.5) the crossover happens for unrealistic numbers of water molecules or protein atoms, i.e. for sufficiently high values of $n$. Furthermore a reduction of the update frequency is possible to reduce the fraction of update computation arbitrarily and restore the relation of:

$$t_{nbint} \gg t_{update}$$
We summarize the total parallel time as:

$$t^{tot}_{comp} = \frac{s\,n^2}{2\,p}\left(\frac{a_2\,(1+2\gamma)}{u} + a_3\right) \quad\text{when } n \le \tilde{n} \qquad (5)$$

$$t^{tot}_{comp} = \frac{s\,n}{p}\left(\frac{a_2\,n\,(1+2\gamma)}{2\,u} + \frac{a_3\,\tilde{n}}{2}\right) \quad\text{when } n > \tilde{n} \qquad (6)$$
$t^{tot}_{comm}$, the total communication time, is the time spent by the communication processes between the client and the servers during the entire simulation. The client calls two different kinds of procedures (subroutines) that are run on the servers: the subroutine for list updates and the subroutine for energy evaluation. We enhanced the communication environment with some synchronization tools that allow us to separate the communication times properly from other computation and idle times and therefore permit us to explain all communication components precisely. More details about these synchronization tools and their underlying model are explained in [17]. Thanks to this model the resulting communication time of the client's RPCs can be decomposed into:

$$t^{tot}_{comm} = t^{call}_{upd} + t^{return}_{upd} + t^{call}_{nbi} + t^{return}_{nbi}$$

The call of either subroutine ships the coordinates of all $n$ mass centers to the servers:

$$t^{call}_{upd} = t^{call}_{nbi} = a_1\,\alpha\,n + b_1 \qquad (7)$$

In addition to the quantities defined above we define:

– $\alpha$ (alpha) is the number of bytes used to represent the coordinates of a single atom.
– $a_1$ is the communication rate including the overhead in the communication environment (Sciddle and PVM).
– $b_1$ is the communication overhead, in seconds, used to transfer an empty block from the sender to the receiver.
For the update RPC, the client does not retrieve any data from the servers when they arrive at the end of the update routine: the client just waits for a result message which assures the end of the server tasks.
$$t^{return}_{upd} = b_1, \qquad t^{return}_{nbi} = a_1\,\frac{\alpha\,n}{2} + b_1 \qquad (8)$$

These per-message times are accumulated over all $s$ simulation steps and $p$ servers to obtain $t^{tot}_{comm}$.
$t^{tot}_{sync}$, the total synchronization time, is the sum of four terms, among them:

– $t^{str}_{upd}$, the total time to synchronize the client and the servers when the update routines finish,
– $t^{str}_{nbi}$, the total time to synchronize the client and the servers when the energy evaluation routines finish.
We assume that the four different synchronization times depend neither on the number of servers nor on the problem size. Our formulation just states that each term increases linearly in the number of simulation steps and, moreover, that the contribution of $t^{str}_{upd}$ decreases as the update frequency decreases. We assume that each synchronization process takes a constant time $b_5$.
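To make the model concrete, the following C sketch evaluates our reconstruction of formulas (1)-(8) for a given parameter set. The calibration constants a1, a2, a3, b1 and b5 are placeholders that would have to be fitted to measurements as described in the following sections; the formula bodies follow our reconstruction of the model and should be read as a sketch, not as the authors' original calibration code.

```c
#include <stdio.h>

typedef struct {
    double a1, b1;   /* communication rate [s/Byte] and per-message overhead [s] */
    double a2, a3;   /* per-pair update and energy evaluation costs [s]          */
    double b5;       /* cost of one synchronization [s]                          */
    double alpha;    /* bytes per atom coordinate record                         */
} calib_t;

/* Predicted execution time of s simulation steps on p servers for a complex
   of n mass centers (water fraction gamma), cut-off population ntilde and
   update interval u. Mirrors equations (1)-(8) as reconstructed in the text. */
double t_opal(const calib_t *c, double s, double p, double n,
              double gamma, double ntilde, double u) {
    double atoms = 1.0 + 2.0 * gamma;          /* water as 3 atoms vs 1 center */
    double t_update = (s / u) * c->a2 * n * n * atoms / (2.0 * p);
    double t_nbint = (n <= ntilde)
        ? s * c->a3 * n * n / (2.0 * p)        /* quadratic regime */
        : s * c->a3 * n * ntilde / (2.0 * p);  /* linear regime    */
    double t_call   = c->a1 * c->alpha * n + c->b1;       /* eq. (7) */
    double t_return = c->a1 * c->alpha * n / 2.0 + c->b1; /* eq. (8) */
    double t_comm = s * p * ((1.0 + 1.0 / u) * t_call + t_return + c->b1 / u);
    double t_sync = 2.0 * s * c->b5 * (1.0 + 1.0 / u);    /* four barrier terms */
    return t_update + t_nbint + t_comm + t_sync;          /* eq. (1) */
}

int main(void) {
    calib_t c = { 3e-7, 5e-4, 2e-8, 4e-8, 1e-4, 24.0 };   /* invented values */
    for (int p = 1; p <= 7; p++)                          /* medium problem  */
        printf("p=%d  t=%.1f s\n", p, t_opal(&c, 10, p, 4289, 0.63, 2000, 10));
    return 0;
}
```

Fitting the five constants against measured runs turns this sketch into a usable predictor for new platforms: only a1, b1 (communication), a2, a3 (computation) and b5 (synchronization) change from machine to machine.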
2.3 Calibrating the model

The model is calibrated through experiments in which we vary the key parameters of the simulation, in particular:

– the cut-off parameter (approximation properties),
– the update frequency parameter (communication-computation balance).
For the calibration we consider one to seven servers; small, medium and large problems; full- vs. partial-updates; and two different cut-off radii (a small, effective one at 10 Å vs. a large, ineffective one at 60 Å). The response variables measured are the communication, parallel computation, sequential computation, idle time and synchronization time as listed in the composite formula. The experiments always run on a dedicated system and therefore there is no overhead on the measurements due to a time sharing environment. In a few preliminary tests, every measurement has been repeated several times. The tests have confirmed a low variability and a good reproducibility of the execution times, and we therefore have concluded that ten simulation steps suffice to assure an accurate and meaningful timing of an entire simulation of the protein folding process, which we call a single experiment or case in this paper.
2.4 Understanding the execution of Opal with our model
We investigate the performance of the Opal code by measurements of the simulation execution times of two molecular complexes with different sizes, broken down into the parallel computation time, the sequential computation time, the communication time, the synchronization time, and the idle time. We measure the detailed breakdown of the wall clock execution time for ten simulation steps.
The first molecular complex is a medium size example of the simulation problems that Opal can handle: it is the complex between the Antennapedia homeodomain from Drosophila and DNA [8], composed of 1575 atoms and immersed in 2714 water molecules, for a total of 4289 mass centers (medium problem size). The second molecular complex is considered to be a large size problem: it is the NMR structure of the LFB homeodomain, composed of 1655 atoms and immersed in 4634 water molecules, a total of 6289 mass centers (large problem size).
We run the code for different levels of parallelism: the number of servers ranges from one to seven. At the same time we measure the execution times when the simulation is fully accurate and the computation complexity is quadratic in the protein size (i.e. no cut-off) and when the simulation is approximate and consequently the computation complexity becomes linear (i.e. with cut-off). Finally, we investigate the role of the list updates: we run the simulation either with an update of our lists upon every iteration (full update) or with a partial update every 10 iterations (partial update). The comparison of the different cases permits us to study the scalability, i.e., the execution times as they depend on increasing parallelism, and to investigate the precise impact of the different problem sizes, the frequency of the list updates and the values of the cut-off parameter on the performance of the simulation.
Figures 1a)-d) display a detailed breakdown of the wall clock execution time for 10 simulation steps in the medium size molecular complex with different choices for the number of servers, the cut-off and the update parameters. The chart in Figure 1a) shows that without cut-off, the time in parallel computation is the largest fraction of the execution time and that it decreases as expected when more servers are added. At the same time the communication time increases about linearly with the number of servers, but its overall contribution remains small, even for seven servers. The synchronization time and the sequential computation time remain insignificant to the overall execution time. However, to the surprise of the Opal implementors, our instrumentation reveals a load balancing problem for runs with an even number of processors. Figure 1b) shows an Opal execution with reduced updates. As expected, the lower update frequency does not affect the overall performance of the simulations much, because the large amount of parallel computation dominates the execution time. Figure 1c) shows a simulation with an effective cut-off parameter (at 10 Å). The cut-off parameter determines the asymptotic computational complexity: the amount of parallel computation is smaller than in the cases above and its overall contribution becomes comparable to the other measured execution times. The sequential computation time, the synchronization time and the communication time gain a high importance for the overall performance. Figure 1d) displays a run with both the cut-off and the partial update option in effect. The frequency of the list updates leads to a notable difference in the performance of simulations with small cut-off radii. The problem size itself, the number of atoms of the whole molecular complex, has a varying impact on the different components of the execution time. The time components (the parallel computation time, the communication time, the idle time, the sequential computation time and the synchronization time) each increase in a different way with the number of atoms: while the size of the problem has a super-linear impact on the parallel computation time and the idle time, it has only a linear, moderate impact on the sequential computation time and the communication time. Figures 2a)-d) show the detailed breakdown of the wall clock execution time for 10 simulation steps in the large size molecular complex. The order of the measured execution time doubles as we increase the problem size from a medium to a large molecular complex.
[Figure 1: Execution time of Opal on the Cray J90 for the medium molecule with 1-7 servers; panels: a) no cut-off, full update; b) no cut-off, partial update; c) with cut-off, full update; d) with cut-off, partial update. Each bar is broken down into parallel computation time, scalar computation time, idle time, communication time and synchronization time.]

[Figure 2: Detailed breakdown of the measured execution time for 10 iterations of an Opal simulation of the large molecule, without a cut-off parameter (i.e. for each atom all the interactions are considered) versus with a 10 Å cut-off (i.e. only interactions within the cut-off range are considered), for full and partial updates.]

2.5 Validation of the model against the measurements

Figures 4a)-d) show the wall clock times measured for the execution on a Cray J90 SMP against the times predicted by the analytical model for the same machine with different numbers of servers, different cut-off radii, update frequencies and large or medium molecular complexes. For each one of these cases the parameters of the analytical model have been fitted to the corresponding measurements with a least square fit. For brevity of the paper, we list only the data of a reduced design, although the data was achieved with a full factorial design of 84 experiments. During the calibration the differences between model and measurement have been investigated and plotted for each case. The overall fit between model and measurement for the cases in Figures 4a)-d) and for the remaining cases is excellent. The full data is listed in [17].
[Figure 3: Parameter space model of Opal molecular simulations: the axes span the update frequency (full update (1) vs. partial update (0.1)), the cut-off radius (no cut-off (60 Å) vs. effective cut-off (10 Å)) and the problem size (small, medium, large); the legend distinguishes cases with calibration data shown in the paper from cases whose calibration data is available in the extended report.]
2.6 Space complexity of Opal
Execution time is not the only complexity measure that can be treated in this manner. A space complexity model for the memory usage is largely orthogonal to the execution time model.
Memory management remains a highly complex issue in the parallelization. The parallel Opal has been designed to use memory in the most economical way: each server holds only a part of the non-bonded interaction pairs of atoms. The dimension of the data lists on each server scales down linearly with the number of processors.
On the other hand, each server needs the same global data (information about the solute-solute, solute-solvent, and solvent-solvent non-bonded interactions), whose amount depends on the problem size, i.e., the number of water and molecular atoms. This global information is constant and does not scale up or down with the number of processors. The computation of these global data structures on each server involves a duplication of work but saves some communication. Furthermore, this computation of the constants takes place at the beginning of the simulation, and its cost is amortized over all time steps of the entire simulation. Once the global data is initialized, each server can execute its computation largely independently, because it evaluates its partial non-bonded interaction terms without requesting further parameters from the client except for some updated values of the atom coordinates. The size of the data structures in Opal grows with the problem size as shown in the table below:
Data structure       Order         Constant c [Bytes]   Large example, 6290 mass centers [Bytes]
pair list            n²/2          2*4                  160'000'000
atom coordinates     (1+2γ)n       3*8                  400'000
atom gradients       (1+2γ)n       3*8                  400'000
atom interactions    (1-γ)²n²      2*8                  40'000'000
energy values        1             2*8                  16

The leading entries check out: 6290²/2, i.e. roughly 19.8 million pairs at 2*4 Bytes each, gives about 160 MByte for the pair list, and with γ ≈ 0.74 the roughly 15'500 atom records of 3*8 Bytes amount to about 400 KByte of coordinates.
A more accurate space complexity model would only be useful if some interesting tradeoffs between space and time complexity could be identified. We did not find any interesting time-space tradeoffs, except for the obvious size of the working sets that influences execution speed through effects of the memory hierarchy, like the swapping of real physical memory (DRAM) for large virtual memories and the effects of the two levels of caches.
We ran some tests with the single processor version of Opal on our Pentium PC platforms to investigate the computational performance of the most significant loop (comp...).

[Figure 4: The difference between the wall clock times measured and the times predicted by the analytical model; panels: a) large molecule, no cut-off, full update; b) large molecule, no cut-off, partial update; c) large molecule, with cut-off, full update; d) medium problem, with cut-off, partial update; each panel plots measurement vs. analytical model over 1-7 servers.]
The absolute and relative computational performance based on different working sets is stated in the subsequent table:

Working set    Size        Computational rate on Pentium 200 [MFlop/s]   Relative
in cache       50 KByte    35                                            1.09
in core        8 MByte     32                                            1.00
out of core    120 MByte   8                                             0.25
After a few trial runs with different memory configurations, it appeared to us that blocking Opal for the caches is not beneficial for enhancing the performance. It seems that the inner loop of Opal remains CPU limited rather than memory limited. This observation is also confirmed by the nearly doubled performance on twin processor PC nodes (see upcoming sections). The performance breakdown for the out of core version of Opal is so drastic that such problem sizes would push the execution time immediately beyond the limit of acceptable turnaround for one simulation.
On most systems we could use the performance monitoring hardware to account more accurately for cache misses and strongly confirm the validity of our experiments above. Again these observations emphasize the need to instrument middleware for performance monitoring right from the beginning of an application's life cycle, i.e. as soon as the original application code is designed, implemented and parallelized. The authors of this paper can think of a few suitable blocking code transformations that would enhance locality for a better use of the caches, but we stopped our investigation after the trials mentioned, since we are no longer in control of the Opal production code distribution and therefore see no path to ship our improvements into the real world.
3 Integrating performance instrumentation with application design
3.1 The crux of overheads and loss of transparency due to middleware
The client server structure of parallel Opal is ideally suited for Sciddle [4], a remote procedure call (RPC) system extension to the PVM communication library. Sciddle comprises a stub generator (the Sciddle compiler) and a run-time library. The stub generator reads the remote interface specification, i.e., the description of the subroutines exported by the servers, and generates the corresponding communication stubs. The stubs take care of translating an RPC into the necessary PVM message passing primitives. The application does not need to use PVM directly for RPCs: the client simply calls a subroutine (provided by the client stub), and the Sciddle run-time system invokes the corresponding server subroutine (via the server stub). It was, however, a deliberate decision in Sciddle to let the application writers code the process management (starting and terminating of servers) directly with PVM calls. Therefore a Sciddle application still needs to use a few PVM calls at the beginning and the end of a run.
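For readers unfamiliar with PVM, the following C sketch illustrates roughly what a generated client stub does for a call that ships the atom coordinates and waits for partial energies. The routine name and message tags are invented for illustration; only the pvm_* calls are actual PVM library primitives, and the real Sciddle-generated stubs are more general than this.

```c
#include <pvm3.h>

#define TAG_CALL_NBI   42   /* invented message tags, one per remote routine */
#define TAG_RETURN_NBI 43

/* Hand-written equivalent of a client stub for an "evaluate non-bonded
   energies" RPC: pack the arguments, send them to one server task and
   unpack the reply. Sciddle generates this kind of code automatically
   from the remote interface specification. */
void call_energy_eval(int server_tid, double *coords, int n,
                      double *evdw, double *ecoul) {
    pvm_initsend(PvmDataDefault);         /* new send buffer, XDR encoding  */
    pvm_pkint(&n, 1, 1);                  /* marshal the argument list      */
    pvm_pkdouble(coords, 3 * n, 1);       /* 3 coordinates per mass center  */
    pvm_send(server_tid, TAG_CALL_NBI);   /* invoke the remote routine      */

    pvm_recv(server_tid, TAG_RETURN_NBI); /* block until the reply arrives  */
    pvm_upkdouble(evdw, 1, 1);            /* unmarshal the partial energies */
    pvm_upkdouble(ecoul, 1, 1);
}
```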
Why to use middleware like Sciddle

Sciddle is a highly portable communication library. It has been ported to Linux PCs, UNIX workstations, the Intel Paragon, and supercomputers like the Cray J90 and the NEC SX-4. In particular, Sciddle supports both PVM systems available for the Cray J90 SMPs, the network PVM and the shared memory PVM. Based on our experiences, the Sciddle/PVM combination might sound like a very suboptimal solution for a single J90 Classic system, which also supports coherent shared memory well within a single system. However, at the time Opal was in development, our site was operating four Cray J90s interconnected by HIPPI, and the developers certainly had plans to use their parallel Opal version on a cluster of four J90 SMPs with 48 processors total. For such a cluster of SMPs message passing is a must and shared memory would not do. For most application codes, the additional overhead of Sciddle is very small [3], but Sciddle causes a lack of control over the PVM option flags and PVM internal operations (i.e. the proper use of data in place and shared memory flags). With a specific synthetic RPC test, Sciddle runs communication at about 7 MByte/s, which is just about as much as the Sciddle developers got out of a PVM ping-pong on the J90 [3]. Therefore we attribute the disastrously low communication performance of the J90 (an SMP machine with a fast crossbar) to the unpredictable performance of the Cray PVM implementation and the unfortunate interaction between middleware and communication library.
We understand that, due to its internal architecture and its API, PVM is far away from a zero-copy message passing system. Therefore we suggested to the developers that Opal be rewritten in a clean "post-in-advance" style of MPI programming.
3.2 A plea for including hardware performance instrumentation into middleware
The tasking facilities of PVM and Sciddle interfere with the normal use of performance monitoring tools such as the HPM command on the Cray J90 systems or the corresponding tools on the Cray T3E MPP or Intel PC platforms. We worked with the implementors of Sciddle to integrate queries to the low overhead counter device (e.g. /dev/hpm) into the Sciddle code and to undertake the necessary accounting of the number of floating point operations executed and of the clock cycles used in the client and in the servers. In a setting with a high level abstraction RPC model, the performance instrumentation must adhere to the same high level abstractions and therefore be integrated into the middleware as well as the application code. The question of a good API standard for performance monitoring instrumentation is still an open one. Debuggers pose a very similar software engineering problem for the parallel programming world.
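A minimal sketch of the kind of hook we argue for is shown below. The counter device interface is highly platform specific; the read-a-struct protocol, the struct layout and the function names here are assumptions for illustration, not the actual Cray /dev/hpm interface or the real Sciddle code.

```c
#include <fcntl.h>
#include <unistd.h>

/* Assumed layout of one counter sample; the real record returned by a
   counter device like /dev/hpm is platform specific. */
typedef struct { unsigned long long flops, cycles; } hpm_sample_t;

static int hpm_fd = -1;

/* Called once from the middleware start-up code. */
int hpm_open(void) {
    hpm_fd = open("/dev/hpm", O_RDONLY);   /* hypothetical device node */
    return hpm_fd >= 0 ? 0 : -1;
}

/* Called right before and right after each stub invocation, so that the
   middleware can attribute flops and cycles to individual RPCs. */
int hpm_read(hpm_sample_t *s) {
    return read(hpm_fd, s, sizeof *s) == sizeof *s ? 0 : -1;
}

/* Accumulate the difference of two samples into per-RPC accounts. */
void hpm_account(const hpm_sample_t *before, const hpm_sample_t *after,
                 unsigned long long *flops, unsigned long long *cycles) {
    *flops  += after->flops  - before->flops;
    *cycles += after->cycles - before->cycles;
}
```

The ratio of the two accumulated counters yields exactly the simple operations-per-cycle metric that the next paragraph argues for.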
Sampling based tools give a direct estimate of the compute rate in MFlop/s and are easy to use, but they are extremely complex to understand in sufficient depth. Sampled computation rates are no substitute for the simple ratio of operations counted divided by the cycles used. The characteristic performance of the different machines for Opal runs in Table 1 in Section 4 shows how difficult it is to measure MFlop counts. The number of floating point operations required to compute exactly the same application results differs significantly, because of vectorizing transformations and the different implementations of intrinsic functions like sqrt() and exponentiate(). With a stand-alone performance monitoring tool we would have just believed the MFlop/s figures measured and had possibly never learned of that fact, but just wondered why the MFlop/s numbers did not make much sense.
3.3 The disadvantage of overlapped communication and computation
There is no doubt that the Sciddle package accelerated the development of the application and that its additional overhead stays well within acceptable limits compared to the PVM message passing system. However, packages like Sciddle support and encourage the overlap of computation and communication, preventing a detailed quantification and correct accounting of the elapsed time for local computation, communication and idle waits due to load imbalance. In the parallel programming framework Sciddle was conceived for, it might be easy to measure and accumulate high level metrics like the server computation rate and the client computation rate for the entire application program, but low level indicators like communication efficiency, idle times, and load imbalance of single parts are much harder to get. The latter metrics are the more relevant ones in the performance analysis. As a solution to the difficulties of measuring and quantifying overheads in Sciddle, we propose a modification of the timing and synchronization behavior. The Sciddle environment does not provide explicit synchronization tools, but it allows direct communication with the underlying PVM environment and therefore permits explicit synchronization. We introduce additional PVM barriers to separate the communication clearly from the computation: with these changes to the Sciddle communication environment, it is possible to measure or compute all these metrics directly. The barrier function lets the servers synchronize themselves explicitly with each other.
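The following C fragment sketches this timing discipline under stated assumptions: every server has previously joined one PVM group (the group name and the surrounding routines are invented), and a barrier is placed between the computation and the communication phase of each step so that the wall clock intervals on either side can be attributed unambiguously. pvm_joingroup() and pvm_barrier() are the real PVM group primitives.

```c
#include <sys/time.h>
#include <pvm3.h>

#define NSERVERS 7                /* servers participating in one run */

extern void compute_partial_energies(void);      /* application work  */
extern void send_partial_results_to_client(void);/* result messages   */

static double now(void) {         /* wall clock time in seconds */
    struct timeval tv;
    gettimeofday(&tv, 0);
    return tv.tv_sec + 1e-6 * tv.tv_usec;
}

/* One instrumented simulation step on a server: the barrier separates
   local computation cleanly from the result communication, at the price
   of giving up the overlap between the two. */
void instrumented_step(double *t_comp, double *t_comm) {
    double t0 = now();
    compute_partial_energies();
    double t1 = now();
    pvm_barrier("opal_servers", NSERVERS); /* all servers reach this point */
    double t2 = now();
    send_partial_results_to_client();
    double t3 = now();

    *t_comp += t1 - t0;   /* pure computation   */
    *t_comm += t3 - t2;   /* pure communication */
    /* t2 - t1 is idle wait, i.e. the load imbalance exposed by the barrier */
}
```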
Many papers have been written to show how to eliminate barriers and permit more overlap of communication and computation, but the potential benefit of overlap is often overestimated because of memory system bottlenecks in most machines. For the optimal accounting of times among the client and the servers we are forced to give up some of the overlap. To us, the accuracy, predictability and tight control of performance appear more important, and we happily accept a small slowdown (less than 5%) over the overlapped application for the sake of a solid understanding of what is going on with performance in the code.
The use of a shared communication channel between servers and client introduces contention due to the limited resource at the end of a client compute phase. This contention is application specific and it is likely to surface with the original Sciddle implementation if all the servers perform exactly the same amount of work (i.e. when there are no server idle times). The barriers in the modified Sciddle framework do not actually cause this effect, but merely expose this contention of the communication between the single client and the multiple servers in all cases.
4 Performance Prediction for Alternative Platforms
In this last part of our paper, we use our analytic model together with some standard performance data of alternative computer platforms to predict the performance of Opal in the case that we could port the code to that platform. Two different classes of MPPs (Massively Parallel Multi-Processors) are considered for our study in addition to the real Opal platform, the Cray J90: first, the Cray T3E, a "big iron" MPP, and second, three different flavors of PC clusters called slow CoPs (Clusters of PCs), SMP CoPs and fast CoPs. We named the one PC cluster slow CoPs since it is optimized for lowest cost and gains its performance from a large number of slower nodes; weakly connected by a shared 100BaseT Ethernet medium, its uniprocessors are older Intel Pentium Pro PCs running at 200 MHz. The SMP CoPs platform is based on similar Intel Pentium Pro processors, but in a twin processor configuration (2x200 MHz) and interconnected by an improved SCI shared memory interconnect technology. Finally, the fast CoPs cluster features single 400 MHz Intel Pentium Pro PCs as nodes, connected by a Gigabit/s communication system based on fully switched Myrinet interconnects. Comparable Clusters of PCs installations are described in [5, 6, 16].
4.1 Extraction of model parameters for alternatives
As shown in Section 2.2, the parameters of our analytic model have been intentionally chosen in a way that includes all major technical data usually published for parallel machines. This includes among others: message overhead, message throughput for large messages, computation rate for SAXPY inner loops and the time to synchronize all participating processors. For each new platform we determine the key parameters by the execution of a few micro-benchmarks, verified against published performance figures [9]. An overview of the data used is found in Tables 1 and 2.
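As an example of such a micro-benchmark, the following self-contained MPI ping-pong in C measures the two parameters of the communication model: half the round trip time of an empty message estimates the per-message overhead b1, and the slope of time over message size for large messages estimates the inverse bandwidth a1. This is our illustration of the standard technique, not the original benchmark code used for the study.

```c
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define REPS 1000

/* Ping-pong between ranks 0 and 1 over a range of message sizes. */
int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int bytes = 0; bytes <= 1 << 20; bytes = bytes ? bytes * 4 : 1) {
        char *buf = malloc(bytes ? bytes : 1);
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < REPS; i++) {
            if (rank == 0) {
                MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double half_rtt = (MPI_Wtime() - t0) / (2.0 * REPS);
        if (rank == 0)
            printf("%8d bytes: %.2f usec\n", bytes, 1e6 * half_rtt);
        free(buf);
    }
    MPI_Finalize();
    return 0;
}
```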
MPP Type                   Time [s]   Floating Point Rate on single node [MFlop/s]   Relative Computation Rate
Cray T3E-900 (450 MHz)     9.56       85                                             52
Cray J90 (100 MHz)         6.18       80 (a)                                         80
Slow CoPs (200 MHz)        10.00      32                                             50
SMP CoPs (2*200 MHz)       5.00       65                                             100
Fast CoPs (400 MHz)        4.85       67                                             102

Table 1: Execution time of the dominant Opal routine on a single node of each platform and the resulting absolute and relative floating point rates.

(a) For fall 98 our J90 Classics are scheduled for an upgrade to the new vector processors that significantly enhance the computational throughput. The performance improvement over the classic processor is expected to be six-fold.

[Table 2: Communication performance of the platforms as used for the prediction: Cray T3E-900 (MPI), slow CoPs (Ethernet), SMP CoPs (SCI) and fast CoPs (Myrinet); the columns give the observed throughput on a single node in MByte/s and the observed message latencies, ranging from about 25 µsec on Myrinet to about 10 msec on the shared Ethernet.]
As for many scientific codes, one routine of Opal dominates the compute performance. This routine has been benchmarked on each platform using the most accurate cycle counters and floating point performance monitoring hardware that is actually present on all four machine types. The most important surprise has been a significant difference in floating point operation counts on the different platforms, although the arithmetic was 64 bit in all cases and the results were precisely identical (or within the floating point epsilon for comparisons between Cray and IEEE arithmetic). The differences are due to the different compilers and different runtime libraries with intrinsic functions. We eliminate this difference by assuming that the best compiler (i.e. the PGI compiler for the PCs [10]) is setting a lower bound for the computation: we adjust the local computation rate (MFlop/s) of the other platforms accordingly.
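In other words, the comparable rate of a platform is the flop count of the best compiler divided by the platform's measured run time. A tiny helper, with names of our choosing, makes the adjustment explicit:

```c
/* Rate that is comparable across platforms: charge every machine only for
   the floating point operations the best compiler needs for the same run. */
double adjusted_mflops(double best_compiler_mflop, double measured_time_s) {
    return best_compiler_mflop / measured_time_s;
}
```

Whatever number of operations the vectorized code actually issues, a platform that needs t seconds for a run the best compiler completes with W MFlop is credited with W/t MFlop/s.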
The communication performance is even more difficult to compare in real applications. Some unfortunate interactions between the middleware and the PVM library reduce the measured communication rate on the Cray J90 processor to about 3 MByte/s, despite its more than one GByte/s strong crossbar interconnect between the 8 processor boards and the memory banks. The authors of the Sciddle middleware claim that they measured up to 7 MByte/s for a synthetic Sciddle RPC example and that this rate matches just about the performance of raw PVM 3.0 on the same machine [3]. It certainly remains below what this machine is capable of in shared memory mode.
We suspect that with the right configuration of PVM flags, or at least with a rewrite of the middleware to use MPI in true zero copy mode, we could significantly improve the performance of Opal on the J90, but such work is outside the scope of this performance study. For the other platforms we assumed an MPI or PVM based re-implementation without Sciddle and deduced our performance numbers mainly from MPI micro-benchmarks gathered by our students and from similar numbers published by independent researchers on the Internet (e.g. [9]).
4.2 Discussion
The complexity model incorporates the key technical data of most parallel machines as parameters. Therefore it is well suited for performance prediction.
In the first two graphs of Figures 5a)-d) we look at the predicted execution times for 10 Opal iterations with a medium size molecule. Platforms include the Cray T3E MPP, the Cray J90 vector SMP (reference) and three clusters of PCs (fast CoPs, SMP CoPs and slow CoPs). Since we also list the absolute execution time in seconds, we can directly compare the performance of all platforms when 1-7 processors are used in Charts 5a) and 5c). The success or failure of the parallelization of the Opal code becomes most evident when we plot the relative speed-up with 1-7 processors in Charts 5b) and 5d). The well specified synchronization model guarantees that we are not subject to the pitfalls of a badly chosen uni-processor implementation.
In the upper Charts 5a) and 5b) the cut-off radius is too large to reduce the computation and therefore the runs are largely compute bound. The execution time reflects the different compute performance of the different node processors, with a slight edge for the SMP CoPs architecture which diminishes as the number of processors increases. An entirely compute bound operation inevitably leads to excellent speedup, as seen in Chart 5b).
The main users of Opal in molecular biology assured us repeatedly that for certain problems a simulation with a 10 Å cut-off parameter is accurate enough to give new insights into the proteins studied. Therefore we ran the second test case of the same molecule with the computation reduced by the cut-off. In the lower two Charts 5c) and 5d) the computation is accelerated with an effective cut-off parameter and therefore gradually becomes communication bound as the parallelism increases. In this case the communication performance of the machine does matter a lot. The Cray J90 and the slow CoPs (Ethernet) cluster of PCs are severely limited by their slow communication hardware or by their bad software infrastructure for message passing. This is visible in the predicted execution times: as soon as the number of processors increases and exceeds three, the overall execution time of the application on the Cray J90 and the slow CoPs (200 MHz with Ethernet) is no longer decreasing but rather increasing. The increase of the communication time offsets any gain due to parallel execution and leads to an overall loss of performance for a larger number of nodes. This aspect is displayed by some speed-up curves in Chart 5d) which actually turn into slow-down curves when too many nodes are added. For these two architectures we achieve no benefit in putting more than three processors to work.
For a small number of processors the SMP CoPs and fast CoPs architectures start out with a better execution time than the big MPP and vector SMP irons, possibly due to the better compiler. However, with the increase of the number of processors, the speed of the Cray T3E MPP catches up quite rapidly due to the better communication system. This trend is also evident in the speed-up curves, where the Cray T3E architecture achieves better gains and almost ideal speed-up. For all platforms with a good communication system we can scale the application nicely to 7 processors with a speed-up of 4 or greater.
[Figure 5: Predicted execution time and relative speed-up of Opal on the different platforms (Cray T3E, Cray J90, fast CoPs, SMP CoPs, slow CoPs) for an Opal simulation of a medium problem size molecule over 1-7 servers; panels: a) execution time, no cut-off, full update; b) speed-up, no cut-off, full update; c) execution time, with cut-off, full update; d) speed-up, with cut-off, full update.]
As we can see in the two Graphs 5c) and d), speed-up curves cannot be interpreted properly without looking at the absolute execution times simultaneously; while the Cray T3E MPP has by far the best speed-up, it still ends up behind the fast CoPs and SMP CoPs when comparing absolute performance for seven servers.

The same performance and scalability relationships are reflected in Figures 6a)-d) for a large size problem. The charts show predicted execution times and speed-ups for a large problem. A comparison between Charts 6a)-d) and 5a)-d) shows that the behavior of the execution time remains quite similar to the medium size problem. At the same time we notice that the increase of the amount of computation for a large size problem leads to slightly better speed-ups in Chart 6b). Still, both charts indicate flat speed-up for more processors due to overhead in the communication systems. In Chart 6d) we do not have the extreme slowdown seen in Chart 5d), but we can conclude that the increase of the amount of computation has just pushed the point of the breakdown further outwards on the curve. With a larger number of processors we would probably encounter the same saturation point at which adding processors would stop to increase performance.
[Figure 6: Predicted execution time and relative speed-up of Opal on the different platforms (Cray T3E, Cray J90, fast CoPs, SMP CoPs, slow CoPs) for an Opal simulation of a large problem size molecule over 1-7 servers; panels: a) execution time, no cut-off, full update; b) speed-up, no cut-off, full update; c) execution time, with cut-off, full update; d) speed-up, with cut-off, full update.]
5 Conclusion
Our case study of Opal showed common problems with the performance instrumentation in an application setting with RPC middleware for parallelization and PVM communication libraries. Some middleware had to be instrumented with hooks for performance monitoring, and the overlap of communication and computation had to be restricted slightly for a reliable accounting of execution times. We can state three potential benefits of the integrated approach to accurate performance evaluation, modeling and prediction in parallel programming. Firstly, the analytic complexity model and a careful instrumentation for performance monitoring lead to a much better understanding of the resource demands of a parallel application. We realize that the basic application without cut-off is entirely compute bound and therefore parallelizes well regardless of the system. The optimization with an approximation algorithm using an effective cut-off radius changes the characteristics of the code into a communication critical application that requires a strong memory and communication system for good parallelization. Secondly, we discovered interesting anomalies in the implementation, e.g. the load imbalance for even numbers of servers and the differing number of floating point operations on different processors. Thirdly, we can use our model to predict with good certainty how the application would run on slow CoPs, SMP CoPs and fast CoPs, three low cost Cluster of PCs platforms connected by Gigabit networks like SCI or Myrinet. The migration of the Opal simulation code to the cluster of PCs platform could potentially free our upgraded Cray J90 SMP vector machines for more complex and memory intensive computations with less regularity. The predicted execution times and speedup figures indicate that a well designed cluster of PCs achieves similar, if not better performance than the J90 Classic vector processors currently used for Opal and that the computational efficiency compares favorably even to the T3E-900 for this particular application code.
Acknowledgments
We would like to express our thanks to all the people who helped us during this work. We are very grateful to Peter Arbenz, Walter Gander, Hans Peter Lüthi, and Urs von Matt, who created Sciddle to parallelize Opal, for their help, and particularly to Peter Arbenz and Urs von Matt for reading carefully through several drafts of our work. We sincerely thank Martin Billeter, Peter Güntert, Peter Luginbühl and Kurt Wüthrich, who created Opal, and particularly Peter Güntert for his help and his chemistry advice. We thank Carol Beaty of SGI/CRI and Bruno Löpfe of the ETH Rechenzentrum, who helped with our many questions about the Cray J90 and Cray PVM. We are also very grateful to Nick Nystrom and Sergiu Sanielevici of the Pittsburgh Supercomputing Center, who sponsored our parameter extraction runs for the performance prediction of the Cray T3E-900.
References
[1] P. M. Alsing. N-body problem: Force decomposition method. 1995. http://www.phys.unm.edu/phys500/lecture4/forcedecomp

[12] P. Luginbühl, P. Güntert, and M. Billeter. OPAL: User's Manual Version 2.2. ETH Zürich, Institut für Molekularbiologie und Biophysik, Zürich, Switzerland, 1995.

[13] P. Luginbühl, P. Güntert, M. Billeter, and K. Wüthrich. The new program OPAL for molecular dynamics simulations and energy refinements of biological macromolecules. J. Biomol. NMR, 1996. ETH-BIB P820203.

[14] D. O'Hallaron, J. Shewchuk, and T. Gross. Architectural implications of a family of irregular applications. In Proc. 4th Symp. on High Performance Computer Architecture, Las Vegas, Jan 1998. IEEE. An extended version appeared as Technical Report CMU-CS-97-198, Carnegie Mellon School of Computer Science.

[15] S. Plimpton and B. Hendrickson. A new parallel method for molecular dynamics simulation of macromolecular systems. Sandia Technical Report SAND94-1862, 1994.

[16] P. Sobalvarro, S. Pakin, A. Chien, and W. Weihl. Dynamic coscheduling on workstation clusters. In Proceedings of the International Parallel Processing Symposium (IPPS'98), March 30 - April 3, 1998.

[17] M. Taufer. Parallelization of the software package Opal for the simulation of molecular dynamics. Technical report, Swiss Federal Institute of Technology, Zurich, 1996.

[18] U. von Matt. Sciddle 4.0: User's guide. Technical report, Swiss Center for Scientific Computing, Zurich, 1996.

[19] P. K. Weiner and P. A. Kollman. AMBER: Assisted model building with energy refinement. A general program for modeling molecules and their interactions. J. Comp. Chem., (2), 1981.
Author biographies
Michela Taufer received her bachelors and masters degrees in computer science engineering from the University of Padua, Italy in 1996. She is currently a doctoral student at the Swiss Federal Institute of Technology (ETH) in Zürich, Switzerland and is working on high performance computing and database applications for clusters of PCs.

Thomas Stricker is currently an assistant professor of computer science at the Swiss Federal Institute of Technology (ETH) in Zürich. His research group is investigating architectures and applications of clusters of PCs that are interconnected with gigabit interconnect technologies. Thomas Stricker attended Carnegie Mellon University in Pittsburgh, USA for his Ph.D. studies, where he participated in several large systems building projects including the construction of the iWarp parallel machines. He also holds undergraduate degrees from ETH in Zürich and is a member of the ACM SIGARCH, SIGCOMM and the IEEE Computer Society.