
Accurate Performance Evaluation, Modeling and Prediction of a Molecular Simulation coded with Message Passing Middleware

Appears in: Proceedings of Supercomputing 98, IEEE/ACM Conference on Supercomputing, 7-13 Nov 1998, Orlando, FL, USA


Michela Taufer and Thomas Stricker
Laboratory for Computer Systems
Swiss Institute of Technology (ETH)
CH-8092 Zuerich, Switzerland
taufer@inf.ethz.ch, tomstr@acm.org

Abstract

In distributed and vectorized computing there is a large number of highly different supercomputing platforms an application could run on. Therefore most traditional parallel codes are ill equipped to collect data about their resource usage or their behavior at runtime, and the corresponding data are rarely published. Only few computational scientists explore the interactions of their target platforms with their applications systematically. As an improvement over the current state of the art, we propose an integrated approach to performance evaluation, modeling and prediction for different platforms. Our approach relies on a combination of analytical modeling and systematic experimentation with full application runs, application kernels and some benchmarks. We outline our methodology of performance assessment with Opal, an example code in molecular biology, developed at our institution to run in parallel on Cray J90 "Classic" vector SMPs. Besides a detailed assessment of the performance achieved on these J90s, the primary goal of our study was to find the more suitable and potentially more cost effective hardware platform for the application, in particular to check the suitability of this application for slow CoPs, SMP CoPs and fast CoPs, three flavors of Clusters of PCs built with off-the-shelf microprocessors at our computer systems laboratory. The performance assessment based on our model is much easier than porting and parallelizing the application for a new target machine, and so we could also obtain and include performance estimates for a T3E-900, a high end MPP system. The predicted execution times and speedup figures indicate that a well designed cluster of PCs achieves similar if not better performance than the J90 vector processors currently used and that the computational efficiency compares favorably to the T3E-900 for that particular application code.

1 Introduction

A large variety of different parallel computing platforms makes it quite difficult to pick the best suited and most cost effective parallel computer as a target platform for a new application code to run on. The question of the best platform for an application should be addressed early in the design process, as soon as the characteristics of the application code become known during early prototyping. Typically the scientists in the application field are only interested in the results of the computation itself and rarely in how fast and how efficiently they were computed. Therefore most traditional parallel codes are ill equipped to collect data about their resource usage or their behavior at runtime, and the corresponding data are rarely published. There are a few exceptions to this rule - like [14], a paper devoted entirely to the characteristics of a large FEM application.

Most machines in high performance computing and even today's PCs have good hardware instrumentation to collect all the necessary data, but most system vendors don't promote direct access to them. Instead they provide high-level performance tuning and advisory tools with little information about their resolution, their accuracy and their theory of operation. Furthermore those tools often interfere with the tasking support of the parallelization and communication tools (i.e. the middleware for parallelization). Other common problems include the client/server paradigm with full overlap of computation and communication and many latency hiding mechanisms, which make accurate and detailed performance measurements almost impossible.


As an improvement over the current state of the art, we propose and demonstrate an integrated approach to application design, parallelization and performance analysis using a combination of analytical modeling and measurements. We studied this integration with Opal [13], an example code in molecular biology, developed at our institution to run on our four Cray J90 "Classic" vector SMPs, with 8-16 processors each. Besides a detailed assessment of the performance achieved on the J90s, the primary goal of our study was to find the most suitable and most cost effective hardware platform for the application, in particular to check its suitability for slow CoPs, SMP CoPs and fast CoPs, three flavors of Clusters of Pentium PCs built by our computer architecture group.

In Chapter 2 we develop and present an analytical complexity model to predict the execution time for Opal simulations with different input parameters. We calibrate the model with a systematic experimental design and show what we learned from a detailed analysis of computation and communication performance. In Chapter 3 we discuss the integration of performance monitoring into middleware packages and the problem of accurate accounting despite overlapped communication and computation. In Chapter 4 we use the model together with some architectural key data to predict the efficiency of Opal on alternative platforms including Clusters of PCs.

2 Case study: Instrumenting a molecular biology code

2.1 A brief description of Opal

Opal is a software package to perform the simulation of the molecular dynamics of proteins and nucleic acids in vacuum or in water through energy minimization. Opal uses classical mechanics, i.e., the Newtonian equations of motion, to compute the trajectories r_i(t) of n atoms as a function of time t. Newton's second law expresses the acceleration as:

m_i \frac{d^2 r_i(t)}{dt^2} = - \frac{\partial V(r_1(t), \ldots, r_n(t))}{\partial r_i(t)}

A typical function V has the form:

V(r_1, \ldots, r_n) = \sum_{all bonds} \tfrac{1}{2} K_b (b - b_0)^2
  + \sum_{all bond angles} \tfrac{1}{2} K_\Theta (\theta - \theta_0)^2
  + \sum_{improper dihedrals} \tfrac{1}{2} K_\xi (\xi - \xi_0)^2
  + \sum_{dihedrals} K_\varphi (1 + \cos(n\varphi - \delta))
  + \sum_{all pairs (i,j)} \left[ \frac{C12_{ij}}{r_{ij}^{12}} - \frac{C6_{ij}}{r_{ij}^{6}} + \frac{q_i q_j}{4\pi\varepsilon_0\varepsilon_r r_{ij}} \right]

The first term models the covalent bond-stretching interaction along bond b. The value of b_0 denotes the minimum-energy bond length, and the force constant K_b depends on the particular type of bond. The second term represents the bond-angle bending (three-body) interaction. The (four-body) dihedral-angle interactions consist of two terms: a harmonic term for dihedral angles ξ that are not allowed to make transitions, e.g., dihedral angles within aromatic rings or dihedral angles to maintain chirality, and a sinusoidal term for the other dihedral angles ϕ, which may make 360° turns. The last term captures the non-bonded interactions over all pairs of atoms. It is composed of the van der Waals and the Coulomb interactions between atoms i and j with charges q_i and q_j at a distance r_ij.
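The last term dominates the computation in practice. As an illustration, a minimal Python sketch (not the Opal FORTRAN code; the names c12, c6, q_i, q_j follow the symbols in the formula above) of the non-bonded energy of a single pair:

```python
import math

EPS0 = 8.8541878128e-12  # vacuum permittivity [F/m]

def nonbonded_pair_energy(r_ij, c12, c6, q_i, q_j, eps_r=1.0):
    """Van der Waals (Lennard-Jones) plus Coulomb energy of one atom pair
    at distance r_ij, following the last term of the interaction function V."""
    vdw = c12 / r_ij**12 - c6 / r_ij**6
    coulomb = (q_i * q_j) / (4.0 * math.pi * EPS0 * eps_r * r_ij)
    return vdw + coulomb
```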

A first serial version of Opal, Opal-2.6, was developed at the Institute of Molecular Biology and Biophysics at ETH Zürich [12]. It was written in standard FORTRAN-77 and optimized for vector supercomputers through a few vectorizable loops. In the serial code of Opal-2.6 a single processor runs the whole computation. Opal-2.6 spends most of the computing time during a simulation evaluating the non-bonded interactions over all pairs of atoms of the molecular system (the last term of the atomic interaction function V). Fortunately, these calculations also offer a high degree of parallelism in addition to the vectorizable inner loops.


The parallel version of Opal

The parallel version of Opal [17, 2] distributes its work among multiple processors in a client-server setting: multiple servers share the computation of the van der Waals and Coulomb forces while one client computes the few remaining interactions and coordinates the work. The computation repeats for every time step.

For a molecular complex of n atoms, the number of non-bonded interactions between atoms which must be evaluated is of the order of n². In the new version of Opal, this sequential complexity of the molecular energies evaluation is reduced by neglecting many of the non-bonded interactions in the molecular energy computation: only the pairs of atoms whose distance is less than a cut-off parameter are taken into account. At first, the data describing the non-bonding interaction parameters between the solute-solute, solute-solvent and solvent-solvent atom pairs are replicated on all the servers. This global information, whose volume depends on the problem size and does not scale with the number of processors, allows each server to work largely independently. With its data, each server runs its tasks of the simulation requesting no further parameters at each step from the client other than the atom coordinates.

A simulation proceeds by repeating the same computation tasks continuously. At the end of each step the information about the total energy, volume, pressure and temperature of the molecular complex is displayed. In the first stage of each simulation step, which we call the update phase, each server selects a distinct subset of the atom pairs, checks the distance between the atoms of each pair and adds the pair to its own list of all active pairs when the atoms are not beyond the given distance cut-off. In the second stage of the simulation step, the servers compute partial non-bonded energies (van der Waals energy and Coulomb energy) using the list of all active pairs. At the end of this step each server sends its partial results to the client, which gathers them and sums the total molecular energy of the molecular complex as well as its volume, pressure and temperature.

The data in each list are updated periodically. The interval between successive updates can be selected by the user through the setting of an Opal parameter called update. The value of the update parameter expresses the number of interaction steps after which the lists of all active pairs are updated.
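A minimal Python sketch of this two-phase step structure, with the update phase executed only on every update-th step (illustrative names only; the real Opal servers run vectorized FORTRAN and communicate through RPCs):

```python
import math

def build_active_pairs(coords, my_pairs, cutoff):
    """Update phase: recheck this server's subset of pairs and keep
    only those whose atoms lie within the cut-off distance."""
    return [(i, j) for (i, j) in my_pairs
            if math.dist(coords[i], coords[j]) <= cutoff]

def run_steps(coords, my_pairs, cutoff, steps, update, pair_energy):
    """Repeat the two phases: refresh the active-pair list every
    `update`-th step, evaluate partial energies on every step."""
    active, partials = [], []
    for step in range(steps):
        if step % update == 0:
            active = build_active_pairs(coords, my_pairs, cutoff)
        partials.append(sum(pair_energy(coords, i, j) for i, j in active))
    return partials  # one partial energy per step, gathered by the client
```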

The distribution of the atom pairs for the evaluation of the energies due to the non-bonded interactions is done using a pseudo-random strategy. Randomization should help to balance the workload among the servers and to avoid duplication of work.

Moreover, with a slight change of the molecular simulation model, i.e., the use of water molecules as single units centered in the oxygen atoms in the solvent instead of three individual atoms, we accomplished:

– a reduced workload of the servers,
– a reduction in size of the lists (memory usage),
– an increase in accuracy for the molecular energy calculations with small cut-off radii.

Alternative parallelizations for molecular simulations

The parallelization of the non-bonded pairwise energy computation through the distribution of the mass centers amongst several processors, as used for Opal, is not the only parallelization approach. There are three main approaches to the parallelization: the replicated-data (RD) method used for Opal, in which the mass centers (i.e. atoms) are distributed among the processors; the geometric- or space-decomposition (SD) method, in which each processor considers the mass centers in its sub-domain during the simulation; and the force-decomposition (FD) method, in which the force matrix F_ij (F_ij is the force on mass center i due to mass center j) is partitioned by blocks among the processors [15, 1].

Comparable packages in molecular biology

With its parallelization the parallel version of Opal has become similar to Amber [19]. Both codes allow the user to carry out energy minimization and molecular dynamics using the same analytical function (see 2.1). Moreover, both codes use molecular component interaction lists, which need periodic updates, and allow evaluating the interactions of each atom with the rest of the molecular complex within a cut-off distance. For the parallel version of Amber, a generalized MPI interface [7] is used for message-passing, while the parallel version of Opal relies on the PVM interface and the Sciddle RPC middleware package [4, 18]. Still both codes are explicitly parallel and well suited for distributed memory machines with a message passing API.

2.2 A time complexity model for Opal

During the design and parallelization of Opal we derived an analytical time complexity model that captures all essential parameters of the real application. The predicted outcome of the model is the execution time of Opal in seconds, written as a sum of several partial result variables which are computed separately and also measured separately during validation:

t_{OPAL} = t^{tot}_{comp} + t^{tot}_{comm} + t^{tot}_{sync}    (1)

t^{tot}_{comp}, the total parallel computation time, is the computation time spent by the servers servicing the requests for the computation. The servers run two routines as parallel work: the update routine that updates the lists of atom pairs, and the energy evaluation routine that evaluates the partial energies of the non-bonded interactions (van der Waals energy and Coulomb energy).

t^{tot}_{comp} = t_{update} + t_{nbint}    (2)

The computation time of the update routine always grows quadratically with the problem size because each time the servers update their own lists, all the pairs of atoms must be checked. At the same time, the update time decreases linearly with the increase of the time interval between two list updates.

t_{update}(n, \gamma) = \frac{s}{u} \, a_2 \, \frac{n^2}{2p}    (3)

where:

– s is the number of simulation steps.
– p is the number of servers on which the Opal application runs.
– u is the update parameter: the number of simulation steps between two successive list updates.
– n represents the number of mass centers (atoms and water molecules) of the whole molecular complex.
– γ (gamma) is the ratio of the number of water molecules to the total number of mass centers.
– a_2 represents the computation time spent to generate a pair of atoms and calculate the distance between them.

On the other hand, the time for the energy evaluation routine is subject to the effects of the cut-off parameter: the dimension of the lists over whose pairs the partial energies are evaluated increases drastically with the increase of the cut-off distance. The energy-evaluation time grows quadratically up to the number of atoms within the cut-off radius and linearly beyond that.

t_{nbint}(n, \tilde{n}) = \frac{s}{2p} \, a_3 \, n^2          when n ≤ ñ
t_{nbint}(n, \tilde{n}) = \frac{s}{2p} \, a_3 \, n \tilde{n}   when n > ñ    (4)

where a_3 represents the computation time spent to evaluate the non-bonded energies of one pair of the list, and ñ denotes the number of mass centers within the cut-off radius; ñ depends on the molecular complex volume as well as the cut-off parameter. For our simulations (see 2.5) the crossover happens for unrealistic numbers of water molecules or protein atoms, i.e. for sufficiently high values of n. Furthermore a reduction of the update frequency is possible to reduce the fraction of update computation arbitrarily and restore the relation of:

t_{nbint} ≫ t_{update}

We summarize the total parallel time as:

t^{tot}_{comp} = \frac{s}{2p} \, n^2 \left( a_3 + \frac{a_2}{u} \right)                  when n ≤ ñ    (5)

t^{tot}_{comp} = \frac{s}{2p} \, n \left( a_3 \tilde{n} + \frac{a_2}{u} n \right)        when n > ñ    (6)
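A small Python sketch of this computation-time model as reconstructed above (the constants a_2 and a_3 are assumed to come from the calibration measurements described later):

```python
def t_comp_total(s, p, u, n, n_tilde, a2, a3):
    """Total parallel computation time, eqs. (3)-(6): quadratic below
    the cut-off crossover n_tilde, linear in n above it."""
    if n <= n_tilde:
        return s / (2.0 * p) * n**2 * (a3 + a2 / u)          # eq. (5)
    return s / (2.0 * p) * n * (a3 * n_tilde + a2 / u * n)   # eq. (6)
```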

t^{tot}_{comm}, the total communication time, is the time spent by the communication processes between the client and the servers during the entire simulation. The client calls two different kinds of procedures (subroutines) that are run on the servers: the subroutine for list updates and the subroutine for energy evaluation. We enhanced the communication environment with some synchronization tools that allow us to separate the communication times properly from other computation and idle times, and therefore permit explaining all communication components precisely. More details about these synchronization tools and their underlying model are given in [17]. Thanks to this model the resulting communication time of the client's RPCs can be decomposed into:

t^{tot}_{comm} = \frac{s}{u} \left( t^{call}_{upd} + t^{return}_{upd} \right) + s \left( t^{call}_{nbi} + t^{return}_{nbi} \right)

with the call time of both RPCs determined by the transfer of the n atom coordinates:

t^{call}_{upd} = t^{call}_{nbi} = \frac{\alpha \, n}{a_1} + b_1    (7)

In addition to the quantities defined above we define:

– α (alpha) is the number of bytes used to represent the coordinates of a single atom.
– a_1 is the communication rate including the overhead in the communication environment (Sciddle and PVM).
– b_1 is the communication overhead, in seconds, used to transfer an empty block from the sender to the receiver.

For the update RPC, the client does not retrieve any data from the servers when they arrive at the end of the update routine: the client just waits for a result message which assures the end of the server tasks.

t^{return}_{upd} = b_1

For the energy evaluation RPC, each server returns only its few partial energy results, so the return time does not depend on n:

t^{return}_{nbi} = \frac{2\alpha}{a_1} + b_1

Summed over the s simulation steps and the p servers, these terms make up the total communication time t^{tot}_{comm}.
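Under the same reconstruction, the communication part of the model also reduces to a few lines; a Python sketch, taking a_1 as an effective rate in bytes per second:

```python
def t_rpc_call(n, alpha, a1, b1):
    # Ship n atom coordinates of alpha bytes each: payload over the
    # effective rate a1, plus the fixed per-message overhead b1.
    return alpha * n / a1 + b1

def t_comm_total(s, u, n, alpha, a1, b1):
    t_upd = t_rpc_call(n, alpha, a1, b1) + b1                 # empty return
    t_nbi = t_rpc_call(n, alpha, a1, b1) + 2 * alpha / a1 + b1
    return (s / u) * t_upd + s * t_nbi   # update RPCs only every u-th step
```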

t^{tot}_{sync}, the total synchronization time, is the sum of four terms, among them:

– t^{str}_{upd}, the total time to synchronize the client and the servers when the update routines finish,
– t^{str}_{nbi}, the total time to synchronize the client and the servers when the energy evaluation routines finish.

We assume that the four different synchronization times depend neither on the number of servers nor on the problem size. Our formulation just states that each term increases linearly in the number of simulation steps, and moreover the contribution of t^{str}_{upd} decreases as the update frequency decreases. We assume that each synchronization process takes a constant time b_5.

The factors of the experimental design used to calibrate the model are the number of servers, the problem size,

– the cut-off parameter (approximation properties), and
– the update frequency parameter (communication-computation balance).

For the calibration we consider one to seven servers; small, medium and large problems; full vs. partial updates; and two different cut-off radii - a small, effective one at 10 Å vs. a large, ineffective one at 60 Å. The response variables measured are the communication, parallel computation, sequential computation, idle and synchronization times, as listed in the composite formula. The experiments always run on a dedicated system and therefore there is no overhead on the measurements due to a time sharing environment. In a few preliminary tests, every measurement was repeated several times. The tests confirmed a low variability and a good reproducibility of the execution times, and we therefore concluded that ten simulation steps suffice to assure an accurate and meaningful timing of an entire simulation of the protein folding process, which we call a single experiment or case in this paper.
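The case grid behind these experiments is easy to enumerate; a Python sketch with the factor levels taken from the text (7 server counts x 3 problem sizes x 2 cut-off radii x 2 update settings, yielding the 84 cases of the full factorial design):

```python
from itertools import product

servers  = range(1, 8)              # one to seven servers
problems = ["small", "medium", "large"]
cutoffs  = [10.0, 60.0]             # Angstrom: effective vs. ineffective
updates  = [1, 10]                  # full update vs. partial update

cases = list(product(servers, problems, cutoffs, updates))
assert len(cases) == 84             # the full factorial design of 84 experiments
```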

2.4 Understanding the execution of Opal with our model

We investigate the performance of the Opal code by measurements of the simulation execution times of two molecular complexes with different sizes: the parallel computation time, the sequential computation time, the communication time, the synchronization time, and the idle time. We measure the detailed breakdown of the wall clock execution time for ten simulation steps.

The first molecular complex is a medium size example of the simulation problems that Opal can handle: it is the complex between the Antennapedia homeodomain from Drosophila and DNA [8], composed of 1575 atoms and immersed in 2714 water molecules, for a total of 4289 mass centers (medium problem size). The second molecular complex is considered to be a large size problem: it is the NMR structure of the LFB homeodomain, composed of 1655 atoms and immersed in 4634 water molecules, a total of 6289 mass centers (large problem size).

We run the code for different levels of parallelism: the number of servers ranges from one to seven. At the same time we measure the execution times when the simulation is fully accurate and the computation complexity is quadratic in the protein size (i.e. no cut-off) and when the simulation is approximate and consequently the computation complexity becomes linear (i.e. with cut-off). Finally, we investigate the role of the list updates: we run the simulation either with an update of our lists upon every iteration (full update) or with a partial update every 10 iterations (partial update). The comparison of the different cases permits studying the scalability, i.e., the execution times as they depend on increasing parallelism, and investigating the precise impact of the different problem sizes, the frequency of the list updates and the values of the cut-off parameter on the performance of the simulation.

Figures 1a)-d) display a detailed breakdown of the wall clock execution time for 10 simulation steps in the medium size molecular complex with different choices for the number of servers, the cut-off and the update parameters. The chart in Figure 1a) shows that without cut-off, the time in parallel computation is the largest fraction of the execution time and that it decreases as expected when more servers are added. At the same time the communication time increases about linearly with the number of servers, but its overall contribution remains small, even for seven servers. The synchronization time and the sequential computation time remain insignificant to the overall execution time. However, to the surprise of the Opal implementors, our instrumentation reveals a load balancing problem for runs with even numbers of processors. Figure 1b) shows an Opal execution with reduced updates. As expected, the lower update frequency does not affect the overall performance of simulations much because the large amount of parallel computation dominates the execution time. Figure 1c) shows a simulation with an effective cut-off parameter (at 10 Å). The cut-off parameter determines the asymptotic computational complexity: the amount of the parallel computation is smaller than in the cases above and its overall contribution becomes comparable to the other measured execution times. The sequential computation time, the synchronization time and the communication time gain a high importance for the overall performance. Figure 1d) displays a run with both the cut-off and the partial update option in effect. The frequency of the list updates leads to a notable difference in the performance of simulations with small cut-off radii. The problem size itself, the number of atoms of the whole molecular complex, has a varying impact on the different components of the execution time. The time components - the parallel computation time, the communication time, the idle time, the sequential computation time and the synchronization time - each increase in a different way with the number of atoms: while the size of the problem has a super-linear impact on the parallel computation time and the idle time, it has only a linear, moderate impact on the sequential computation time and the communication time. Figures 2a)-d) show the detailed breakdown of the wall clock execution time for 10 simulation steps in the large size molecular complex, where the order of the measured execution times doubles as we increase the problem size from the medium to the large molecular complex.


[Figure 1: Execution time of Opal on the Cray J90 for 10 simulation steps of the medium size molecular complex, over 1-7 servers: (a) no cut-off, full update; (b) no cut-off, partial update; (c) with cut-off, full update; (d) with cut-off, partial update. Time components: parallel comp. time, scalar comp. time, idle time, communication time, synchronization time.]

[Figure 2: Detailed breakdown of the measured execution times for 10 iterations of an Opal simulation of the large molecule, without cut-off parameter (i.e. for each atom all interactions are considered) versus with a cut-off parameter (i.e. only the interactions within the range of 10 Å are considered).]

We computed the outcome of the analytical model for each one of these cases and performed a least-squares comparison of the wall clock times measured for the execution on a Cray J90 SMP against the times predicted by the analytical model for the same machine, with different numbers of servers, different cut-off radii, update frequencies, and large or medium molecular complexes. For brevity of the paper we list only the data of a reduced design, although the data was achieved with a full factorial design of 84 experiments. During the calibration the differences between model and measurement have been investigated and plotted for each case. The overall fit of the model to the measurement for the cases in Figures 4a)-d) and for the remaining cases is excellent. The full data is listed in [17].

[Figure 3: Parameter space model of the Opal molecular simulation: problem size (small, medium, large) x cut-off radius (effective at 10 Å vs. 60 Å) x update frequency (full update (1) vs. partial update (0.1)). Legend: cases with calibration data shown in the paper; calibration data for the remaining cases available in the extended report.]

2.6 Space complexity of Opal

Execution time is not the only complexity measure that can be treated in this manner. A space complexity model for the memory usage is largely orthogonal to the execution time model.

Memory management remains a highly complex issue in the parallelization. The parallel Opal has been designed to use memory in the most economical way: each server holds only a part of the non-bonded interaction pairs of atoms. The dimension of the data lists on each server scales down linearly with the number of processors.

On the other hand, each server needs the same global data (information about the solute-solute, solute-solvent, and solvent-solvent non-bonded interactions) whose amount depends on the problem size, i.e., the number of water and molecular atoms. This global information is constant and does not scale up or down with the number of processors. The computation of these global data structures on each server involves a duplication of work but saves some communication. Furthermore, this computation of these constants takes place at the beginning of the simulation, and its cost is amortized over all time steps of the entire simulation. Once the global data is initialized, each server can execute its computation largely independently, because it evaluates its partial non-bonded interaction terms without requesting further parameters from the client except for some updated values of the atom coordinates. The size of the data structures in Opal grows with the problem size as shown in the table below:

Data structure    | Order        | Constant c [Bytes] | Large example, 6290 mass centers [Bytes]
pair list         | c n²/2       | 2*4                | 160'000'000
atom coordinates  | c (1+2γ) n   | 3*8                | 400'000
atom gradients    | c (1+2γ) n   | 3*8                | 400'000
atom interactions | c (1-γ)² n²  | 2*8                | 40'000'000
energy values     | c            | 2*8                | 16
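The large-example column can be checked directly from the orders and constants; a short Python sketch (γ taken as the water fraction of the large problem, 4634 water molecules out of 6289 mass centers):

```python
n = 6290                      # mass centers, large example
gamma = 4634 / 6289           # water fraction of the large problem

pair_list    = (2 * 4) * n**2 / 2              # ~160 MByte
coords       = (3 * 8) * (1 + 2 * gamma) * n   # ~0.4 MByte (3 coords/atom)
gradients    = (3 * 8) * (1 + 2 * gamma) * n   # ~0.4 MByte
interactions = (2 * 8) * ((1 - gamma) * n)**2  # ~40 MByte (solute atom pairs)

for name, size in [("pair list", pair_list), ("coordinates", coords),
                   ("gradients", gradients), ("interactions", interactions)]:
    print(f"{name:13s} {size / 1e6:10.1f} MByte")
```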

A more accurate space complexity model would only be useful if some interesting tradeoffs between space and time complexity could be identified. We did not find any interesting time-space tradeoffs, except for the obvious size of the working sets that influence execution speed through effects of the memory hierarchy, like the swapping of real physical memory (DRAM) for large virtual memories and the effects of the two levels of caches.

We ran some tests with the single processor version of Opal on our Pentium PC platforms to investigate the computational performance of the most significant loop (the computation of the non-bonded energies) for different working set sizes.

[Figure 4: The difference between the wall clock times measured and the times predicted by the analytical model, for 1-7 servers on the Cray J90: (a) large molecule, no cut-off, full update; (b) large molecule, no cut-off, partial update; (c) large molecule, with cut-off, full update; (d) medium problem, with cut-off, partial update.]

The absolute and relative computational performance based on the different working sets is stated in the subsequent table:

Working Set  | Size      | Computational Rate on Pentium 200 [MFlop/s] | Relative
in cache     | 50 KByte  | 35 | 1.09
in core      | 8 MByte   | 32 | 1.00
out of core  | 120 MByte | 8  | 0.25
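A measurement of this kind is straightforward to reproduce; a Python sketch that times a SAXPY-like kernel over growing working sets to expose the cache and main-memory boundaries (sizes illustrative; the out-of-core regime additionally needs a memory-starved machine to show paging effects):

```python
import time
import numpy as np

for mbytes in (0.05, 8, 120):             # in cache / in core / out of core
    n = int(mbytes * 1e6 / 8)             # number of 8-byte floats
    x, y = np.random.rand(n), np.random.rand(n)
    reps = max(1, int(2e7 / n))           # keep total work roughly constant
    t0 = time.perf_counter()
    for _ in range(reps):
        y += 2.0 * x                      # SAXPY-like: 2 flops per element
    dt = time.perf_counter() - t0
    print(f"{mbytes:6g} MByte: {2 * n * reps / dt / 1e6:8.1f} MFlop/s")
```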

After a few trial runs with different memory configurations, it appeared to us that blocking Opal for the caches is not beneficial for enhancing the performance. It seems that the inner loop of Opal remains CPU limited instead of memory limited. This observation is also confirmed by the nearly doubled performance for twin processor PC nodes (see upcoming sections). The performance breakdown for the out of core version of Opal is so drastic that such problem sizes would push the execution time immediately beyond the limit of an acceptable turnaround for one simulation.


On most systems we could use the hardware performance monitoring facilities to account more accurately for cache misses and strongly confirm the validity of our experiments above. Again these observations emphasized the need to instrument middleware for performance monitoring right from the beginning of an application's life cycle, i.e. as soon as the original application code is designed, implemented and parallelized. The authors of this paper can think of a few suitable blocking code transformations that would enhance locality for a better use of the caches, but we stopped our investigation after the trials mentioned, since we are no longer in control of the Opal production code distribution and therefore see no path to ship our improvements into the real world.

3 Integrating performance instrumentation with application design

3.1 The crux of overheads and loss of transparency due to middleware

The client-server structure of parallel Opal is ideally suited for Sciddle [4], a remote procedure call (RPC) system extension to the PVM communication library. Sciddle comprises a stub generator (the Sciddle compiler) and a run-time library. The stub generator reads the remote interface specification, i.e., the description of the subroutines exported by the servers, and generates the corresponding communication stubs. The stubs take care of translating an RPC into the necessary PVM message passing primitives. The application does not need to use PVM directly for RPCs: the client simply calls a subroutine (provided by the client stub), and the Sciddle run-time system invokes the corresponding server subroutine (via the server stub). It was, however, a deliberate decision in Sciddle to let the application writers code the process management (starting and terminating of servers) directly with PVM calls. Therefore a Sciddle application still needs to use a few PVM calls at the beginning and the end of a run.

Why use middleware like Sciddle

Sciddle is a highly portable communication library. It has been ported to Linux PCs, UNIX workstations, the Intel Paragon, and supercomputers like the Cray J90 and the NEC SX-4. In particular, Sciddle supports both PVM systems available for the Cray J90 SMPs, the network PVM and the shared memory PVM. Based on our experiences the Sciddle/PVM combination might sound like a very suboptimal solution for a single J90 Classic system, which also supports coherent shared memory well within a single system. However, at the time Opal was in development, our site was operating four Cray J90s interconnected by HIPPI, and the developers certainly had plans to use their parallel Opal version on a cluster of four J90 SMPs with 48 processors total. For such a cluster of SMPs message passing is a must and shared memory would not do. For most application codes, the additional overhead of Sciddle is very small [3], but Sciddle causes a lack of control over the PVM option flags and PVM internal operations (i.e. the proper use of data in place, shared memory flags). With a specific synthetic RPC test, Sciddle runs communication at about 7 MByte/s, which is just about as much as the Sciddle developers got out of a PVM ping-pong on the J90 [3]. Therefore we attribute the disastrously low communication performance of the J90 (an SMP machine with a fast crossbar) to the unpredictable performance of the Cray PVM implementation and the unfortunate interaction between middleware and communication library.

We understand that due to its internal architecture and its API, PVM is far away from a zero-copy message passing system. Therefore we suggested to the developers that Opal be rewritten in a clean "post-in-advance" style of MPI programming.

3.2 A plea for including hardware performance instrumentation into middleware

The tasking facilities of PVM and Sciddle interfere with the normal use of performance monitoring tools such as the HPM command on the Cray J90 systems or the corresponding tools on the Cray T3E MPP or Intel PC platforms. We worked with the implementors of Sciddle to integrate queries to the low overhead counter device (e.g. /dev/hpm) into the Sciddle code and to undertake the necessary accounting of the number of floating point operations executed and of the clock cycles used in the client and in the servers. In a setting with a high level abstraction RPC model, the performance instrumentation must adhere to the same high level abstractions and therefore be integrated into the middleware as well as the application code. The question of a good API standard for performance monitoring instrumentation is still an open one. Debuggers pose a very similar software engineering problem for the parallel programming world.
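The integration amounts to sampling the counters around each stub invocation. A Python sketch of the idea, with a hypothetical read_hw_counters() standing in for the platform-specific /dev/hpm queries (the real hooks live inside the generated Sciddle stubs):

```python
import functools

def read_hw_counters():
    """Hypothetical stand-in for a low overhead counter device such as
    /dev/hpm: returns (floating point operations, clock cycles)."""
    raise NotImplementedError("platform specific counter access")

def instrumented(stub):
    """Wrap an RPC stub so that flops and cycles are accounted per call."""
    @functools.wraps(stub)
    def wrapper(*args, **kwargs):
        f0, c0 = read_hw_counters()
        result = stub(*args, **kwargs)
        f1, c1 = read_hw_counters()
        wrapper.flops += f1 - f0
        wrapper.cycles += c1 - c0
        return result
    wrapper.flops = wrapper.cycles = 0
    return wrapper
```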


Sampling based tools give a direct estimate of the computation rate in MFlop/s and are easy to use, but they are extremely complex to understand in sufficient depth. Sampled computation rates are no substitute for the simple ratio of operations counted divided by the cycles used. The characteristic performance of the different machines for Opal runs in Table 1 in Section 4 shows how difficult it is to measure MFlop counts. The number of floating point operations required to compute exactly the same application results differs significantly, because of vectorizing transformations and the different implementations of intrinsic functions like sqrt() and exponentiate(). With a stand-alone performance monitoring tool we would just have believed the measured MFlop/s figures and possibly never learned of that fact, but just wondered why the MFlop/s numbers did not make much sense.

3.3 The disadvantage of overlapped communication and computation

There is no doubt that the Sciddle package accelerated the development of the application and that its additional overhead stays well within acceptable limits compared to the PVM message passing system. However, packages like Sciddle support and encourage the overlap of computation and communication, preventing a detailed quantification and correct accounting of the elapsed time for local computation, communication and idle waits due to load imbalance. In the parallel programming framework Sciddle was conceived for, it might be easy to measure and accumulate high level metrics like the server computation rate or the client computation rate for the entire application program, but low level indicators like communication efficiency, idle times, and load imbalance of single parts are much harder to get. The latter metrics are more relevant in the performance analysis. To overcome the difficulties of measuring and quantifying overheads in Sciddle, we propose a modification of the timing and synchronization behavior. The Sciddle environment does not provide explicit synchronization tools, but it allows a direct communication with the underlying PVM environment and therefore permits explicit synchronization. We introduce additional PVM barriers to separate the communication clearly from the computation: with these changes to the Sciddle communication environment, it is possible to measure or compute all these metrics directly. The barrier function lets the servers synchronize themselves explicitly with each other.
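A sketch of the resulting measurement discipline, using MPI barriers (mpi4py) as a stand-in for the PVM barrier calls we added to the Sciddle environment:

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD

def timed_phase(work):
    """Separate pure computation time from idle/synchronization time by
    synchronizing all processes before and after the parallel work."""
    comm.Barrier()                  # common start line for all servers
    t0 = MPI.Wtime()
    work()                          # local computation only
    t1 = MPI.Wtime()
    comm.Barrier()                  # load imbalance shows up as waiting here
    t2 = MPI.Wtime()
    return t1 - t0, t2 - t1         # (compute time, idle/sync time)
```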

Many papers have been written to show how to eliminate barriers and permit more overlap of communication and computation, but the potential benefit of overlap is often overestimated because of memory system bottlenecks in most machines. For the optimal accounting of times among the client and the servers we are forced to give up some of the overlap. To us, the accuracy, predictability and tight control of performance appears more important, and we happily accept a small slowdown (less than 5%) over the overlapped application for the sake of a solid understanding of what is going on with performance in the code.

The use of a shared communication channel between servers and client introduces contention due to limited resources at the end of a client compute phase. This contention is application specific and it is likely to surface with the original Sciddle implementation if all the servers perform exactly the same amount of work (i.e. when there are no server idle time values). The barriers in the modified Sciddle framework do not actually cause this effect, but merely expose this contention of the communication between the single client and multiple servers in all cases.

4 Performance Prediction for Alternative Platforms

In this last part of our paper, we use our analytic model together with some standard performance data of alternative computer platforms to predict the performance of Opal in the case that we could port the code to that platform. Two different classes of MPPs (Massively Parallel Multi-Processors) are considered for our study in addition to the real Opal platform, the Cray J90: first, the Cray T3E, a "big iron" MPP, and second, three different flavors of PC clusters called slow CoPs (Clusters of PCs), SMP CoPs and fast CoPs. We named the one PC cluster slow CoPs since it is optimized for lowest cost and gains its performance from a large number of slower nodes; weakly connected with a shared 100BaseT Ethernet medium, its uniprocessors are some older Intel Pentium Pro PCs running at 200 MHz. The SMP CoPs platform is based on similar Intel Pentium Pro processors, but in a twin processor configuration (2x200 MHz) and interconnected by an improved SCI shared memory interconnect technology. Finally the fast CoPs cluster features single 400 MHz Intel Pentium Pro PCs as nodes, connected by a Gigabit/s communication system based on fully switched Myrinet interconnects. Comparable Clusters of PCs installations are described in [5, 6, 16].


4.1 Extraction of model parameters for alternatives

As shown in Section 2.2, the parameters of our analytic model have been intentionally chosen in a way that includes all major technical data usually published for parallel machines. This includes among others: the message overhead, the message throughput for large messages, the computation rate for SAXPY inner loops and the time to synchronize all participating processors. For each new platform we determine the key parameters by the execution of a few micro-benchmarks, verified against published performance figures [9]. An overview of the data used is found in Tables 1 and 2.
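The two communication parameters of the model fall out of a standard ping-pong micro-benchmark; a Python sketch (mpi4py, run with exactly two processes) that estimates b_1 from a tiny message and a_1 from a large one:

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

def one_way_time(nbytes, reps=100):
    buf = np.zeros(nbytes, dtype=np.uint8)
    comm.Barrier()
    t0 = MPI.Wtime()
    for _ in range(reps):
        if rank == 0:
            comm.Send(buf, dest=1)
            comm.Recv(buf, source=1)
        else:
            comm.Recv(buf, source=0)
            comm.Send(buf, dest=0)
    return (MPI.Wtime() - t0) / (2 * reps)

b1 = one_way_time(1)                     # overhead dominated
t_large = one_way_time(4_000_000)        # bandwidth dominated
a1 = 4_000_000 / (t_large - b1)          # effective rate in bytes/s
if rank == 0:
    print(f"b1 = {b1 * 1e6:.1f} usec, a1 = {a1 / 1e6:.1f} MByte/s")
```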

Table 1: Execution time and floating point performance of the dominant Opal routine on a single node of each platform.

MPP Type                   | Time [s] | Floating Point Rate on single node [MFlop/s] | Relative Computation Rate (SMP CoPs = 100)
Cray T3E-900 (450 MHz)     | 9.56     | 85   | 52
Cray J90 Classic (100 MHz) | 6.18     | 80^a | 80
Slow CoPs (200 MHz)        | 10.00    | 32   | 50
SMP CoPs (2*200 MHz)       | 5.00     | 65   | 100
Fast CoPs (400 MHz)        | 4.85     | 67   | 102

^a For fall 98 our J90 Classics are scheduled for an upgrade to the new vector processors that significantly enhance the computational throughput. The performance improvement over the classic processor is expected to be six-fold.

[Table 2: Communication performance of the candidate platforms on a single node: observed throughput in MByte/s and message overhead for the Cray T3E-900 (MPI), the Cray J90 (PVM), the Slow CoPs (Ethernet), the SMP CoPs (SCI) and the Fast CoPs (Myrinet). The recoverable entries include an observed rate of about 3 MByte/s on the J90, observed rates of 30-100 MByte/s for the T3E and the Myrinet cluster, and message overheads ranging from 25 µsec to 10 msec on the shared Ethernet.]

As for many scientific codes, one routine of Opal dominates the compute performance. This routine has been benchmarked on each platform using the most accurate cycle counters and floating point performance monitoring hardware that is actually present on all four machine types. The most important surprise has been a significant difference in floating point operations for the different platforms, although the arithmetic was 64 bit in all cases and the results were precisely identical (or within the floating point epsilon for comparisons between Cray and IEEE arithmetic). The differences are due to the different compilers and different runtime libraries with intrinsic functions. We eliminate this difference by assuming that the best compiler (i.e. the PGI compiler for the PCs [10]) is setting a lower bound for the computation: we adjust the local computation rates (MFlop/s) of the other platforms accordingly.

The communication performance is even more difficult to compare in real applications. Some unfortunate interactions between the middleware and the PVM library reduce the measured communication rate on the Cray J90 processor to about 3 MByte/s, despite its more than one GByte/s strong crossbar interconnect between the 8 processor boards and the memory banks. The authors of the Sciddle middleware claim that they measured up to 7 MByte/s for a synthetic Sciddle RPC example and that this rate matches just about the performance of raw PVM 3.0 on the same machine [3]. It certainly remains below what this machine is capable of in shared memory mode.

We suspect that with the right configuration of PVM flags, or at least with a rewrite of the middleware to use MPI in true zero copy mode, we could significantly improve the performance of Opal on the J90, but such work is outside the scope of this performance study. For the other platforms we assumed an MPI or PVM based re-implementation without Sciddle and deduced our performance numbers mainly from MPI micro-benchmarks gathered by our students and from similar numbers published by independent researchers on the Internet (e.g. [9]).

4.2 Discussion

The complexity model incorporates the key technical data of most parallel machines as parameters. Therefore it is well suited for performance prediction.
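A sketch of such a prediction with the calibrated model (Python; the platform constants below are purely illustrative placeholders, the real values come from Tables 1 and 2 and from the calibration in Section 2):

```python
# Illustrative platform parameters: adjusted compute rate [MFlop/s],
# communication rate a1 [bytes/s] and message overhead b1 [s].
platforms = {
    "Cray J90":  {"mflops": 80,  "a1": 3e6,  "b1": 1e-3},
    "Slow CoPs": {"mflops": 50,  "a1": 10e6, "b1": 1e-3},
    "Fast CoPs": {"mflops": 102, "a1": 30e6, "b1": 25e-6},
}

def predict(plat, s, p, u, n, n_tilde, flops_per_pair, alpha=24):
    a3 = flops_per_pair / (plat["mflops"] * 1e6)      # seconds per pair
    comp = s / (2.0 * p) * n * a3 * min(n, n_tilde)   # eqs. (5)/(6), a2 term dropped
    comm = s * (1 + 1.0 / u) * (alpha * n / plat["a1"] + plat["b1"])
    return comp + comm

for name, plat in platforms.items():
    t = predict(plat, s=10, p=7, u=10, n=4289, n_tilde=1000, flops_per_pair=50)
    print(f"{name:9s} {t:8.2f} s")
```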

In the first two graphs of Figures 5a)-d) we look at the predicted execution times for 10 Opal iterations with a medium size molecule. The platforms include the Cray T3E MPP, the Cray J90 vector SMP (reference) and three clusters of PCs (fast CoPs, SMP CoPs and slow CoPs). Since we also list the absolute execution time in seconds, we can directly compare the performance of all platforms when 1-7 processors are used in Charts 5a) and 5c). The success or failure of the parallelization of the Opal code becomes most evident when we plot a relative speed-up with 1-7 processors in the Charts 5b) and 5d). The well specified synchronization model guarantees that we are not subject to the pitfalls of a badly chosen uni-processor implementation.

In the upper Charts 5a) and 5b) the cut-off radius is too large to reduce the computation and therefore the runs are largely compute bound. The execution time reflects the different compute performance of the different node processors, with a slight edge for the SMP CoPs architecture which diminishes as the number of processors increases. An entirely compute bound operation inevitably leads to excellent speedup, as seen in Chart 5b).

The main users of Opal in molecular biology assured us repeatedly that for certain problems a simulation with a 10 Å cut-off parameter is accurate enough to give new insights into the proteins studied. Therefore we ran the second test case of the same molecule with a computation reduced by cut-off. In the lower two Charts 5c) and 5d) the computation is accelerated with an effective cut-off parameter and therefore gradually becomes communication bound as the parallelism increases. In this case the communication performance of the machine does matter a lot. The Cray J90 and the slow CoPs (Ethernet) Cluster of PCs are severely limited by their slow communication hardware or by their bad software infrastructure for message passing. This is visible in the predicted execution times: as soon as the number of processors increases and exceeds the value of three, the overall execution time of the application on the Cray J90 and the slow CoPs (200 MHz with Ethernet) is no longer decreasing but rather increasing. The increase of the communication time offsets any gain due to parallel execution and leads to an overall loss of performance for a larger number of nodes. This aspect is displayed by some speed-up curves in Chart 5d) which actually turn into slow-down curves when too many nodes are added. For these two architectures we achieve no benefit in putting more than three processors to work.

For a small number of processors the SMP CoPs and fast CoPs architectures start out with a better execution time than the big MPP and vector SMP irons, possibly due to the better compiler. However, with the increase of the number of processors the speed of the Cray T3E MPP catches up quite rapidly due to the better communication system. This trend is also evident in the speed-up curves where the Cray T3E architecture achieves better gain and almost ideal


speed-up. For all platforms with a good communication system we can scale the application nicely to 7 processors with a speed-up of 4 or greater.

[Figure 5: Predicted execution time and relative speed-up for an Opal simulation of a medium problem size molecule on the different platforms (Cray T3E, Cray J90, Fast CoPs, SMP CoPs, Slow CoPs), for 1-7 servers: (a) execution time, no cut-off, full update; (b) speed-up, no cut-off, full update; (c) execution time, with cut-off, full update; (d) speed-up, with cut-off, full update.]

As we can see in the two Graphs 5c) and d), speed-up curves cannot be interpreted properly without looking at the absolute execution times simultaneously; while the Cray T3E MPP has by far the best speed-up, it still ends up behind the fast CoPs and SMP CoPs when comparing absolute performance for seven servers.

The same performance scalability relationships are reflected in Figures 6a)-d) for a large size problem. The charts show predicted execution times and speed-ups for a large problem. A comparison between the Charts 6a)-d) and 5a)-d) shows that the behavior of the execution time remains quite similar to the medium size problem. At the same time we notice that the increase of the amount of computation for a large size problem leads to slightly better speed-ups in Chart 6b). Still both charts indicate flat speed-up for more processors due to overhead in the communication systems. In Chart 6d) we do not have the extreme slowdown seen in Chart 5d), but we can conclude that the increase of the amount of computation has just pushed the point of the breakdown further outwards on the curve. With a larger number of processors we would probably encounter the same saturation point at which adding processors would stop increasing performance.


[Figure 6: Predicted execution time and relative speed-up for an Opal simulation of a large problem size molecule on the different platforms (Cray T3E, Cray J90, Fast CoPs, SMP CoPs, Slow CoPs), for 1-7 servers: (a) execution time, no cut-off, full update; (b) speed-up, no cut-off, full update; (c) execution time, with cut-off, full update; (d) speed-up, with cut-off, full update.]

5 Conclusion

Our case study of Opal showed common problems with the performance instrumentation in an application setting with RPC middleware for parallelization and PVM communication libraries. Some middleware had to be instrumented with hooks for performance monitoring, and the overlap of communication and computation had to be restricted slightly for a reliable accounting of execution times. We can state three potential benefits of the integrated approach to accurate performance evaluation, modeling and prediction in parallel programming: firstly, the analytic complexity model and a careful instrumentation for performance monitoring lead to a much better understanding of the resource demands of a parallel application. We realize that the basic application without cut-off is entirely compute bound and therefore parallelizes well regardless of the system. The optimization with an approximation algorithm using an effective cut-off radius changes the characteristics of the code into a communication critical application that requires a strong memory and communication system for good parallelization. Secondly, we discovered interesting anomalies in the implementation, e.g. the load imbalance for even numbers of servers and the differing number of floating point operations for different processors. Thirdly, we can use our model to predict with good certainty how the application would run on slow CoPs, SMP CoPs and fast CoPs, three low cost Cluster of PCs platforms connected by Gigabit networks like SCI or Myrinet. The migration of the Opal simulation code to the cluster of PCs platform could potentially free our upgraded Cray J90 SMP vector machines for more complex and memory intensive computations with less regularity. The predicted execution times and speedup figures indicate that a well designed cluster of PCs achieves similar, if not better performance than the J90 Classic vector processors currently used for Opal and that the computational efficiency compares favorably even to the T3E-900 for this particular application code.

Acknowledgments

We would like to express our thanks to all the people who helped us during this work. We are very grateful to Peter Arbenz, Walter Gander, Hans Peter Lüthi, and Urs von Matt, who created Sciddle to parallelize Opal, for their help, and particularly to Peter Arbenz and Urs von Matt for reading carefully through several drafts of our work. We sincerely thank Martin Billeter, Peter Güntert, Peter Luginbühl and Kurt Wüthrich, who created Opal, and particularly Peter Güntert for his help and his chemistry advice. We thank Carol Beaty of SGI/CRI and Bruno Löpfe of the ETH Rechenzentrum, who helped with our many questions about the Cray J90 and Cray PVM. We are also very grateful to Nick Nystrom and Sergiu Sanielevici of the Pittsburgh Supercomputer Center, who sponsored our parameter extraction runs for the performance prediction of the Cray T3E-900.

References

[1] P. M. Alsing. N-body problem: Force decomposition method. 1995. http://www.phys.unm.edu/phys500/lecture4/forcedecomp docs

[12] P. Luginbühl, P. Güntert, and M. Billeter. OPAL: User's Manual Version 2.2. ETH Zürich, Institut für Molekularbiologie und Biophysik, Zürich, Switzerland, 1995.

[13] P. Luginbühl, P. Güntert, M. Billeter, and K. Wüthrich. The new program OPAL for molecular dynamics simulations and energy refinements of biological macromolecules. J. Biomol. NMR, 1996. ETH-BIB P 820203.

[14] D. O'Hallaron, J. Shewchuk, and T. Gross. Architectural implications of a family of irregular applications. In Proc. 4th Symp. on High Performance Computer Architecture, Las Vegas, Jan 1998. IEEE. An extended version appeared as Technical Report CMU-CS-97-198, Carnegie Mellon School of Computer Science.

[15] S. Plimpton and B. Hendrickson. A new parallel method for molecular dynamics simulation of macromolecular systems. Sandia Technical Report SAND94-1862, 1994.

[16] Sobalvarro, Pakin, Chien, and Weihl. Dynamic coscheduling on workstation clusters. Proceedings of the International Parallel Processing Symposium (IPPS'98), March 30 - April 3, 1998.

[17] M. Taufer. Parallelization of the software package Opal for the simulation of molecular dynamics. Technical report, Swiss Federal Institute of Technology, Zurich, 1996.

[18] U. von Matt. Sciddle 4.0: User's guide. Technical report, Swiss Center for Scientific Computing, Zurich, 1996.

[19] P. K. Weiner and P. A. Kollman. AMBER: Assisted model building with energy refinement. A general program for modeling molecules and their interactions. J. Comp. Chem., (2), 1981.

Author biographies

Michela Taufer received her bachelors and masters degrees in computer science engineering from the University of Padua, Italy in 1996. She is currently a doctoral student at the Swiss Federal Institute of Technology (ETH) in Zürich, Switzerland and is working on high performance computing and database applications for clusters of PCs.

Thomas Stricker is currently an assistant professor of computer science at the Swiss Federal Institute of Technology (ETH) in Zürich. His research group is investigating architectures and applications of clusters of PCs that are interconnected with gigabit interconnect technologies. Thomas Stricker attended Carnegie Mellon University in Pittsburgh, USA for his Ph.D. studies, where he participated in several large systems building projects including the construction of the iWarp parallel machines. He also holds undergraduate degrees from ETH in Zürich and is a member of the ACM SIGARCH, SIGCOMM and the IEEE Computer Society.

