
Accurate Performance Evaluation, Modeling and Prediction of a Molecular Simulation coded with Message Passing Middleware

Appears in: Proceedings of Supercomputing 98, IEEE/ACM Conference on Supercomputing, 7-13 Nov 1998, Orlando, FL, USA


Michela Taufer and Thomas Stricker
Laboratory for Computer Systems
Swiss Institute of Technology (ETH)
CH-8092 Zuerich, Switzerland
taufer@inf.ethz.ch, tomstr@acm.org

Abstract

In distributed and vectorized computing there is a large number of highly different supercomputing platforms an application could run on. Therefore most traditional parallel codes are ill equipped to collect data about their resource usage or their behavior at runtime, and the corresponding data are rarely published. Only few computational scientists explore the interactions of their target platforms with their applications systematically. As an improvement over the current state of the art, we propose an integrated approach to performance evaluation, modeling and prediction for different platforms. Our approach relies on a combination of analytical modeling and systematic experimentation with full application runs, application kernels and some benchmarks. We outline our methodology of performance assessment with Opal, an example code in molecular biology, developed at our institution to run in parallel on Cray J90 "Classic" vector SMPs. Besides a detailed assessment of the performance achieved on these J90s, the primary goal of our study was to find the more suitable and potentially more cost effective hardware platform for the application, in particular to check the suitability of this application for slow CoPs, SMP CoPs and fast CoPs, three flavors of Clusters of PCs built with off-the-shelf microprocessors at our computer systems laboratory. The performance assessment based on our model is much easier than porting and parallelizing the application for a new target machine, and so we could also obtain and include performance estimates for a T3E-900, a high end MPP system. The predicted execution times and speedup figures indicate that a well designed cluster of PCs achieves similar if not better performance than the J90 vector processors currently used and that the computational efficiency compares favorably to the T3E-900 for that particular application code.

1 Introduction

A large variety of different parallel computing platforms makes it quite difficult to pick the best suited and most cost effective parallel computer as a target platform for a new application code to run on. The question of the best platform for an application should be addressed early in the design process, as soon as the characteristics of the application code become known during early prototyping. Typically the scientists in the application field are only interested in the results of the computation itself and rarely in how fast and how efficiently they were computed. Therefore most traditional parallel codes are ill equipped to collect data about their resource usage or their behavior at runtime, and the corresponding data are rarely published. There are a few exceptions to this rule - like [14], a paper devoted entirely to the characteristics of a large FEM application.

Most machines in high performance computing and even today's PCs have good hardware instrumentation to collect all the necessary data, but most system vendors don't promote direct access to them. Instead they provide high-level performance tuning and advisory tools with little information about their resolution, their accuracy and their theory of operation. Furthermore those tools often interfere with the tasking support of the parallelization and communication tools (i.e. the middleware for parallelization). Other common problems include the client/server paradigm with full overlap of computation and communication and many latency hiding mechanisms, which make accurate and detailed performance measurements almost impossible.


As an improvement over the current state of the art, we propose and demonstrate an integrated approach to application design, parallelization and performance analysis using a combination of analytical modeling and measurements. We studied this integration with Opal [13], an example code in molecular biology, developed at our institution to run on our four Cray J90 "Classic" vector SMPs, with 8-16 processors each. Besides a detailed assessment of the performance achieved on the J90s, the primary goal of our study was to find the most suitable and most cost effective hardware platform for the application, in particular to check its suitability for slow CoPs, SMP CoPs and fast CoPs, three flavors of Clusters of Pentium PCs built by our computer architecture group.

In Chapter 2 we develop and present an analytical complexity model to predict the execution time for Opal simulations with different input parameters. We calibrate the model with a systematic experimental design and show what we learned from a detailed analysis of computation and communication performance. In Chapter 3 we discuss the integration of performance monitoring into middleware packages and the problem of accurate accounting despite overlapped communication and computation. In Chapter 4 we use the model together with some architectural key data to predict the efficiency of Opal on alternative platforms including Clusters of PCs.

2 Case study: Instrumenting a molecular biology code

2.1 A brief description of Opal

Opal is a software package to perform the simulation of the molecular dynamics of proteins and nucleic acids in vacuum or in water through energy minimization. Opal uses classical mechanics, i.e., the Newtonian equations of motion, to compute the trajectories r_i(t) of n atoms as a function of time t. Newton's second law expresses the acceleration as:

m_i \frac{d^2 r_i(t)}{dt^2} = - \frac{\partial V(r_1(t), \ldots, r_n(t))}{\partial r_i(t)}

A typical function V has the form:

V(r_1, \ldots, r_n) = \sum_{all bonds} \tfrac{1}{2} K_b (b - b_0)^2
  + \sum_{all bond angles} \tfrac{1}{2} K_\Theta (\theta - \theta_0)^2
  + \sum_{improper dihedrals} \tfrac{1}{2} K_\xi (\xi - \xi_0)^2
  + \sum_{dihedrals} K_\varphi (1 + \cos(n\varphi - \delta))
  + \sum_{all pairs (i,j)} \left[ \frac{C12_{ij}}{r_{ij}^{12}} - \frac{C6_{ij}}{r_{ij}^{6}} + \frac{q_i q_j}{4\pi\varepsilon_0\varepsilon_r r_{ij}} \right]

The first term models the covalent bond-stretching interaction along bond b. The value of b_0 denotes the minimum-energy bond length, and the force constant K_b depends on the particular type of bond. The second term represents the bond-angle bending (three-body) interaction. The (four-body) dihedral-angle interactions consist of two terms: a harmonic term for dihedral angles ξ that are not allowed to make transitions, e.g., dihedral angles within aromatic rings or dihedral angles to maintain chirality, and a sinusoidal term for the other dihedral angles ϕ, which may make 360° turns. The last term captures the non-bonded interactions over all pairs of atoms. It is composed of the van der Waals and the Coulomb interactions between atoms i and j with charges q_i and q_j at a distance r_ij.
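The last term dominates the computation in practice. As an illustration, a minimal Python sketch (not the Opal FORTRAN code; the names c12, c6, q_i, q_j follow the symbols in the formula above) of the non-bonded energy of a single pair:

```python
import math

EPS0 = 8.8541878128e-12  # vacuum permittivity [F/m]

def nonbonded_pair_energy(r_ij, c12, c6, q_i, q_j, eps_r=1.0):
    """Van der Waals (Lennard-Jones) plus Coulomb energy of one atom pair
    at distance r_ij, following the last term of the interaction function V."""
    vdw = c12 / r_ij**12 - c6 / r_ij**6
    coulomb = (q_i * q_j) / (4.0 * math.pi * EPS0 * eps_r * r_ij)
    return vdw + coulomb
```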

A first serial version of Opal, Opal-2.6, was developed at the Institute of Molecular Biology and Biophysics at ETH Zürich [12]. It was written in standard FORTRAN-77 and optimized for vector supercomputers through a few vectorizable loops. In the serial code of Opal-2.6 a single processor runs the whole computation. Opal-2.6 spends most of the computing time during a simulation evaluating the non-bonded interactions over all pairs of atoms of the molecular system (the last term of the atomic interaction function V). Fortunately, these calculations also offer a high degree of parallelism in addition to the vectorizable inner loops.


The parallel version of Opal

The parallel version of Opal [17, 2] distributes its work among multiple processors in a client-server setting: multiple servers share the computation of the van der Waals and Coulomb forces while one client computes the few remaining interactions and coordinates the work. The computation repeats for every time step.

For a molecular complex of n atoms, the number of non-bonded interactions between atoms which must be evaluated is of the order of n². In the new version of Opal, this sequential complexity of the molecular energies evaluation is reduced by neglecting many of the non-bonded interactions in the molecular energy computation: only the pairs of atoms whose distance is less than a cut-off parameter are taken into account. At first, the data describing the non-bonding interaction parameters between the solute-solute, solute-solvent and solvent-solvent atom pairs are replicated on all the servers. This global information, whose volume depends on the problem size and does not scale with the number of processors, allows each server to work largely independently. With its data, each server runs its tasks of the simulation requesting no further parameters at each step from the client other than the atom coordinates.

A simulation proceeds by repeating the same computation tasks continuously. At the end of each step the information about the total energy, volume, pressure and temperature of the molecular complex is displayed. In the first stage of each simulation step, which we call the update phase, each server selects a distinct subset of the atom pairs, checks the distance between the atoms of each pair and adds the pair to its own list of all active pairs when the atoms are not beyond the given distance cut-off. In the second stage of the simulation step, the servers compute partial non-bonded energies (van der Waals energy and Coulomb energy) using the list of all active pairs. At the end of this step each server sends its partial results to the client, which gathers them and sums the total molecular energy of the molecular complex as well as its volume, pressure and temperature.

The data in each list are updated periodically. The interval between successive updates can be selected by the user through the setting of an Opal parameter called update. The value of the update parameter expresses the number of interaction steps after which the lists of all active pairs are updated.
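A minimal Python sketch of this two-phase step structure, with the update phase executed only on every update-th step (illustrative names only; the real Opal servers run vectorized FORTRAN and communicate through RPCs):

```python
import math

def build_active_pairs(coords, my_pairs, cutoff):
    """Update phase: recheck this server's subset of pairs and keep
    only those whose atoms lie within the cut-off distance."""
    return [(i, j) for (i, j) in my_pairs
            if math.dist(coords[i], coords[j]) <= cutoff]

def run_steps(coords, my_pairs, cutoff, steps, update, pair_energy):
    """Repeat the two phases: refresh the active-pair list every
    `update`-th step, evaluate partial energies on every step."""
    active, partials = [], []
    for step in range(steps):
        if step % update == 0:
            active = build_active_pairs(coords, my_pairs, cutoff)
        partials.append(sum(pair_energy(coords, i, j) for i, j in active))
    return partials  # one partial energy per step, gathered by the client
```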

The distribution of the atom pairs for the evaluation of the energies due to the non-bonded interactions is done using a pseudo-random strategy. Randomization should help to balance the workload among the servers and to avoid duplication of work.

Moreover, with a slight change of the molecular simulation model, i.e., the use of water molecules as single units centered in the oxygen atoms in the solvent instead of three individual atoms, we accomplished:

– a reduced workload of the servers,
– a reduction in size of the lists (memory usage),
– an increase in accuracy for the molecular energy calculations with small cut-off radii.

Alternative parallelizations for molecular simulations

The parallelization of the non-bonded pairwise energy computation through the distribution of the mass centers amongst several processors, as used for Opal, is not the only parallelization approach. There are three main approaches to the parallelization: the replicated-data (RD) method used for Opal, in which the mass centers (i.e. atoms) are distributed among the processors; the geometric- or space-decomposition (SD) method, in which each processor considers the mass centers in its sub-domain during the simulation; and the force-decomposition (FD) method, in which the force matrix F_ij (F_ij is the force on mass center i due to mass center j) is partitioned by blocks among the processors [15, 1].

Comparable packages in molecular biology

With its parallelization the parallel version of Opal has become similar to Amber [19]. Both codes allow the user to carry out energy minimization and molecular dynamics using the same analytical function (see 2.1). Moreover, both codes use molecular component interaction lists, which need periodic updates, and allow evaluating the interactions of each atom with the rest of the molecular complex within a cut-off distance. For the parallel version of Amber, a generalized MPI interface [7] is used for message-passing, while the parallel version of Opal relies on the PVM interface and the Sciddle RPC middleware package [4, 18]. Still both codes are explicitly parallel and well suited for distributed memory machines with a message passing API.

2.2 A time complexity model for Opal

During the design and parallelization of Opal we derived an analytical time complexity model that captures all essential parameters of the real application. The predicted outcome of the model is the execution time of Opal in seconds, written as a sum of several partial result variables which are computed separately and also measured separately during validation:

t_{OPAL} = t^{tot}_{comp} + t^{tot}_{comm} + t^{tot}_{sync}    (1)

t^{tot}_{comp}, the total parallel computation time, is the computation time spent by the servers servicing the requests for the computation. The servers run two routines as parallel work: the update routine that updates the lists of atom pairs, and the energy evaluation routine that evaluates the partial energies of the non-bonded interactions (van der Waals energy and Coulomb energy).

t^{tot}_{comp} = t_{update} + t_{nbint}    (2)

The computation time of the update routine always grows quadratically with the problem size because each time the servers update their own lists, all the pairs of atoms must be checked. At the same time, the update time decreases linearly with the increase of the time interval between two list updates.

t_{update}(n, \gamma) = \frac{s}{u} \, a_2 \, \frac{n^2}{2p}    (3)

where:

– s is the number of simulation steps.
– p is the number of servers on which the Opal application runs.
– u is the update parameter: the number of simulation steps between two successive list updates.
– n represents the number of mass centers (atoms and water molecules) of the whole molecular complex.
– γ (gamma) is the ratio of the number of water molecules to the total number of mass centers.
– a_2 represents the computation time spent to generate a pair of atoms and calculate the distance between them.

On the other hand, the time for the energy evaluation routine is subject to the effects of the cut-off parameter: the dimension of the lists over whose pairs the partial energies are evaluated increases drastically with the increase of the cut-off distance. The energy-evaluation time grows quadratically up to the number of atoms within the cut-off radius and linearly beyond that.

t_{nbint}(n, \tilde{n}) = \frac{s}{2p} \, a_3 \, n^2          when n ≤ ñ
t_{nbint}(n, \tilde{n}) = \frac{s}{2p} \, a_3 \, n \tilde{n}   when n > ñ    (4)

where a_3 represents the computation time spent to evaluate the non-bonded energies of one pair of the list, and ñ denotes the number of mass centers within the cut-off radius; ñ depends on the molecular complex volume as well as the cut-off parameter. For our simulations (see 2.5) the crossover happens for unrealistic numbers of water molecules or protein atoms, i.e. for sufficiently high values of n. Furthermore a reduction of the update frequency is possible to reduce the fraction of update computation arbitrarily and restore the relation of:

t_{nbint} ≫ t_{update}

We summarize the total parallel time as:

t^{tot}_{comp} = \frac{s}{2p} \, n^2 \left( a_3 + \frac{a_2}{u} \right)                  when n ≤ ñ    (5)

t^{tot}_{comp} = \frac{s}{2p} \, n \left( a_3 \tilde{n} + \frac{a_2}{u} n \right)        when n > ñ    (6)
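A small Python sketch of this computation-time model as reconstructed above (the constants a_2 and a_3 are assumed to come from the calibration measurements described later):

```python
def t_comp_total(s, p, u, n, n_tilde, a2, a3):
    """Total parallel computation time, eqs. (3)-(6): quadratic below
    the cut-off crossover n_tilde, linear in n above it."""
    if n <= n_tilde:
        return s / (2.0 * p) * n**2 * (a3 + a2 / u)          # eq. (5)
    return s / (2.0 * p) * n * (a3 * n_tilde + a2 / u * n)   # eq. (6)
```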

t^{tot}_{comm}, the total communication time, is the time spent by the communication processes between the client and the servers during the entire simulation. The client calls two different kinds of procedures (subroutines) that are run on the servers: the subroutine for list updates and the subroutine for energy evaluation. We enhanced the communication environment with some synchronization tools that allow us to separate the communication times properly from other computation and idle times, and therefore permit explaining all communication components precisely. More details about these synchronization tools and their underlying model are given in [17]. Thanks to this model the resulting communication time of the client's RPCs can be decomposed into:

t^{tot}_{comm} = \frac{s}{u} \left( t^{call}_{upd} + t^{return}_{upd} \right) + s \left( t^{call}_{nbi} + t^{return}_{nbi} \right)

with the call time of both RPCs determined by the transfer of the n atom coordinates:

t^{call}_{upd} = t^{call}_{nbi} = \frac{\alpha \, n}{a_1} + b_1    (7)

In addition to the quantities defined above we define:

– α (alpha) is the number of bytes used to represent the coordinates of a single atom.
– a_1 is the communication rate including the overhead in the communication environment (Sciddle and PVM).
– b_1 is the communication overhead, in seconds, used to transfer an empty block from the sender to the receiver.

For the update RPC, the client does not retrieve any data from the servers when they arrive at the end of the update routine: the client just waits for a result message which assures the end of the server tasks.

t^{return}_{upd} = b_1

For the energy evaluation RPC, each server returns only its few partial energy results, so the return time does not depend on n:

t^{return}_{nbi} = \frac{2\alpha}{a_1} + b_1

Summed over the s simulation steps and the p servers, these terms make up the total communication time t^{tot}_{comm}.
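Under the same reconstruction, the communication part of the model also reduces to a few lines; a Python sketch, taking a_1 as an effective rate in bytes per second:

```python
def t_rpc_call(n, alpha, a1, b1):
    # Ship n atom coordinates of alpha bytes each: payload over the
    # effective rate a1, plus the fixed per-message overhead b1.
    return alpha * n / a1 + b1

def t_comm_total(s, u, n, alpha, a1, b1):
    t_upd = t_rpc_call(n, alpha, a1, b1) + b1                 # empty return
    t_nbi = t_rpc_call(n, alpha, a1, b1) + 2 * alpha / a1 + b1
    return (s / u) * t_upd + s * t_nbi   # update RPCs only every u-th step
```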

t^{tot}_{sync}, the total synchronization time, is the sum of four terms, among them:

– t^{str}_{upd}, the total time to synchronize the client and the servers when the update routines finish,
– t^{str}_{nbi}, the total time to synchronize the client and the servers when the energy evaluation routines finish.

We assume that the four different synchronization times depend neither on the number of servers nor on the problem size. Our formulation just states that each term increases linearly in the number of simulation steps, and moreover the contribution of t^{str}_{upd} decreases as the update frequency decreases. We assume that each synchronization process takes a constant time b_5.

The factors of the experimental design used to calibrate the model are the number of servers, the problem size,

– the cut-off parameter (approximation properties), and
– the update frequency parameter (communication-computation balance).

For the calibration we consider one to seven servers; small, medium and large problems; full vs. partial updates; and two different cut-off radii - a small, effective one at 10 Å vs. a large, ineffective one at 60 Å. The response variables measured are the communication, parallel computation, sequential computation, idle and synchronization times, as listed in the composite formula. The experiments always run on a dedicated system and therefore there is no overhead on the measurements due to a time sharing environment. In a few preliminary tests, every measurement was repeated several times. The tests confirmed a low variability and a good reproducibility of the execution times, and we therefore concluded that ten simulation steps suffice to assure an accurate and meaningful timing of an entire simulation of the protein folding process, which we call a single experiment or case in this paper.
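The case grid behind these experiments is easy to enumerate; a Python sketch with the factor levels taken from the text (7 server counts x 3 problem sizes x 2 cut-off radii x 2 update settings, yielding the 84 cases of the full factorial design):

```python
from itertools import product

servers  = range(1, 8)              # one to seven servers
problems = ["small", "medium", "large"]
cutoffs  = [10.0, 60.0]             # Angstrom: effective vs. ineffective
updates  = [1, 10]                  # full update vs. partial update

cases = list(product(servers, problems, cutoffs, updates))
assert len(cases) == 84             # the full factorial design of 84 experiments
```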

2.4 Understanding the execution of Opal with our model

We investigate the performance of the Opal code by measurements of the simulation execution times of two molecular complexes with different sizes: the parallel computation time, the sequential computation time, the communication time, the synchronization time, and the idle time. We measure the detailed breakdown of the wall clock execution time for ten simulation steps.

The first molecular complex is a medium size example of the simulation problems that Opal can handle: it is the complex between the Antennapedia homeodomain from Drosophila and DNA [8], composed of 1575 atoms and immersed in 2714 water molecules, for a total of 4289 mass centers (medium problem size). The second molecular complex is considered to be a large size problem: it is the NMR structure of the LFB homeodomain, composed of 1655 atoms and immersed in 4634 water molecules, a total of 6289 mass centers (large problem size).

We run the code for different levels of parallelism: the number of servers ranges from one to seven. At the same time we measure the execution times when the simulation is fully accurate and the computation complexity is quadratic in the protein size (i.e. no cut-off) and when the simulation is approximate and consequently the computation complexity becomes linear (i.e. with cut-off). Finally, we investigate the role of the list updates: we run the simulation either with an update of our lists upon every iteration (full update) or with a partial update every 10 iterations (partial update). The comparison of the different cases permits studying the scalability, i.e., the execution times as they depend on increasing parallelism, and investigating the precise impact of the different problem sizes, the frequency of the list updates and the values of the cut-off parameter on the performance of the simulation.

Figures 1a)-d) display a detailed breakdown of the wall clock execution time for 10 simulation steps in the medium size molecular complex with different choices for the number of servers, the cut-off and the update parameters. The chart in Figure 1a) shows that without cut-off, the time in parallel computation is the largest fraction of the execution time and that it decreases as expected when more servers are added. At the same time the communication time increases about linearly with the number of servers, but its overall contribution remains small, even for seven servers. The synchronization time and the sequential computation time remain insignificant to the overall execution time. However, to the surprise of the Opal implementors, our instrumentation reveals a load balancing problem for runs with even numbers of processors. Figure 1b) shows an Opal execution with reduced updates. As expected, the lower update frequency does not affect the overall performance of simulations much because the large amount of parallel computation dominates the execution time. Figure 1c) shows a simulation with an effective cut-off parameter (at 10 Å). The cut-off parameter determines the asymptotic computational complexity: the amount of the parallel computation is smaller than in the cases above and its overall contribution becomes comparable to the other measured execution times. The sequential computation time, the synchronization time and the communication time gain a high importance for the overall performance. Figure 1d) displays a run with both the cut-off and the partial update option in effect. The frequency of the list updates leads to a notable difference in the performance of simulations with small cut-off radii. The problem size itself, the number of atoms of the whole molecular complex, has a varying impact on the different components of the execution time. The time components - the parallel computation time, the communication time, the idle time, the sequential computation time and the synchronization time - each increase in a different way with the number of atoms: while the size of the problem has a super-linear impact on the parallel computation time and the idle time, it has only a linear, moderate impact on the sequential computation time and the communication time. Figures 2a)-d) show the detailed breakdown of the wall clock execution time for 10 simulation steps in the large size molecular complex, where the order of the measured execution times doubles as we increase the problem size from the medium to the large molecular complex.


[Figure 1: Execution time of Opal on the Cray J90 for 10 simulation steps of the medium size molecular complex, over 1-7 servers: (a) no cut-off, full update; (b) no cut-off, partial update; (c) with cut-off, full update; (d) with cut-off, partial update. Time components: parallel comp. time, scalar comp. time, idle time, communication time, synchronization time.]

[Figure 2: Detailed breakdown of the measured execution times for 10 iterations of an Opal simulation of the large molecule, without cut-off parameter (i.e. for each atom all interactions are considered) versus with a cut-off parameter (i.e. only the interactions within the range of 10 Å are considered).]

We computed the outcome of the analytical model for each one of these cases and performed a least-squares comparison of the wall clock times measured for the execution on a Cray J90 SMP against the times predicted by the analytical model for the same machine, with different numbers of servers, different cut-off radii, update frequencies, and large or medium molecular complexes. For brevity of the paper we list only the data of a reduced design, although the data was achieved with a full factorial design of 84 experiments. During the calibration the differences between model and measurement have been investigated and plotted for each case. The overall fit of the model to the measurement for the cases in Figures 4a)-d) and for the remaining cases is excellent. The full data is listed in [17].

[Figure 3: Parameter space model of the Opal molecular simulation: problem size (small, medium, large) x cut-off radius (effective at 10 Å vs. 60 Å) x update frequency (full update (1) vs. partial update (0.1)). Legend: cases with calibration data shown in the paper; calibration data for the remaining cases available in the extended report.]

2.6 Space complexity of Opal

Execution time is not the only complexity measure that can be treated in this manner. A space complexity model for the memory usage is largely orthogonal to the execution time model.

Memory management remains a highly complex issue in the parallelization. The parallel Opal has been designed to use memory in the most economical way: each server holds only a part of the non-bonded interaction pairs of atoms. The dimension of the data lists on each server scales down linearly with the number of processors.

On the other hand, each server needs the same global data (information about the solute-solute, solute-solvent, and solvent-solvent non-bonded interactions) whose amount depends on the problem size, i.e., the number of water and molecular atoms. This global information is constant and does not scale up or down with the number of processors. The computation of these global data structures on each server involves a duplication of work but saves some communication. Furthermore, this computation of these constants takes place at the beginning of the simulation, and its cost is amortized over all time steps of the entire simulation. Once the global data is initialized, each server can execute its computation largely independently, because it evaluates its partial non-bonded interaction terms without requesting further parameters from the client except for some updated values of the atom coordinates. The size of the data structures in Opal grows with the problem size as shown in the table below:

Data structure    | Order        | Constant c [Bytes] | Large example, 6290 mass centers [Bytes]
pair list         | c n²/2       | 2*4                | 160'000'000
atom coordinates  | c (1+2γ) n   | 3*8                | 400'000
atom gradients    | c (1+2γ) n   | 3*8                | 400'000
atom interactions | c (1-γ)² n²  | 2*8                | 40'000'000
energy values     | c            | 2*8                | 16
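The large-example column can be checked directly from the orders and constants; a short Python sketch (γ taken as the water fraction of the large problem, 4634 water molecules out of 6289 mass centers):

```python
n = 6290                      # mass centers, large example
gamma = 4634 / 6289           # water fraction of the large problem

pair_list    = (2 * 4) * n**2 / 2              # ~160 MByte
coords       = (3 * 8) * (1 + 2 * gamma) * n   # ~0.4 MByte (3 coords/atom)
gradients    = (3 * 8) * (1 + 2 * gamma) * n   # ~0.4 MByte
interactions = (2 * 8) * ((1 - gamma) * n)**2  # ~40 MByte (solute atom pairs)

for name, size in [("pair list", pair_list), ("coordinates", coords),
                   ("gradients", gradients), ("interactions", interactions)]:
    print(f"{name:13s} {size / 1e6:10.1f} MByte")
```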

A more accurate space complexity model would only be useful if some interesting tradeoffs between space and time complexity could be identified. We did not find any interesting time-space tradeoffs, except for the obvious size of the working sets that influence execution speed through effects of the memory hierarchy, like the swapping of real physical memory (DRAM) for large virtual memories and the effects of the two levels of caches.

We ran some tests with the single processor version of Opal on our Pentium PC platforms to investigate the computational performance of the most significant loop (the computation of the non-bonded energies) for different working set sizes.

[Figure 4: The difference between the wall clock times measured and the times predicted by the analytical model, for 1-7 servers on the Cray J90: (a) large molecule, no cut-off, full update; (b) large molecule, no cut-off, partial update; (c) large molecule, with cut-off, full update; (d) medium problem, with cut-off, partial update.]

The absolute and relative computational performance based on the different working sets is stated in the subsequent table:

Working Set  | Size      | Computational Rate on Pentium 200 [MFlop/s] | Relative
in cache     | 50 KByte  | 35 | 1.09
in core      | 8 MByte   | 32 | 1.00
out of core  | 120 MByte | 8  | 0.25
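A measurement of this kind is straightforward to reproduce; a Python sketch that times a SAXPY-like kernel over growing working sets to expose the cache and main-memory boundaries (sizes illustrative; the out-of-core regime additionally needs a memory-starved machine to show paging effects):

```python
import time
import numpy as np

for mbytes in (0.05, 8, 120):             # in cache / in core / out of core
    n = int(mbytes * 1e6 / 8)             # number of 8-byte floats
    x, y = np.random.rand(n), np.random.rand(n)
    reps = max(1, int(2e7 / n))           # keep total work roughly constant
    t0 = time.perf_counter()
    for _ in range(reps):
        y += 2.0 * x                      # SAXPY-like: 2 flops per element
    dt = time.perf_counter() - t0
    print(f"{mbytes:6g} MByte: {2 * n * reps / dt / 1e6:8.1f} MFlop/s")
```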

After a few trial runs with different memory configurations, it appeared to us that blocking Opal for the caches is not beneficial for enhancing the performance. It seems that the inner loop of Opal remains CPU limited instead of memory limited. This observation is also confirmed by the nearly doubled performance for twin processor PC nodes (see upcoming sections). The performance breakdown for the out of core version of Opal is so drastic that such problem sizes would push the execution time immediately beyond the limit of an acceptable turnaround for one simulation.


On most systems we could use the hardware performance monitoring facilities to account more accurately for cache misses and strongly confirm the validity of our experiments above. Again these observations emphasized the need to instrument middleware for performance monitoring right from the beginning of an application's life cycle, i.e. as soon as the original application code is designed, implemented and parallelized. The authors of this paper can think of a few suitable blocking code transformations that would enhance locality for a better use of the caches, but we stopped our investigation after the trials mentioned, since we are no longer in control of the Opal production code distribution and therefore see no path to ship our improvements into the real world.

3 Integrating performance instrumentation with application design

3.1 The crux of overheads and loss of transparency due to middleware

The client-server structure of parallel Opal is ideally suited for Sciddle [4], a remote procedure call (RPC) system extension to the PVM communication library. Sciddle comprises a stub generator (the Sciddle compiler) and a run-time library. The stub generator reads the remote interface specification, i.e., the description of the subroutines exported by the servers, and generates the corresponding communication stubs. The stubs take care of translating an RPC into the necessary PVM message passing primitives. The application does not need to use PVM directly for RPCs: the client simply calls a subroutine (provided by the client stub), and the Sciddle run-time system invokes the corresponding server subroutine (via the server stub). It was, however, a deliberate decision in Sciddle to let the application writers code the process management (starting and terminating of servers) directly with PVM calls. Therefore a Sciddle application still needs to use a few PVM calls at the beginning and the end of a run.

Why use middleware like Sciddle

Sciddle is a highly portable communication library. It has been ported to Linux PCs, UNIX workstations, the Intel Paragon, and supercomputers like the Cray J90 and the NEC SX-4. In particular, Sciddle supports both PVM systems available for the Cray J90 SMPs, the network PVM and the shared memory PVM. Based on our experiences the Sciddle/PVM combination might sound like a very suboptimal solution for a single J90 Classic system, which also supports coherent shared memory well within a single system. However, at the time Opal was in development, our site was operating four Cray J90s interconnected by HIPPI, and the developers certainly had plans to use their parallel Opal version on a cluster of four J90 SMPs with 48 processors total. For such a cluster of SMPs message passing is a must and shared memory would not do. For most application codes, the additional overhead of Sciddle is very small [3], but Sciddle causes a lack of control over the PVM option flags and PVM internal operations (i.e. the proper use of data in place, shared memory flags). With a specific synthetic RPC test, Sciddle runs communication at about 7 MByte/s, which is just about as much as the Sciddle developers got out of a PVM ping-pong on the J90 [3]. Therefore we attribute the disastrously low communication performance of the J90 (an SMP machine with a fast crossbar) to the unpredictable performance of the Cray PVM implementation and the unfortunate interaction between middleware and communication library.

We understand that due to its internal architecture and its API, PVM is far away from a zero-copy message passing system. Therefore we suggested to the developers that Opal be rewritten in a clean "post-in-advance" style of MPI programming.

3.2 A plea for including hardware performance instrumentation into middleware

The tasking facilities of PVM and Sciddle interfere with the normal use of performance monitoring tools such as the HPM command on the Cray J90 systems or the corresponding tools on the Cray T3E MPP or Intel PC platforms. We worked with the implementors of Sciddle to integrate queries to the low overhead counter device (e.g. /dev/hpm) into the Sciddle code and to undertake the necessary accounting of the number of floating point operations executed and of the clock cycles used in the client and in the servers. In a setting with a high level abstraction RPC model, the performance instrumentation must adhere to the same high level abstractions and therefore be integrated into the middleware as well as the application code. The question of a good API standard for performance monitoring instrumentation is still an open one. Debuggers pose a very similar software engineering problem for the parallel programming world.
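The integration amounts to sampling the counters around each stub invocation. A Python sketch of the idea, with a hypothetical read_hw_counters() standing in for the platform-specific /dev/hpm queries (the real hooks live inside the generated Sciddle stubs):

```python
import functools

def read_hw_counters():
    """Hypothetical stand-in for a low overhead counter device such as
    /dev/hpm: returns (floating point operations, clock cycles)."""
    raise NotImplementedError("platform specific counter access")

def instrumented(stub):
    """Wrap an RPC stub so that flops and cycles are accounted per call."""
    @functools.wraps(stub)
    def wrapper(*args, **kwargs):
        f0, c0 = read_hw_counters()
        result = stub(*args, **kwargs)
        f1, c1 = read_hw_counters()
        wrapper.flops += f1 - f0
        wrapper.cycles += c1 - c0
        return result
    wrapper.flops = wrapper.cycles = 0
    return wrapper
```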


Sampling based tools give a direct estimate of the computation rate in MFlop/s and are easy to use, but they are extremely complex to understand in sufficient depth. Sampled computation rates are no substitute for the simple ratio of operations counted divided by the cycles used. The characteristic performance of the different machines for Opal runs in Table 1 in Section 4 shows how difficult it is to measure MFlop counts. The number of floating point operations required to compute exactly the same application results differs significantly, because of vectorizing transformations and the different implementations of intrinsic functions like sqrt() and exponentiate(). With a stand-alone performance monitoring tool we would just have believed the measured MFlop/s figures and possibly never learned of that fact, but just wondered why the MFlop/s numbers did not make much sense.

3.3 The disadvantage of overlapped communication and computation

There is no doubt that the Sciddle package accelerated the development of the application and that its additional overhead stays well within acceptable limits compared to the PVM message passing system. However, packages like Sciddle support and encourage the overlap of computation and communication, preventing a detailed quantification and correct accounting of the elapsed time for local computation, communication and idle waits due to load imbalance. In the parallel programming framework Sciddle was conceived for, it might be easy to measure and accumulate high level metrics like the server computation rate or the client computation rate for the entire application program, but low level indicators like communication efficiency, idle times, and load imbalance of single parts are much harder to get. The latter metrics are more relevant in the performance analysis. To overcome the difficulties of measuring and quantifying overheads in Sciddle, we propose a modification of the timing and synchronization behavior. The Sciddle environment does not provide explicit synchronization tools, but it allows a direct communication with the underlying PVM environment and therefore permits explicit synchronization. We introduce additional PVM barriers to separate the communication clearly from the computation: with these changes to the Sciddle communication environment, it is possible to measure or compute all these metrics directly. The barrier function lets the servers synchronize themselves explicitly with each other.
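A sketch of the resulting measurement discipline, using MPI barriers (mpi4py) as a stand-in for the PVM barrier calls we added to the Sciddle environment:

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD

def timed_phase(work):
    """Separate pure computation time from idle/synchronization time by
    synchronizing all processes before and after the parallel work."""
    comm.Barrier()                  # common start line for all servers
    t0 = MPI.Wtime()
    work()                          # local computation only
    t1 = MPI.Wtime()
    comm.Barrier()                  # load imbalance shows up as waiting here
    t2 = MPI.Wtime()
    return t1 - t0, t2 - t1         # (compute time, idle/sync time)
```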

Many papers have been written to show how to eliminate barriers and permit more overlap of communication and computation, but the potential benefit of overlap is often overestimated because of memory system bottlenecks in most machines. For the optimal accounting of times among the client and the servers we are forced to give up some of the overlap. To us, the accuracy, predictability and tight control of performance appears more important, and we happily accept a small slowdown (less than 5%) over the overlapped application for the sake of a solid understanding of what is going on with performance in the code.

The use of a shared communication channel between servers and client introduces contention due to limited resources at the end of a client compute phase. This contention is application specific and it is likely to surface with the original Sciddle implementation if all the servers perform exactly the same amount of work (i.e. when there are no server idle time values). The barriers in the modified Sciddle framework do not actually cause this effect, but merely expose this contention of the communication between the single client and multiple servers in all cases.

4 Performance Prediction for Alternative Platforms

In this last part of our paper, we use our analytic model together with some standard performance data of alternative computer platforms to predict the performance of Opal in the case that we could port the code to that platform. Two different classes of MPPs (Massively Parallel Multi-Processors) are considered for our study in addition to the real Opal platform, the Cray J90: first, the Cray T3E, a "big iron" MPP, and second, three different flavors of PC clusters called slow CoPs (Clusters of PCs), SMP CoPs and fast CoPs. We named the one PC cluster slow CoPs since it is optimized for lowest cost and gains its performance from a large number of slower nodes; weakly connected with a shared 100BaseT Ethernet medium, its uniprocessors are some older Intel Pentium Pro PCs running at 200 MHz. The SMP CoPs platform is based on similar Intel Pentium Pro processors, but in a twin processor configuration (2x200 MHz) and interconnected by an improved SCI shared memory interconnect technology. Finally the fast CoPs cluster features single 400 MHz Intel Pentium Pro PCs as nodes, connected by a Gigabit/s communication system based on fully switched Myrinet interconnects. Comparable Clusters of PCs installations are described in [5, 6, 16].


4.1 Extraction of model parameters for alternatives

As shown in Section 2.2, the parameters of our analytic model have been intentionally chosen in a way that includes all major technical data usually published for parallel machines. This includes among others: the message overhead, the message throughput for large messages, the computation rate for SAXPY inner loops and the time to synchronize all participating processors. For each new platform we determine the key parameters by the execution of a few micro-benchmarks, verified against published performance figures [9]. An overview of the data used is found in Tables 1 and 2.
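The two communication parameters of the model fall out of a standard ping-pong micro-benchmark; a Python sketch (mpi4py, run with exactly two processes) that estimates b_1 from a tiny message and a_1 from a large one:

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

def one_way_time(nbytes, reps=100):
    buf = np.zeros(nbytes, dtype=np.uint8)
    comm.Barrier()
    t0 = MPI.Wtime()
    for _ in range(reps):
        if rank == 0:
            comm.Send(buf, dest=1)
            comm.Recv(buf, source=1)
        else:
            comm.Recv(buf, source=0)
            comm.Send(buf, dest=0)
    return (MPI.Wtime() - t0) / (2 * reps)

b1 = one_way_time(1)                     # overhead dominated
t_large = one_way_time(4_000_000)        # bandwidth dominated
a1 = 4_000_000 / (t_large - b1)          # effective rate in bytes/s
if rank == 0:
    print(f"b1 = {b1 * 1e6:.1f} usec, a1 = {a1 / 1e6:.1f} MByte/s")
```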

Table 1: Execution time and floating point performance of the dominant Opal routine on a single node of each platform.

MPP Type                   | Time [s] | Floating Point Rate on single node [MFlop/s] | Relative Computation Rate (SMP CoPs = 100)
Cray T3E-900 (450 MHz)     | 9.56     | 85   | 52
Cray J90 Classic (100 MHz) | 6.18     | 80^a | 80
Slow CoPs (200 MHz)        | 10.00    | 32   | 50
SMP CoPs (2*200 MHz)       | 5.00     | 65   | 100
Fast CoPs (400 MHz)        | 4.85     | 67   | 102

^a For fall 98 our J90 Classics are scheduled for an upgrade to the new vector processors that significantly enhance the computational throughput. The performance improvement over the classic processor is expected to be six-fold.

[Table 2: Communication performance of the candidate platforms on a single node: observed throughput in MByte/s and message overhead for the Cray T3E-900 (MPI), the Cray J90 (PVM), the Slow CoPs (Ethernet), the SMP CoPs (SCI) and the Fast CoPs (Myrinet). The recoverable entries include an observed rate of about 3 MByte/s on the J90, observed rates of 30-100 MByte/s for the T3E and the Myrinet cluster, and message overheads ranging from 25 µsec to 10 msec on the shared Ethernet.]

As for many scientific codes, one routine of Opal dominates the compute performance. This routine has been benchmarked on each platform using the most accurate cycle counters and floating point performance monitoring hardware that is actually present on all four machine types. The most important surprise has been a significant difference in floating point operations for the different platforms, although the arithmetic was 64 bit in all cases and the results were precisely identical (or within the floating point epsilon for comparisons between Cray and IEEE arithmetic). The differences are due to the different compilers and different runtime libraries with intrinsic functions. We eliminate this difference by assuming that the best compiler (i.e. the PGI compiler for the PCs [10]) is setting a lower bound for the computation: we adjust the local computation rates (MFlop/s) of the other platforms accordingly.

The communication performance is even more difficult to compare in real applications. Some unfortunate interactions between the middleware and the PVM library reduce the measured communication rate on the Cray J90 processor to about 3 MByte/s, despite its more than one GByte/s strong crossbar interconnect between the 8 processor boards and the memory banks. The authors of the Sciddle middleware claim that they measured up to 7 MByte/s for a synthetic Sciddle RPC example and that this rate matches just about the performance of raw PVM 3.0 on the same machine [3]. It certainly remains below what this machine is capable of in shared memory mode.

We suspect that with the right configuration of PVM flags, or at least with a rewrite of the middleware to use MPI in true zero copy mode, we could significantly improve the performance of Opal on the J90, but such work is outside the scope of this performance study. For the other platforms we assumed an MPI or PVM based re-implementation without Sciddle and deduced our performance numbers mainly from MPI micro-benchmarks gathered by our students and from similar numbers published by independent researchers on the Internet (e.g. [9]).

4.2 Discussion

The complexity model incorporates the key technical data of most parallel machines as parameters. Therefore it is well suited for performance prediction.
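A sketch of such a prediction with the calibrated model (Python; the platform constants below are purely illustrative placeholders, the real values come from Tables 1 and 2 and from the calibration in Section 2):

```python
# Illustrative platform parameters: adjusted compute rate [MFlop/s],
# communication rate a1 [bytes/s] and message overhead b1 [s].
platforms = {
    "Cray J90":  {"mflops": 80,  "a1": 3e6,  "b1": 1e-3},
    "Slow CoPs": {"mflops": 50,  "a1": 10e6, "b1": 1e-3},
    "Fast CoPs": {"mflops": 102, "a1": 30e6, "b1": 25e-6},
}

def predict(plat, s, p, u, n, n_tilde, flops_per_pair, alpha=24):
    a3 = flops_per_pair / (plat["mflops"] * 1e6)      # seconds per pair
    comp = s / (2.0 * p) * n * a3 * min(n, n_tilde)   # eqs. (5)/(6), a2 term dropped
    comm = s * (1 + 1.0 / u) * (alpha * n / plat["a1"] + plat["b1"])
    return comp + comm

for name, plat in platforms.items():
    t = predict(plat, s=10, p=7, u=10, n=4289, n_tilde=1000, flops_per_pair=50)
    print(f"{name:9s} {t:8.2f} s")
```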

In the first two graphs of Figures 5a)-d) we look at the predicted execution times for 10 Opal iterations with a medium size molecule. The platforms include the Cray T3E MPP, the Cray J90 vector SMP (reference) and three clusters of PCs (fast CoPs, SMP CoPs and slow CoPs). Since we also list the absolute execution time in seconds, we can directly compare the performance of all platforms when 1-7 processors are used in Charts 5a) and 5c). The success or failure of the parallelization of the Opal code becomes most evident when we plot a relative speed-up with 1-7 processors in the Charts 5b) and 5d). The well specified synchronization model guarantees that we are not subject to the pitfalls of a badly chosen uni-processor implementation.

In the upper Charts 5a) and 5b) the cut-off radius is too large to reduce the computation and therefore the runs are largely compute bound. The execution time reflects the different compute performance of the different node processors, with a slight edge for the SMP CoPs architecture which diminishes as the number of processors increases. An entirely compute bound operation inevitably leads to excellent speedup, as seen in Chart 5b).

The main users of Opal in molecular biology assured us repeatedly that for certain problems a simulation with a 10 Å cut-off parameter is accurate enough to give new insights into the proteins studied. Therefore we ran the second test case of the same molecule with a computation reduced by cut-off. In the lower two Charts 5c) and 5d) the computation is accelerated with an effective cut-off parameter and therefore gradually becomes communication bound as the parallelism increases. In this case the communication performance of the machine does matter a lot. The Cray J90 and the slow CoPs (Ethernet) Cluster of PCs are severely limited by their slow communication hardware or by their bad software infrastructure for message passing. This is visible in the predicted execution times: as soon as the number of processors increases and exceeds the value of three, the overall execution time of the application on the Cray J90 and the slow CoPs (200 MHz with Ethernet) is no longer decreasing but rather increasing. The increase of the communication time offsets any gain due to parallel execution and leads to an overall loss of performance for a larger number of nodes. This aspect is displayed by some speed-up curves in Chart 5d) which actually turn into slow-down curves when too many nodes are added. For these two architectures we achieve no benefit in putting more than three processors to work.

For a small number of processors the SMP CoPs and fast CoPs architectures start out with a better execution time than the big MPP and vector SMP irons, possibly due to the better compiler. However, with the increase of the number of processors the speed of the Cray T3E MPP catches up quite rapidly due to the better communication system. This trend is also evident in the speed-up curves where the Cray T3E architecture achieves better gain and almost ideal


speed-up. For all platforms with a good communication system we can scale the application nicely to 7 processors with a speed-up of 4 or greater.

[Figure 5: Predicted execution time and relative speed-up for an Opal simulation of a medium problem size molecule on the different platforms (Cray T3E, Cray J90, Fast CoPs, SMP CoPs, Slow CoPs), for 1-7 servers: (a) execution time, no cut-off, full update; (b) speed-up, no cut-off, full update; (c) execution time, with cut-off, full update; (d) speed-up, with cut-off, full update.]

As we can see in the two Graphs 5c) and d), speed-up curves cannot be interpreted properly without looking at the absolute execution times simultaneously; while the Cray T3E MPP has by far the best speed-up, it still ends up behind the fast CoPs and SMP CoPs when comparing absolute performance for seven servers.

The same performance scalability relationships are reflected in Figures 6a)-d) for a large size problem. The charts show predicted execution times and speed-ups for a large problem. A comparison between the Charts 6a)-d) and 5a)-d) shows that the behavior of the execution time remains quite similar to the medium size problem. At the same time we notice that the increase of the amount of computation for a large size problem leads to slightly better speed-ups in Chart 6b). Still both charts indicate flat speed-up for more processors due to overhead in the communication systems. In Chart 6d) we do not have the extreme slowdown seen in Chart 5d), but we can conclude that the increase of the amount of computation has just pushed the point of the breakdown further outwards on the curve. With a larger number of processors we would probably encounter the same saturation point at which adding processors would stop increasing performance.


[Figure 6: Predicted execution time and relative speed-up for an Opal simulation of a large problem size molecule on the different platforms (Cray T3E, Cray J90, Fast CoPs, SMP CoPs, Slow CoPs), for 1-7 servers: (a) execution time, no cut-off, full update; (b) speed-up, no cut-off, full update; (c) execution time, with cut-off, full update; (d) speed-up, with cut-off, full update.]

5 Conclusion

Our case study of Opal showed common problems with the performance instrumentation in an application setting with RPC middleware for parallelization and PVM communication libraries. Some middleware had to be instrumented with hooks for performance monitoring, and the overlap of communication and computation had to be restricted slightly for a reliable accounting of execution times. We can state three potential benefits of the integrated approach to accurate performance evaluation, modeling and prediction in parallel programming: firstly, the analytic complexity model and a careful instrumentation for performance monitoring lead to a much better understanding of the resource demands of a parallel application. We realize that the basic application without cut-off is entirely compute bound and therefore parallelizes well regardless of the system. The optimization with an approximation algorithm using an effective cut-off radius changes the characteristics of the code into a communication critical application that requires a strong memory and communication system for good parallelization. Secondly, we discovered interesting anomalies in the implementation, e.g. the load imbalance for even numbers of servers and the differing number of floating point operations for different processors. Thirdly, we can use our model to predict with good certainty how the application would run on slow CoPs, SMP CoPs and fast CoPs, three low cost Cluster of PCs platforms connected by Gigabit networks like SCI or Myrinet. The migration of the Opal simulation code to the cluster of PCs platform could potentially free our upgraded Cray J90 SMP vector machines for more complex and memory intensive computations with less regularity. The predicted execution times and speedup figures indicate that a well designed cluster of PCs achieves similar, if not better performance than the J90 Classic vector processors currently used for Opal and that the computational efficiency compares favorably even to the T3E-900 for this particular application code.

Acknowledgments

We would like to express our thanks to all the people who helped us during this work. We are very grateful to Peter Arbenz, Walter Gander, Hans Peter Lüthi, and Urs von Matt, who created Sciddle to parallelize Opal, for their help, and particularly to Peter Arbenz and Urs von Matt for reading carefully through several drafts of our work. We sincerely thank Martin Billeter, Peter Güntert, Peter Luginbühl and Kurt Wüthrich, who created Opal, and particularly Peter Güntert for his help and his chemistry advice. We thank Carol Beaty of SGI/CRI and Bruno Löpfe of the ETH Rechenzentrum, who helped with our many questions about the Cray J90 and Cray PVM. We are also very grateful to Nick Nystrom and Sergiu Sanielevici of the Pittsburgh Supercomputer Center, who sponsored our parameter extraction runs for the performance prediction of the Cray T3E-900.

References

[1] P. M. Alsing. N-body problem: Force decomposition method. 1995. http://www.phys.unm.edu/phys500/lecture4/forcedecomp docs

[12] P. Luginbühl, P. Güntert, and M. Billeter. OPAL: User's Manual Version 2.2. ETH Zürich, Institut für Molekularbiologie und Biophysik, Zürich, Switzerland, 1995.

[13] P. Luginbühl, P. Güntert, M. Billeter, and K. Wüthrich. The new program OPAL for molecular dynamics simulations and energy refinements of biological macromolecules. J. Biomol. NMR, 1996. ETH-BIB P 820203.

[14] D. O'Hallaron, J. Shewchuk, and T. Gross. Architectural implications of a family of irregular applications. In Proc. 4th Symp. on High Performance Computer Architecture, Las Vegas, Jan 1998. IEEE. An extended version appeared as Technical Report CMU-CS-97-198, Carnegie Mellon School of Computer Science.

[15] S. Plimpton and B. Hendrickson. A new parallel method for molecular dynamics simulation of macromolecular systems. Sandia Technical Report SAND94-1862, 1994.

[16] Sobalvarro, Pakin, Chien, and Weihl. Dynamic coscheduling on workstation clusters. Proceedings of the International Parallel Processing Symposium (IPPS'98), March 30 - April 3, 1998.

[17] M. Taufer. Parallelization of the software package Opal for the simulation of molecular dynamics. Technical report, Swiss Federal Institute of Technology, Zurich, 1996.

[18] U. von Matt. Sciddle 4.0: User's guide. Technical report, Swiss Center for Scientific Computing, Zurich, 1996.

[19] P. K. Weiner and P. A. Kollman. AMBER: Assisted model building with energy refinement. A general program for modeling molecules and their interactions. J. Comp. Chem., (2), 1981.

Author biographies

Michela Taufer received her bachelors and masters degrees in computer science engineering from the University of Padua, Italy in 1996. She is currently a doctoral student at the Swiss Federal Institute of Technology (ETH) in Zürich, Switzerland and is working on high performance computing and database applications for clusters of PCs.

Thomas Stricker is currently an assistant professor of computer science at the Swiss Federal Institute of Technology (ETH) in Zürich. His research group is investigating architectures and applications of clusters of PCs that are interconnected with gigabit interconnect technologies. Thomas Stricker attended Carnegie Mellon University in Pittsburgh, USA for his Ph.D. studies, where he participated in several large systems building projects including the construction of the iWarp parallel machines. He also holds undergraduate degrees from ETH in Zürich and is a member of the ACM SIGARCH, SIGCOMM and the IEEE Computer Society.

