Journey To The Center of The Linux Kernel: Traffic Control, Shaping and QoS
Julien Vehent [http://jve.linuxwall.info] - see revisions [http://wiki.linuxwall.info/doku.php/en:ressources:dossiers:networking:traffic_control?do=revisions]
1 Introduction
This document describes the Traffic Control subsystem of the Linux Kernel in depth, algorithm by algorithm, and shows how it can be used to manage the outgoing traffic of a Linux system. Throughout the chapters, we will discuss both the theory behind Traffic Control and its usage, and demonstrate how one can gain complete control over the packets passing through their system.
a QoS graph

The initial target of this paper was to gain better control over a small DSL uplink, and it grew over time to cover a lot more than that. 99% of the information provided here can be applied to any type of server, as well as routers, firewalls, etc. The Traffic Control topic is large and in constant evolution, as is the Linux Kernel. The real credit goes to the developers behind the /net directory of the kernel, and to all of the researchers who created and improved these algorithms. This is merely an attempt to document some of this work for the masses. Any participation and comments are welcome, in particular if you spotted an inconsistency somewhere. Please email julien[at]linuxwall.info, your messages are always most appreciated.

For the technical discussion, since the LARTC mailing list doesn't exist anymore, try these two:

* Netfilter users mailing list [http://vger.kernel.org/vger-lists.html#netfilter] for general discussions
* NetDev mailing list [http://vger.kernel.org/vger-lists.html#netdev] is where the magic happens (developers ML)
2 Motivation
This article was initially published in the French issue of GNU/Linux Magazine France #127, in May 2010. GLMF is kind enough to provide a contract that releases the content of the article under Creative Commons after some time. I have extended the initial article quite a bit since, but you can still find the original French version here:
[http://wiki.linuxwall.info/doku.php/fr:ressources:dossiers:networking:traffic_control]
My interest in the traffic shaping subsystem of Linux started around 2005, when I decided to host most of the services I use myself. I read the documentation available on the subject (essentially LARTC [http://lartc.org/]) but found it incomplete and ended up reading the original research publications and the source code of Linux.

I host most of my Internet services myself, at home or on small end servers (dedibox and so on). This includes web hosting (this wiki, some blogs and a few websites), SMTP and IMAP servers, some FTP servers, XMPP, DNS and so on. French ISPs allow this, but only give you 1Mbps/128KBps of uplink, which corresponds to the TCP ACKs rate necessary for a 20Mbps downlink.

1Mbps is enough for most usage, but without traffic control, any weekly backup to an external location fills up the DSL link and slows down everyone on the network. During that time, both the visitors of this wiki and my wife chatting on Skype will experience high latency. This is not acceptable, because the priority of the weekly backup is a lot lower than the two others. Linux gives you the flexibility to shape the network traffic and use all of your bandwidth efficiently, without penalizing real-time applications. But this comes with a price, and the learning curve to implement an efficient traffic control policy is quite steep. This document provides an accurate and comprehensive introduction to the most used QoS algorithms, and gives you the tools to implement and understand your own policy.
3 The basics of Traffic Control
In the Internet world, everything is packets. Managing a network means managing packets: how they are generated, routed, transmitted, reordered, fragmented, etc. Traffic Control works on packets leaving the system. Its initial objective is not to manipulate packets entering the system (although you could do that, if you really want to slow down the rate at which you receive packets). The Traffic Control code operates between the IP layer and the hardware driver that transmits data on the network. We are discussing a portion of code that works on the lower layers of the network stack of the kernel. In fact, the Traffic Control code is the very one in charge of constantly furnishing packets to send to the device driver.

It means that the TC module, the packet scheduler, is permanently activated in the kernel. Even when you do not explicitly want to use it, it's there scheduling packets for transmission. By default, this scheduler maintains a basic queue (similar to a FIFO type queue) in which the first packet arrived is the first to be transmitted.
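You can see this default scheduler by listing the qdisc attached to an interface. On an untouched system, the output will look roughly like the line below, showing the default three-band FIFO that we will study later (a quick check, assuming eth0 is your outgoing interface):

# tc qdisc show dev eth0
qdisc pfifo_fast 0: root bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1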
At the core, TC is composed of queuing disciplines, or qdisc, that represent the scheduling policies applied to a queue. Several types of qdisc exist. If you are familiar with the way CPU schedulers work, you will find that some of the TC qdisc are similar. We have FIFO, FIFO with multiple queues, FIFO with hash and round robin (SFQ). We also have a Token Bucket Filter (TBF) that assigns tokens to a qdisc to limit its flow rate (no token = no transmission = wait for a token). This last algorithm was then extended to a hierarchical mode called HTB (Hierarchical Token Bucket). And also Quick Fair Queuing (QFQ), Hierarchical Fair Service Curve (HFSC), Random Early Detection (RED), etc.

For a complete list of algorithms, check out the source code at kernel.org
[http://git.kernel.org/?p=linux/kernel/git/next/linux-next.git;a=tree;f=net/sched;hb=HEAD].
3.1 First contact
Let's skip the theory for now and start with an easy example. We have a web server on which we would like to limit the flow rate of packets leaving the server. We want to fix that limit at 200 kilobits per second (25KB/s). This setup is fairly simple, and we need three things:

1. a Netfilter rule to mark the packets that we want to limit
2. a Traffic Control policy
3. a Filter to bind the packets to the policy
3.2 Netfilter MARK
Netfilter can be used to interact directly with the structure representing a packet in the kernel. This structure, the sk_buff [http://git.kernel.org/?p=linux/kernel/git/next/linux-next.git;a=blob;f=include/linux/skbuff.h], contains a field called __u32 nfmark that we are going to modify. TC will then read that value to select the destination class of a packet.

The following iptables rule will apply the mark '80' to outgoing packets (OUTPUT chain) sent by the web server (TCP source port is 80).
# iptables -t mangle -A OUTPUT -o eth0 -p tcp --sport 80 -j MARK --set-mark 80
We can control the application of this rule via the netfilter statistics:
# iptables -L OUTPUT -t mangle -v
Chain OUTPUT (policy ACCEPT 74107 packets, 109M bytes)
 pkts bytes target prot opt in  out   source    destination
73896  109M MARK   tcp  --  any eth0  anywhere  anywhere     tcp spt:www MARK xset 0x50/0xffffffff
You probably noticed that the rule is located in the mangle table. We will come back to that a little bit later.
3.3 Two classes in a tree
To manipulate TC policies, we need the /sbin/tc binary from the **iproute** package [http://www.linuxfoundation.org/collaborate/workgroups/networking/iproute2] (aptitude install iproute). The iproute package must match your kernel version. Your distribution's package manager will normally take care of that.

We are going to create a tree that represents our scheduling policy, and that uses the HTB scheduler. This tree will contain two classes: one for the marked traffic (TCP sport 80), and one for everything else.
# tc qdisc add dev eth0 root handle 1: htb default 20
# tc class add dev eth0 parent 1:0 classid 1:10 htb rate 200kbit ceil 200kbit prio 1 mtu 1500
# tc class add dev eth0 parent 1:0 classid 1:20 htb rate 824kbit ceil 1024kbit prio 2 mtu 1500
3.4 Connecting the marks to the tree
We now have, on one side, a traffic shaping policy, and on the other side, packet marking. To connect the two, we need a filter.

A filter is a rule that identifies packets (handle parameter) and directs them to a class (fw flowid parameter). Since several filters can work in parallel, they can also have a priority. A filter must be attached to the root of the QoS policy, otherwise it won't be applied.
# tc filter add dev eth0 parent 1:0 protocol ip prio 1 handle 80 fw flowid 1:10
We can test the policy using a simple client/server setup. Netcat is very useful for such testing. Start a listening process on the server that applies the policy using:
# nc -l -p 80 < /dev/zero
And connect to it from another machine using:
# nc 192.168.1.1 80 > /dev/null
The server process will send zeros (taken from /dev/zero) as fast as it can, and the client will receive them and throw them away, as fast as it can. Using iptraf to monitor the connection, we can supervise the bandwidth usage (bottom right corner).
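Another way to verify that the limit is enforced, without a separate monitoring tool, is to watch the counters of the classes themselves (a simple check, assuming the policy above is loaded on eth0):

# watch -n 1 tc -s class show dev eth0

The Sent counters of class 1:10 should grow at roughly 25KB/s while the transfer is running, while class 1:20 stays mostly idle.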
4 Twenty Thousand Leagues Under the Code
Now that we have enjoyed this first contact, it is time to go back to the fundamentals of the Quality of Service of Linux. The goal of this chapter is to dive into the algorithms that compose the traffic control subsystem. Later on, we will use that knowledge to build our own policy.

The code of TC is located in the net/sched directory of the sources of the kernel. The kernel separates the flows entering the system (ingress) from the flows leaving it (egress). And, as we said earlier, it is the responsibility of the TC module to manage the egress path.

The illustration below shows the path of a packet inside the kernel, where it enters (ingress) and where it leaves (egress). If we focus on the egress path, a packet arrives from layer 4 (TCP, UDP, ...) and then enters the IP layer (not represented here). The Netfilter chains OUTPUT and POSTROUTING are integrated in the IP layer and are located between the IP manipulation functions (header creation, fragmentation, ...). At the exit of the NAT table of the POSTROUTING chain, the packet is transmitted to the egress queue, and this is where TC starts its work.

Almost all devices use a queue to schedule the egress traffic. The kernel possesses algorithms that can manipulate these queues, they are the queuing disciplines (FIFO, SFQ, HTB, ...). The job of TC is to apply these queuing disciplines to the egress queue in order to select a packet for transmission.

TC works with the sk_buff [http://git.kernel.org/?p=linux/kernel/git/next/linux-next.git;a=blob;f=include/linux/skbuff.h] structure that represents a packet in the kernel. It doesn't manipulate the packet data directly. sk_buff is a shared structure accessible anywhere in the kernel, thus avoiding unnecessary duplication of data. This method is a lot more flexible and a lot faster because sk_buff contains all of the necessary information to manipulate the packet; the kernel therefore avoids copies of headers and payloads from different memory areas that would ruin the performance.
On a regular basis, the packet scheduler will wake up and launch any preconfigured scheduling algorithms to select a packet to transmit.

Most of the work is launched by the function dev_queue_xmit from net/core/dev.c, which only takes a sk_buff structure as input (this is enough, sk_buff contains everything needed, such as skb->dev, a pointer to the output NIC). dev_queue_xmit makes sure the packet is ready to be transmitted on the network, that the fragmentation is compatible with the capacity of the NIC, and that the checksums are calculated (if this is handled by the NIC, this step is skipped). Once those controls are done, and if the device has a queue in skb->dev->qdisc, then the sk_buff structure of the packet is added to this queue (via the enqueue function) and the qdisc_run function is called to select a packet to send.

This means that the packet that has just been added to the NIC's queue might not be the one immediately transmitted, but we know that it is ready for subsequent transmission the moment it is added to the queue.

To each device is attached a root queuing discipline. This is what we defined earlier when creating the root qdisc to limit the flow rate of the web server:
# tc qdisc add dev eth0 root handle 1: htb default 20
4.1 Classless Disciplines
This is a pretty long way down. For those who wish to deepen the subject, I recommend reading Understanding Linux Network Internals, by Christian Benvenuti at O'Reilly.

We now have a dequeue function whose role is to select the packet to send to the network interface. To do so, this function calls scheduling algorithms that we are now going to study. There are two types of algorithms: Classless and Classful. The classful algorithms are composed of qdisc that can contain classes, like we did in the previous example with HTB. In opposition, classless algorithms cannot contain classes, and are (supposedly) simpler.
4.1.1 PFIFO_FAST
Let's start with a small one. pfifo_fast is the default scheduling algorithm used when no other is explicitly defined. In other words, it's the one used on 99.9% of Linux systems. It is classless and a tiny bit more evolved than a basic FIFO queue, since it implements 3 bands working in parallel. These bands are numbered 0, 1 and 2 and emptied sequentially: while 0 is not empty, 1 will not be processed, and 2 will be the last. So we have 3 priorities: the packets in queue 0 will be sent before the packets in queue 2.

The kernel uses the Type of Service field (the 8-bit field from bit 8 to bit 15 of the IP header, see below) to select the destination band of a packet.
    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |Version|  IHL  |Type of Service|          Total Length         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |         Identification        |Flags|      Fragment Offset    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Time to Live |    Protocol   |         Header Checksum       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                       Source Address                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Destination Address                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Options                    |    Padding    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Example Internet Datagram Header, from RFC 791
This algorithm is defined in **net/sched/sch_generic.c** [http://git.kernel.org/?p=linux/kernel/git/next/linux-next.git;a=blob_plain;f=net/sched/sch_generic.c;hb=HEAD] and represented in the diagram below.
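Since the band is chosen from the Type of Service bits, you can influence which band a given traffic lands in by setting the TOS field from Netfilter. A minimal illustration (the port and TOS value are arbitrary examples, not part of the original setup):

# iptables -t mangle -A OUTPUT -p tcp --dport 22 -j TOS --set-tos Minimize-Delay

Packets carrying the Minimize-Delay TOS value are mapped by pfifo_fast's default priomap to band 0, and are therefore dequeued before bulk traffic.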
4.1.2 SFQ - Stochastic Fairness Queuing
Stochastic Fairness Queuing is an algorithm that shares the bandwidth without giving any privilege of any sort. The sharing is fair by being stochastic, or random if you prefer. The idea is to take a fingerprint, or hash, of the packet based on its header, and to use this hash to send the packet to one of the 1024 buckets available. The buckets are then emptied in a round-robin fashion.

The main advantage of this method is that no packet will have priority over another one. No connection can take over the others, and everybody has to share. The repartition of the bandwidth across the packets will almost always be fair, but there are some minor limitations. The main limitation is that the hashing algorithm might produce the same result for several packets, and thus send them to the same bucket. One bucket will then fill up faster than the others, breaking the fairness of the algorithm. To mitigate this, SFQ will modify the parameters of its hashing algorithm on a regular basis, by default every 10 seconds.

The diagram below shows how the packets are processed by SFQ, from entering the scheduler at the top, to being dequeued and transmitted at the bottom. The source code is available in net/sched/sch_sfq.c [http://git.kernel.org/?p=linux/kernel/git/next/linux-next.git;a=blob_plain;f=net/sched/sch_sfq.c;hb=HEAD]. Some variables are hardcoded in the source code:

* SFQ_DEFAULT_HASH_DIVISOR gives the number of buckets and defaults to 1024
* SFQ_DEPTH defines the depth of each bucket, and defaults to 128 packets
#define SFQ_DEPTH                128 /* max number of packets per flow */
#define SFQ_DEFAULT_HASH_DIVISOR 1024
4.1.2.1 SFQ Hashing algorithm
# tc qdisc add dev eth0 root handle 1: sfq perturb 10 quantum 3000 limit 64
# tc filter add dev eth0 parent 1:0 protocol ip handle 1 flow hash keys src,dst divisor 1024
4.2 Classful Disciplines
4.2.1 TBF - Token Bucket Filter
Until now, we looked at algorithms that do not allow us to control the amount of bandwidth. SFQ and PFIFO_FAST give the ability to smoothen the traffic, and even to prioritize it a bit, but not to control its throughput.

In fact, the main problem when controlling the bandwidth is to find an efficient accounting method. Because counting in memory is extremely difficult and costly to do in real time, computer scientists took a different approach here. Instead of counting the packets (or the bits transmitted by the packets, it's the same thing), the Token Bucket Filter algorithm sends, at a regular interval, a token into a bucket. This is disconnected from the actual packet transmission, but when a packet enters the scheduler, it will consume a certain number of tokens. If there are not enough tokens for it to be transmitted, the packet waits.

Until now, with SFQ and PFIFO_FAST, we were talking about packets, but with TBF we now have to look at the bits contained in the packets. Let's take an example: a packet carrying 8000 bits (1KB) wishes to be transmitted. It enters the TBF scheduler and TBF checks the content of its bucket: if there are 8000 tokens in the bucket, TBF destroys them and the packet can pass. Otherwise, the packet waits until the bucket has enough tokens.

The frequency at which tokens are added into the bucket determines the transmission speed, or rate. This is the parameter at the core of the TBF algorithm, shown in the diagram below.
Another particularity of TBF is to allow bursts. This is a natural side effect of the algorithm: the bucket fills up at a continuous rate, but if no packets are being transmitted for some time, the bucket will get completely full. Then, the next packets to enter TBF will be transmitted right away, without having to wait and without having any limit applied to them, until the bucket is empty. This is called a burst, and in TBF the burst parameter defines the size of the bucket.

So with a very large burst value, say 1,000,000 tokens, we would let a maximum of 83 fully loaded packets (roughly 124KBytes if they all carry their maximum MTU) traverse the scheduler without applying any sort of limit to them.

To overcome this problem, and provide better control over the bursts, TBF implements a second bucket, smaller and generally the same size as the MTU. This second bucket cannot store large amounts of tokens, but its replenishing rate will be a lot faster than the one of the big bucket. This second rate is called peakrate and it will determine the maximum speed of a burst.

Let's take a step back and look at those parameters again. We have:

* peakrate > rate: the second bucket fills up faster than the main one, to allow and control bursts. If the peakrate value is infinite, then TBF behaves as if the second bucket didn't exist. Packets would be dequeued according to the main bucket, at the speed of rate.
* burst > MTU: the size of the first bucket is a lot larger than the size of the second bucket. If the burst is equal to MTU, then peakrate is equal to rate and there is no burst.

So, to summarize, when everything works smoothly packets are enqueued and dequeued at the speed of rate. If tokens are available when packets enter TBF, those packets are transmitted at the speed of peakrate until the first bucket is empty. This is represented in the diagram above, and in the source code at net/sched/sch_tbf.c [http://git.kernel.org/?p=linux/kernel/git/next/linux-next.git;a=blob_plain;f=net/sched/sch_tbf.c;hb=HEAD], the interesting function being tbf_dequeue.
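To make these parameters concrete, here is what a TBF qdisc combining them could look like (an illustrative sketch, not taken from the original article; the values are arbitrary):

# tc qdisc add dev eth0 root tbf rate 1mbit burst 20k limit 30k mtu 1500 peakrate 2mbit

Steady traffic is shaped to 1mbit/s; after an idle period, up to 20KB can leave in a burst, but that burst is itself drained at no more than 2mbit/s, re-evaluated every 1500 bytes (mtu), and at most 30KB of packets are allowed to wait in the queue (limit).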
The configurable options of the TBF scheduler are listed in TC:
# tc qdisc add tbf help
Usage: ... tbf limit BYTES burst BYTES[/BYTES] rate KBPS [ mtu BYTES[/BYTES] ]
        [ peakrate KBPS ] [ latency TIME ] [ overhead BYTES ] [ linklayer TYPE ]
We recognize burst, rate, mtu and peakrate that we discussed above. The limit parameter is the size of the packet queue (see diagram above). latency represents another way of calculating the limit parameter, by setting the maximum time a packet can wait in the queue (the size of the queue is then derived from it, the calculation includes all of the values of burst, rate, peakrate and mtu). overhead and linklayer are two other parameters whose story is quite interesting. Let's take a look at those now.

4.2.1.1 DSL and ATM, the Ox that believed to be a Frog

If you have ever read Jean de La Fontaine [http://en.wikipedia.org/wiki/Jean_de_La_Fontaine], you probably know the story of The Frog that wanted to be as big as the Ox. Well, in our case, it's the opposite, and your packets might not be as small as they think they are.

While most local networks use Ethernet, up until recently a lot of communications (at least in Europe) were done over the ATM protocol. Nowadays, ISPs are moving toward all-over-IP, but ATM is still around. The particularity of ATM is to split large ethernet packets into much smaller ones, called cells. A 1500 bytes ethernet packet would be split into ~30 smaller ATM cells of just 53 bytes each. And from those 53 bytes, only 48 are from the original packet, the rest is occupied by the ATM headers.

So where is the problem? Consider the following network topology.
The QoS box is in charge of performing the packet scheduling before transmitting it to the modem. The packets are then split by the modem into ATM cells. So our initial 1.5KB ethernet packet is split into 32 ATM cells, for a total size of 32 * 5 bytes of headers per cell + 1500 bytes of data = (32*5)+1500 = 1660 bytes. 1660 bytes is 10.6% bigger than 1500. When ATM is used, we lose 10% of bandwidth compared to an ethernet network (this is an estimate that depends on the average packet size, etc).

If TBF doesn't know about that, and calculates its rate based on the sole knowledge of the ethernet MTU, then it will transmit 10% more packets than the modem can transmit. The modem will start queuing, and eventually dropping, packets. The TCP stacks will have to adjust their speed, traffic gets erratic and we lose the benefit of TC as a traffic shaper.

Jesper Dangaard Brouer did his Master's Thesis on this topic, and wrote a few patches for the kernel and TC. These patches implement the overhead and linklayer parameters, and can be used to inform the TBF scheduler of the type of link to account for.

* overhead represents the quantity of bytes added by the ATM headers, 5 by default
* linklayer defines the type of link, either ethernet or {atm, adsl}. atm and adsl are the same thing and represent a 5 bytes header overhead

We can use these parameters to fine tune the creation of a TBF scheduling discipline:
# tc qdisc add dev eth0 root tbf rate 1mbit burst 10k latency 30ms linklayer atm
# tc -s qdisc show dev eth0
qdisc tbf 8005: root rate 1000Kbit burst 10Kb lat 30.0ms
 Sent 738 bytes 5 pkt (dropped 0, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0
And by setting the linklayer to atm, we take into account an overhead of 5 bytes per cell later in the transmission, therefore preventing the modem from buffering the packets.

4.2.1.2 The limits of TBF

TBF gives a pretty accurate control over the bandwidth assigned to a qdisc. But it also imposes that all packets pass through a single queue. If a big packet is blocked because there are not enough tokens to send it, smaller packets that could potentially be sent instead are blocked behind it as well. This is the case represented in the diagram above, where packet #2 is stuck behind packet #1. We could optimize the bandwidth usage by allowing the smaller packet to be sent instead of the bigger one. We would, however, fall into the same problem of reordering packets that we discussed with the SFQ algorithm.

The other solution would be to give more flexibility to the traffic shaper, declare several TBF queues in parallel, and route the packets to one or the other using filters. We could also allow those parallel queues to borrow tokens from each other, in case one is idle and the other one is not. We just prepared the ground for classful qdisc, and the Hierarchical Token Bucket.
4.2.2 HTB - Hierarchical Token Bucket
The Hierarchical Token Bucket (HTB) is an improved version of TBF that introduces the notion of classes. Each class is, in fact, a TBF-like qdisc, and classes are linked together as a tree, with a root and leaves. HTB introduces a number of features to improve the management of bandwidth, such as the priority between classes, a way to borrow bandwidth from another class, or the possibility to plug another qdisc as an exit point (a SFQ for example).

Let's take a simple example, represented in the diagram below.
The tree is created with the commands below:
# tc qdisc add dev eth0 root handle 1: htb default 20
# tc class add dev eth0 parent 1:0 classid 1:10 htb rate 200kbit ceil 400kbit prio 1 mtu 1500
# tc class add dev eth0 parent 1:0 classid 1:20 htb rate 200kbit ceil 200kbit prio 2 mtu 1500
HTB uses a similar terminology to TBF and SFQ:

* burst is identical to the burst of TBF: it's the size of the token bucket
* rate is the speed at which tokens are generated and put in the bucket, the speed of the leaf, like in TBF
* quantum is similar to the quantum discussed in SFQ, it's the amount of bytes to serve from the leaf at once

The new parameters are ceil and cburst. Let us walk through the tree to see how they work.

In the example above, we have a root qdisc with handle 1 and two leaves, handle 10 and handle 20. The root will apply filters to decide where a packet should be directed, we will discuss those later, and by default packets are sent to leaf #20 (default 20). Leaf #10 has a rate value of 200kbits/s, a ceil value of 400kbits/s (which means it can borrow 200kbits/s more than its rate) and a priority (prio) of 1. Leaf #20 has a rate value of 200kbits/s, a ceil value of 200kbits/s (which means it cannot borrow anything, rate == ceil) and a priority of 2.

Each HTB leaf will, at any point, have one of the three following statuses:

* HTB_CANT_SEND: the class can neither send nor borrow, no packets are allowed to leave the class
* HTB_MAY_BORROW: the class cannot send using its own tokens, but can try to borrow from another class
* HTB_CAN_SEND: the class can send using its own tokens

Imagine a group of packets that enter TC and are marked with the flag #10, and therefore are directed to leaf #10. The bucket for leaf #10 does not contain enough tokens to let the first packets pass, so it will try to borrow some from its neighbor leaf #20. The quantum value of leaf #10 is set to the MTU (1500 bytes), which means the maximal amount of data that leaf #10 will try to send is 1500 bytes. If packet #1 is 1400 bytes large, and the bucket in leaf #10 has enough tokens for 1000 bytes, then the leaf will try to borrow the remaining 400 bytes from its neighbor leaf #20.

The quantum is the maximal amount of bytes that a leaf will try to send at once. The closer the value is to the MTU, the more accurate the scheduling will be, because we reschedule after every 1500 bytes. And the larger the value of quantum, the more a leaf will be privileged: it will be allowed to borrow more tokens from its neighbor. But of course, since the total amount of tokens in the tree is not unlimited, if a token is borrowed from a leaf, another leaf cannot use it anymore. Therefore, the bigger the value of quantum, the more a leaf is able to steal from its neighbor. This is tricky because those neighbors might very well have packets to send as well.

When configuring TC, we do not manipulate the value of quantum directly. There is an intermediary parameter called r2q that calculates the quantum automatically based on the rate: quantum = rate / r2q. By default, r2q is set to 10, so for a rate of 200kbits, quantum will have a value of 20kbits.

For very small or very large bandwidths, it is important to tune r2q properly. If r2q is too large, too many packets will leave a queue at once. If r2q is too small, not enough packets are sent. One important detail is that r2q is set on the root qdisc once and for all. It cannot be configured for each leaf separately.

TC offers the following configuration options for HTB:
Usage: ... qdisc add ... htb [default N] [r2q N]
 default  minor id of class to which unclassified packets are sent {0}
 r2q      DRR quantums are computed as rate in Bps/r2q {10}
 debug    string of 16 numbers each 0-3 {0}

... class add ... htb rate R1 [burst B1] [mpu B] [overhead O]
                      [prio P] [slot S] [pslot PS]
                      [ceil R2] [cburst B2] [mtu MTU] [quantum Q]
 rate     rate allocated to this class (class can still borrow)
 burst    max bytes burst which can be accumulated during idle period {computed}
 mpu      minimum packet size used in rate computations
 overhead per-packet size overhead used in rate computations
 linklay  adapting to a linklayer e.g. atm
 ceil     definite upper class rate (no borrows) {rate}
 cburst   burst but for ceil {computed}
 mtu      max packet size we create rate map for {1600}
 prio     priority of leaf; lower are served first {0}
 quantum  how much bytes to serve from leaf at once {use r2q}
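For example, if your leaves carry small rates, you may want to lower r2q when creating the root qdisc so that the computed quantum stays close to the MTU (an illustrative command, not part of the original example):

# tc qdisc add dev eth0 root handle 1: htb default 20 r2q 5

With r2q set to 5, the quantum of each leaf becomes rate / 5 instead of rate / 10, which keeps it closer to the MTU for low-rate classes.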
As you can see, we are now familiar with all of those parameters. If you just jumped to this section without reading about SFQ and TBF, please read those chapters for a detailed explanation of what those parameters do. Remember that, when configuring leaves in HTB, the sum of the rates of the leaves cannot be higher than the rate of the root. It makes sense, right?

4.2.2.1 Hysteresis and HTB

Hysteresis. If this barbarian word is not familiar to you, as it wasn't to me, here is how Wikipedia defines it: Hysteresis is the dependence of a system not just on its current environment but also on its past.

Hysteresis is a side effect introduced by an optimization of HTB. In order to reduce the load on the CPU, HTB initially did not recalculate the content of the bucket often enough, therefore allowing some classes to consume more tokens than they actually held, without borrowing. The problem was corrected and a parameter introduced to allow or block the usage of estimates in HTB calculations. The kernel developers kept the optimization feature simply because it can prove useful in high traffic networks, where recalculating the content of the bucket each time is simply not doable.

But in most cases, this optimization is simply deactivated, as shown below:
# cat /sys/module/sch_htb/parameters/htb_hysteresis
0
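If your kernel exposes this module parameter as writable (this depends on the build, so treat it as an assumption), it can be toggled at runtime without reloading sch_htb:

# echo 1 > /sys/module/sch_htb/parameters/htb_hysteresis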
4.2.3 HFSC - Hierarchical Fair Service Curve
[http://nbd.name/gitweb.cgi?p=openwrt.git;a=tree;f=package/qos-scripts/files;h=71d89f8ad63b0dda0585172ef01f77c81970c8cc;hb=HEAD]
4.2.4 QFQ - Quick Fair Queueing
[http://www.youtube.com/watch?v=r8vBmybeKlE]
4.2.5 RED - Random Early Detection
4.2.6 CHOKe - CHOose and {Keep, Kill}
5 Shaping the traffic on the Home Network
Home networks are tricky to shape, because everybody wants the priority and it's difficult to predetermine a usage pattern. In this chapter, we will build a TC policy that answers general needs. Those are:

* Low latency. The uplink is only 1.5Mbps and the latency shouldn't be more than 30ms under high load. We can tune the buffers in the qdisc to ensure that our packets will not stay in a large queue for 500ms waiting to be processed.
* High UDP responsiveness, for applications like Skype and DNS queries.
* Guaranteed HTTP/S bandwidth: half of the uplink is dedicated to the HTTP traffic (although other classes can borrow from it) to ensure that web browsing, probably 80% of a home network's usage, is smooth and responsive.
* TCP ACKs and SSH traffic get higher priority. In the age of Netflix and HD VoD, it's necessary to ensure fast download speed. And for that, you need to be able to send TCP ACKs as fast as possible. This is why those packets get a higher priority than the HTTP traffic.
* A general class for everything else.

This policy is represented in the diagram below. We will use PFIFO_FAST and SFQ termination qdiscs once we exit HTB to perform some additional scheduling (and prevent a single HTTP connection from eating all of the bandwidth, for example).
The script that generates this policy is available on github via the icon below, with comments to help you follow through.
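To make the structure easier to follow before diving into the excerpt, here is a condensed sketch of the skeleton the script builds. The handles and rates are taken from the load output shown later in this section; the prio values of most leaves are assumptions (only the SSH and HTTP leaves are confirmed by the excerpt and statistics below):

# root qdisc, unclassified traffic falls into class 999
/sbin/tc qdisc add dev eth0 root handle 1: htb default 999
# parent class capping the whole uplink at 1600kbit
/sbin/tc class add dev eth0 parent 1: classid 1:1 htb rate 1600kbit ceil 1600kbit
# leaves: interactive/UDP, TCP ACKs, SSH, HTTP/S, default
/sbin/tc class add dev eth0 parent 1:1 classid 1:100 htb rate 160kbit ceil 1600kbit prio 1
/sbin/tc class add dev eth0 parent 1:1 classid 1:200 htb rate 320kbit ceil 1600kbit prio 2
/sbin/tc class add dev eth0 parent 1:1 classid 1:300 htb rate 160kbit ceil 1120kbit prio 3
/sbin/tc class add dev eth0 parent 1:1 classid 1:400 htb rate 800kbit ceil 1600kbit prio 4
/sbin/tc class add dev eth0 parent 1:1 classid 1:999 htb rate 160kbit ceil 1600kbit prio 5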
Get the bash script from github. Below is one of the sections, in charge of the creation of the class for SSH. I have replaced the variables with their values for readability.
# SSH class: for outgoing connections to
# avoid lag when somebody else is downloading
# however, an SSH connection cannot fill up
# the connection to more than 70%
echo "# ssh - id 300 - rate 160 kbit ceil 1120 kbit"
/sbin/tc class add dev eth0 parent 1:1 classid 1:300 htb \
    rate 160kbit ceil 1120kbit burst 15k prio 3

# SFQ will mix the packets if there are several
# SSH connections in parallel
# and ensure that none has the priority
echo "# ~ sub ssh: sfq"
/sbin/tc qdisc add dev eth0 parent 1:300 handle 1300: \
    sfq perturb 10 limit 32

echo "# ~ ssh filter"
/sbin/tc filter add dev eth0 parent 1:0 protocol ip \
    prio 3 handle 300 fw flowid 1:300

echo "# ~ netfilter rule - SSH at 300"
/sbin/iptables -t mangle -A POSTROUTING -o eth0 -p tcp \
    --tcp-flags SYN SYN --dport 22 -j CONNMARK \
    --set-mark 300
The first rule is the definition of the HTB class, the leaf. It connects back to its parent 1:1, defines a rate of 160kbit/s and can use up to 1120kbit/s by borrowing the difference from other leaves. The burst value is set to 15k, which is 10 full packets with an MTU of 1500 bytes.

The second rule defines a SFQ qdisc connected to the HTB one above. That means that once packets have passed the HTB leaf, they will pass through a SFQ leaf before being transmitted. The SFQ will ensure that multiple parallel connections are mixed before being transmitted, and that one connection cannot eat the whole bandwidth. We limit the size of the SFQ queue to 32 packets, instead of the default of 128.

Then comes the TC filter in the third rule. This filter will check the handle of each packet, or, to be more accurate, the value of nf_mark in the sk_buff representation of the packet in the kernel. Using this mark, the filter will direct SSH packets to the HTB leaf above. Even though this rule is located in the SSH class block for clarity, you might have noticed that the filter has the root qdisc for parent (parent 1:0). Filters are always attached to the root qdisc, and not to the leaves. That makes sense, because the filtering must be done at the entrance of the traffic control layer.

And finally, the fourth rule is the iptables rule that applies a mark to SYN packets leaving the gateway (connection establishments). Why SYN packets only? To avoid performing complex matching on all the packets of all the connections. We will rely on netfilter's capability to maintain stateful information to propagate a mark placed on the first packet of the connection to all of the other packets. This is done by the following rule at the end of the script:
echo"#~propagatingmarksonconnections" iptablestmangleAPOSTROUTINGjCONNMARKrestoremark
Let us now load the script on our gateway, and visualize the qdiscs created.
# /etc/network/if-up.d/lnw_gateway_tc.sh start
~~~~ LOADING eth0 TRAFFIC CONTROL RULES FOR ramiel ~~~~
# cleanup
RTNETLINK answers: No such file or directory
# define a HTB root qdisc
# uplink - rate 1600 kbit ceil 1600 kbit
# interactive - id 100 - rate 160 kbit ceil 1600 kbit
# ~ sub interactive: pfifo
# ~ interactive filter
# ~ netfilter rule - all UDP traffic at 100
# tcp acks - id 200 - rate 320 kbit ceil 1600 kbit
# ~ sub tcp acks: pfifo
# ~ filtre tcp acks
# ~ netfilter rule for TCP ACKs will be loaded at the end
# ssh - id 300 - rate 160 kbit ceil 1120 kbit
# ~ sub ssh: sfq
# ~ ssh filter
# ~ netfilter rule - SSH at 300
# http branch - id 400 - rate 800 kbit ceil 1600 kbit
# ~ sub http branch: sfq
# ~ http branch filter
# ~ netfilter rule - http/s
# default - id 999 - rate 160 kbit ceil 1600 kbit
# ~ sub default: sfq
# ~ filtre default
# ~ propagating marks on connections
# ~ Mark TCP ACKs flags at 200
Traffic Control is up and running

# /etc/network/if-up.d/lnw_gateway_tc.sh show
 qdiscs details
qdisc htb 1: root refcnt 2 r2q 40 default 999 direct_packets_stat 0 ver 3.17
qdisc pfifo 1100: parent 1:100 limit 10p
qdisc pfifo 1200: parent 1:200 limit 10p
qdisc sfq 1300: parent 1:300 limit 32p quantum 1514b flows 32/1024 perturb 10sec
qdisc sfq 1400: parent 1:400 limit 32p quantum 1514b flows 32/1024 perturb 10sec
qdisc sfq 1999: parent 1:999 limit 32p quantum 1514b flows 32/1024 perturb 10sec

 qdiscs statistics
qdisc htb 1: root refcnt 2 r2q 40 default 999 direct_packets_stat 0
 Sent 16776950 bytes 125321 pkt (dropped 4813, overlimits 28190 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0
qdisc pfifo 1100: parent 1:100 limit 10p
 Sent 180664 bytes 1985 pkt (dropped 0, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0
qdisc pfifo 1200: parent 1:200 limit 10p
 Sent 5607402 bytes 100899 pkt (dropped 4813, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0
qdisc sfq 1300: parent 1:300 limit 32p quantum 1514b perturb 10sec
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0
qdisc sfq 1400: parent 1:400 limit 32p quantum 1514b perturb 10sec
 Sent 9790497 bytes 15682 pkt (dropped 0, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0
qdisc sfq 1999: parent 1:999 limit 32p quantum 1514b perturb 10sec
 Sent 1198387 bytes 6755 pkt (dropped 0, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0
The output above shows just two of the types of output tc can generate. You might find the class statistics to be helpful to diagnose leaf consumption:
# tc -s class show dev eth0
[...truncated...]
class htb 1:400 parent 1:1 leaf 1400: prio 4 rate 800000bit ceil 1600Kbit burst 30Kb cburst 1600b
 Sent 10290035 bytes 16426 pkt (dropped 0, overlimits 0 requeues 0)
 rate 23624bit 5pps backlog 0b 0p requeues 0
 lended: 16424 borrowed: 2 giants: 0
 tokens: 4791250 ctokens: 120625
Above are shown the detailed statistics for the HTTP leaf, and you can see the accumulated rate, statistics of packets per second, but also the tokens accumulated, lended, borrowed, etc. This is the most helpful output to diagnose your policy in depth.
6 A word about "Buffer Bloat"
We mentioned that too large buffers can have a negative impact on the performance of a connection. But how bad is it exactly? The answer to that question was investigated by Jim Gettys [http://gettys.wordpress.com/bufferbloat-faq/] when he found his home network to be inexplicably slow. He found that, while we were increasing the bandwidth of network connections, we didn't worry about the latency at all. Those two factors are quite different and both critical to the good quality of a network. Allow me to quote Gettys's FAQ here:
A 100 Gigabit network is always faster than a 1 megabit network, isn't it? More bandwidth is always better! I want a faster network!

No, such a network can easily be much slower. Bandwidth is a measure of capacity, not a measure of how fast the network can respond. You pick up the phone to send a message to Shanghai immediately, but dispatching a cargo ship full of blu-ray disks will be amazingly slower than the telephone call, even though the bandwidth of the ship is billions and billions of times larger than the telephone line. So more bandwidth is better only if its latency (speed) meets your needs. More of what you don't need is useless.

Bufferbloat destroys the speed we really need.
On one machine, launch nttcp with the '-i' switch to make it wait for connections:
# nttcp -i
-n is the number of buffers of 4096 bytes given to the socket.
# nttcp -t -D -n 2048000 192.168.1.220
And at the same time, on the laptop, launch a ping of the desktop.
64 bytes from 192.168.1.220: icmp_req=1 ttl=64 time=0.300 ms
64 bytes from 192.168.1.220: icmp_req=2 ttl=64 time=0.386 ms
64 bytes from 192.168.1.220: icmp_req=3 ttl=64 time=19.2 ms
64 bytes from 192.168.1.220: icmp_req=4 ttl=64 time=19.2 ms
64 bytes from 192.168.1.220: icmp_req=5 ttl=64 time=19.2 ms
64 bytes from 192.168.1.220: icmp_req=6 ttl=64 time=19.2 ms
64 bytes from 192.168.1.220: icmp_req=7 ttl=64 time=19.3 ms
64 bytes from 192.168.1.220: icmp_req=8 ttl=64 time=19.0 ms
64 bytes from 192.168.1.220: icmp_req=9 ttl=64 time=0.281 ms
64 bytes from 192.168.1.220: icmp_req=10 ttl=64 time=0.362 ms
The first two pings are launched before nttcp is started. When nttcp starts, the latency increases, but this is still acceptable.

Now, reduce the speed of each network card on the desktop and the laptop to 100Mbps. The command is:
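The exact command is not reproduced here; with ethtool it would typically look like the line below (an assumption, adjust the interface name to your setup):

# ethtool -s eth0 speed 100 duplex full autoneg off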
And run the same test again. After 60 seconds, here are the latencies I get:
64 bytes from 192.168.1.220: icmp_req=75 ttl=64 time=183 ms
64 bytes from 192.168.1.220: icmp_req=76 ttl=64 time=179 ms
64 bytes from 192.168.1.220: icmp_req=77 ttl=64 time=181 ms
And one last time, with an Ethernet speed of 10Mbps:
64 bytes from 192.168.1.220: icmp_req=187 ttl=64 time=940 ms
64 bytes from 192.168.1.220: icmp_req=188 ttl=64 time=937 ms
64 bytes from 192.168.1.220: icmp_req=189 ttl=64 time=934 ms
We start by changing the txqueuelen value on the laptop machine from 1000 to zero. The latency does not change.
# ifconfig eth0 txqueuelen 0
64 bytes from 192.168.1.220: icmp_req=1460 ttl=64 time=970 ms
64 bytes from 192.168.1.220: icmp_req=1461 ttl=64 time=967 ms
Then we reduce the size of the TX ring of the ethernet card. Now that we don't have any buffer anymore, let's see what happens:
# ethtool -G eth0 tx 32
64 bytes from 192.168.1.220: icmp_req=1495 ttl=64 time=937 ms
64 bytes from 192.168.1.220: icmp_req=1499 ttl=64 time=0.865 ms
64 bytes from 192.168.1.220: icmp_req=1500 ttl=64 time=60.3 ms
64 bytes from 192.168.1.220: icmp_req=1501 ttl=64 time=53.1 ms
64 bytes from 192.168.1.220: icmp_req=1502 ttl=64 time=49.2 ms
64 bytes from 192.168.1.220: icmp_req=1503 ttl=64 time=45.7 ms
The latency just got divided by 20! We dropped from almost one second to barely 50ms. This is the effect of excessive buffering in a network, and this is what happens, today, in most Internet routers.
6.1 What happens in the buffer?
If we take a look at the Linux networking stack, we see that the TCP stack sits a lot above the transmit queue and ethernet buffer. During a normal TCP connection, the TCP stack starts sending and receiving packets at a normal rate, and accelerates its sending speed at an exponential rate: send 2 packets, receive ACKs, send 4 packets, receive ACKs, send 8 packets, receive ACKs, send 16 packets, receive ACKs, etc. This is known as TCP Slow Start [http://tools.ietf.org/html/rfc5681]. This mechanism works fine in practice, but the presence of large buffers will break it.

A buffer of 1MB on a 1Gbit/s link will empty in ~8 milliseconds. But the same buffer on a 1Mbit/s link will take 8 seconds to empty. During those 8 seconds, the TCP stack thinks that all of the packets it sent have been transmitted, and will probably continue to increase its sending speed. The subsequent packets will get dropped, the TCP stack will panic, drop its sending rate, and restart the slow start procedure from 0: 2 packets, get ACK, 4 packets, get ACK, etc.

But while the TCP stack was filling up the TX buffers, all the other packets that our system wanted to send got either stuck somewhere in the queue, with several hundreds of milliseconds of delay before being transmitted, or purely dropped. The problem happens on the TX queue of the sending machine, but also on all the buffers of the intermediary network devices. And this is why Gettys went to war against the home router vendors.
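The drain times quoted above come straight from the ratio between buffer size and link rate:

drain time = buffer size / link rate
1 MB buffer = 8,000,000 bits
8,000,000 bits / 1 Gbit/s ≈ 0.008 s = 8 ms
8,000,000 bits / 1 Mbit/s = 8 s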
Discussion
Paul Bixel, 2011/12/08 00:46

This is an interesting article and I am especially interested in your discussion of the linklayer atm option now supported by TC. This kind of explanation is needed because there is so little written about the linklayer option and the proper settings for mtu/mpu/tsize & overhead. In your discussion you mention the overhead parameter is defaulted to 5 and it is implied that it is not necessary therefore to specify it when atm is used. But according to http://ace-host.stuart.id.au/russell/files/tc/tc-atm [http://ace-host.stuart.id.au/russell/files/tc/tc-atm] the overhead parameter is