Reinforcement Learning Through The Optimization Lens

Transcription

Reinforcement learning through the optimization lens
Benjamin Recht
University of California, Berkeley

trustable, scalable, predictable

Control Theory / Reinforcement Learning is the study of how to use past data to enhance the future manipulation of a dynamical system.

Disciplinary Biases

Control Theory (AE/CE/EE/ME; IEEE Transactions): continuous, model, action
Reinforcement Learning (CS; Science Magazine): discrete, data, action

Today's talk will try to unify these camps and point out how to merge their perspectives.

Main research challenge: What are the fundamental limits of learning systems that interact with the physical environment?

How well must we understand a system in order to control it?

Theoretical foundations: statistical learning theory, robust control theory, core optimization

Control theory is the study of dynamical systems with inputs

[Diagram: system G with input u, state x, and output y]

The simplest such systems are linear systems:

$$x_{t+1} = A x_t + B u_t, \qquad y_t = C x_t + D u_t$$

$x_t$ is called the state, and the dimension of the state is called the degree, $d$.
$u_t$ is called the input, and its dimension is $p$.
$y_t$ is called the output, and its dimension is $q$.

For today, we will only consider $C = I$, $D = 0$ ($x_t$ observed).
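To make the linear recursion concrete, here is a minimal sketch in plain Python that rolls the state forward under $x_{t+1} = A x_t + B u_t$ with the state fully observed ($C = I$, $D = 0$). The helper names (`mat_vec`, `step`, `simulate`) and the double-integrator matrices are illustrative assumptions, not taken from the talk.

```python
def mat_vec(M, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def step(A, B, x, u):
    """One step of the linear dynamics x_{t+1} = A x_t + B u_t."""
    Ax = mat_vec(A, x)
    Bu = mat_vec(B, u)
    return [a + b for a, b in zip(Ax, Bu)]


def simulate(A, B, x0, inputs):
    """Return the state trajectory [x_0, x_1, ..., x_T] for an input sequence."""
    trajectory = [list(x0)]
    for u in inputs:
        trajectory.append(step(A, B, trajectory[-1], u))
    return trajectory


# Hypothetical example (assumption): a double integrator with d = 2, p = 1.
A = [[1.0, 1.0],
     [0.0, 1.0]]
B = [[0.0],
     [1.0]]
trajectory = simulate(A, B, x0=[0.0, 0.0], inputs=[[1.0], [1.0], [0.0]])
```

Because $C = I$ and $D = 0$, the output equals the state, so `trajectory` is also the sequence of observations.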

Reinforcement Learning is the study of discrete dynamical systems with inputs

[Diagram: system G with input u, state x, and output y]

$$p(x_{t+1} \mid \text{past}) = p(x_{t+1} \mid x_t, u_t), \qquad p(y_t \mid \text{past}) = p(y_t \mid x_t, u_t)$$

Simplest example: Partially Observed Markov Decision Process (POMDP)

$x_t$ is the state, and it takes values in $[d]$.
$u_t$ is called the input, and takes values in $[p]$.
$y_t$ is called the output, and takes values in $[q]$.

For today, we will only consider the case where $x_t$ is observed (MDP).
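The discrete case can be sketched the same way: a small MDP stored as a transition table, where the next state is sampled from $p(x_{t+1} \mid x_t, u_t)$. The table `P`, the policy, and the function names below are assumptions made up for illustration; states and inputs are indexed from 0.

```python
import random


def sample_next_state(P, x, u, rng):
    """Sample x_{t+1} from p(. | x_t = x, u_t = u), stored as P[x][u]."""
    probs = P[x][u]
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def rollout(P, x0, policy, horizon, rng):
    """Simulate `horizon` transitions, choosing each input from the observed state."""
    states = [x0]
    for _ in range(horizon):
        x = states[-1]
        states.append(sample_next_state(P, x, policy(x), rng))
    return states


# Hypothetical MDP (assumption): 2 states, 2 inputs.
# P[x][u] is the distribution over the next state.
P = [
    [[0.9, 0.1], [0.0, 1.0]],  # from state 0 under input 0, input 1
    [[1.0, 0.0], [0.2, 0.8]],  # from state 1 under input 0, input 1
]


def alternate(x):
    """Pick the input whose transition is deterministic in this table."""
    return 1 if x == 0 else 0


rng = random.Random(0)
states = rollout(P, x0=0, policy=alternate, horizon=4, rng=rng)
```

The policy here only ever selects the deterministic rows of `P`, so the rollout alternates between the two states regardless of the seed; choosing the other inputs would give genuinely random trajectories.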

Controller Design

[Diagram: system G in feedback with controller K; $x_{t+1} = A x_t + B u_t$, $u_t = \pi_t(\tau_t)$]

A dynamical system is connected in feedback with a controller that tries to get the closed loop to behave.

Actions are decided based on observed trajectories, $\tau_t = (u_1, \ldots, u_{t-1}, x_0, \ldots, x_t)$.

A mapping from trajectory to action is called a policy, $\pi_t(\tau_t)$.

Optimal control: find the policy that minimizes some objective.

Optimal control

[Diagram: system G with input u and state $x_t$]

$$\begin{aligned}
\text{minimize} \quad & \mathbb{E}_e\left[\sum_{t=1}^{T} C_t(x_t, u_t)\right] \\
\text{s.t.} \quad & x_{t+1} = f_t(x_t, u_t, e_t) \\
& u_t = \pi_t(\tau_t)
\end{aligned}$$

$C_t$ is the cost. If you maximize, it's called a reward.
$e_t$ is a noise process.
$f_t$ is the state-transition function.
$\tau_t = (u_1, \ldots, u_{t-1}, x_0, \ldots, x_t)$ is an observed trajectory.
$\pi_t(\tau_t)$ is the policy. This is the optimization decision variable.
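One way to read the objective is as a quantity that can be estimated by simulation: fix a policy, roll the system forward, and average the accumulated cost over noise draws. The sketch below does this for a scalar system. The dynamics `f`, the quadratic `cost`, the gain -0.5, and the noise level are all assumptions chosen for illustration, and the sum runs over `horizon` steps starting from $x_0$ rather than over the exact index range on the slide.

```python
import random


def rollout_cost(f, cost, policy, x0, horizon, sample_noise):
    """Accumulate C(x_t, u_t) along one simulated trajectory of length `horizon`."""
    xs, us, total = [x0], [], 0.0
    for _ in range(horizon):
        u = policy(us, xs)          # the policy sees the trajectory so far
        total += cost(xs[-1], u)
        xs.append(f(xs[-1], u, sample_noise()))
        us.append(u)
    return total


def expected_cost(f, cost, policy, x0, horizon, sample_noise, num_rollouts):
    """Monte Carlo estimate of the expected total cost under the noise process."""
    totals = [rollout_cost(f, cost, policy, x0, horizon, sample_noise)
              for _ in range(num_rollouts)]
    return sum(totals) / num_rollouts


# Hypothetical problem data (assumptions, not from the talk).
def f(x, u, e):
    return x + u + e                # state-transition function


def cost(x, u):
    return x * x + u * u            # quadratic cost


def policy(us, xs):
    return -0.5 * xs[-1]            # static feedback on the latest state


deterministic_total = rollout_cost(f, cost, policy, 1.0, 2, lambda: 0.0)

rng = random.Random(0)
estimate = expected_cost(f, cost, policy, 1.0, 20,
                         lambda: rng.gauss(0.0, 0.1), num_rollouts=200)
```

Optimal control then amounts to searching over `policy` for the smallest such expected cost; this sketch only evaluates one fixed policy and does not perform that search.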

Optimal control

[Diagram: system G in feedback with controller K]

$$x_{t+1} = F(u_t, u_{t-1}, u_{t-2}, \ldots), \qquad u_t = \pi_t(\tau_t)$$

A dynamical system is connected in feedback with a controller that tries to get the closed loop to behave.

Optimal control: find the policy that minimizes some objective.

Major challenge: how to perform optimal control when the system is unknown?

Today: Reinvent RL attempting to answer this question.

[Figure: an HVAC unit conditioning a room with a sensor; the room is the state and the HVAC setting is the action. Two models are shown: a fluid-dynamics PDE (momentum balance) and the lumped thermal model $M \dot{T} = \dot{Q} + \dot{m}_s c_p (T_s - T)$.]

Identify everything: PDE control, high-performance aerodynamics
Identify a coarse model: model predictive control
We don't need no stinking models!: reinforcement learning, PID control?

We need robust fundamentals to distinguish these approaches.

But PID control works

[Figure: Bode diagram, magnitude (dB) versus frequency (rad/sec), annotated with the gain crossover point, gain levels of 2 (6 dB) and 0.5 (-6 dB), one decade, and a log-log slope of -1.5]

2 parameters suffice for 95% of all control applications.
How much needs to be modeled for more advanced control?
Can we learn to compensate for poor models, changing conditions?
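For readers who have not seen one written out, a textbook discrete-time PID loop fits in a few lines. This is a generic sketch, not the tuning procedure from the slide: the three-gain form, the gain values, and the first-order plant $x_{t+1} = x_t + \Delta t\,(-x_t + u_t)$ are all assumptions, and the derivative gain is set to zero in the example.

```python
class PID:
    """Textbook discrete-time PID controller (illustrative sketch)."""

    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = None

    def update(self, setpoint, measurement):
        """Return the control input for the current measurement."""
        error = setpoint - measurement
        self.integral += error * self.dt
        if self.prev_error is None:
            derivative = 0.0        # no previous sample on the first call
        else:
            derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return (self.kp * error
                + self.ki * self.integral
                + self.kd * derivative)


# Hypothetical plant (assumption): first-order lag, Euler-discretized.
dt = 0.1
controller = PID(kp=2.0, ki=1.0, kd=0.0, dt=dt)
x = 0.0
history = []
for _ in range(200):
    u = controller.update(setpoint=1.0, measurement=x)
    x = x + dt * (-x + u)
    history.append(x)
```

Note that the controller never uses a model of the plant: it acts only on the measured error, which is the sense in which PID sits at the "no model" end of the spectrum. With these assumed gains the state settles close to the setpoint of 1.0 within the 200 simulated steps.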


Controller Design

[Diagram: system G in feedback with controller K; $x_{t+1} = A x_t + B u_t$]

A dynamical system is connected in feedback with a controller that tries to get the closed loop to behave.

Actions are decided based on observed trajectories.

A mapping from trajectory to action is called a policy.

Optimal control: find the policy that minimizes some objective.