
Peer Reviewed Article

Vol. 7 (2020)

Reward Redistribution as Align-RUDDER: Learning from a Few Demonstrations

Published
2020-02-15

Abstract

To handle difficult tasks with sparse and delayed rewards, reinforcement learning algorithms demand a large number of samples. Complex tasks are frequently broken down hierarchically into sub-tasks. A step in the Q-function corresponds to the completion of a sub-task, at which point the expected return rises. RUDDER was created to identify these steps and then redistribute reward to them, so that reward is given immediately when a sub-task is completed. Learning is significantly accelerated because the problem of delayed rewards is alleviated. Current exploration strategies, such as those used in RUDDER, struggle to find episodes with high rewards in difficult tasks. We therefore assume that high-reward episodes are given as demonstrations and do not need to be found through exploration. The number of demonstrations is typically small, and RUDDER's LSTM model, being a deep learning method, does not learn well from so few examples. We therefore present Align-RUDDER, which is RUDDER with two major changes. First, Align-RUDDER assumes that high-reward episodes are given as demonstrations, which replace RUDDER's safe exploration and lessons replay buffer. Second, we replace RUDDER's LSTM model with a profile model obtained from a multiple sequence alignment of the demonstrations. Bioinformatics has shown that profile models can be built from as few as two demonstrations. Align-RUDDER inherits the concept of reward redistribution, which reduces the delay of rewards and hence accelerates learning. On complex artificial tasks with delayed rewards and few demonstrations, Align-RUDDER outperforms competitors. On the Minecraft ObtainDiamond task, Align-RUDDER is able to mine a diamond, though only infrequently.
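The idea of redistributing reward with a profile model can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each demonstration is already encoded as a string of event symbols, that all demonstrations have the same length and are aligned position by position (no gaps), and that a new episode is compared to the profile at the same positions. The function names `build_profile` and `redistribute`, the uniform background frequency, and the pseudocount are hypothetical choices made for illustration only.

```python
import math
from collections import Counter


def build_profile(demos, alphabet, pseudocount=1.0):
    """Build a log-odds score table (one dict per position) from demos.

    Assumption: demos are equal-length strings that are already aligned,
    so position i in every demo refers to the same stage of the task.
    """
    length = len(demos[0])
    assert all(len(d) == length for d in demos), "demos must be pre-aligned"
    background = 1.0 / len(alphabet)  # assumed uniform background frequency
    total = len(demos) + pseudocount * len(alphabet)
    profile = []
    for pos in range(length):
        counts = Counter(d[pos] for d in demos)
        profile.append({
            sym: math.log(((counts[sym] + pseudocount) / total) / background)
            for sym in alphabet
        })
    return profile


def redistribute(episode, profile, episode_return):
    """Spread the episode return over time steps.

    Each step receives a share proportional to how much it raises the
    profile score of the episode prefix, so the shares sum to the return.
    """
    step_scores = [profile[t][sym] for t, sym in enumerate(episode)]
    total = sum(step_scores)
    if abs(total) < 1e-12:
        # Degenerate case: fall back to a uniform split.
        return [episode_return / len(episode)] * len(episode)
    return [episode_return * s / total for s in step_scores]


if __name__ == "__main__":
    # Two demonstrations suffice to build the profile in this toy setting.
    profile = build_profile(["ABCD", "ABCD"], alphabet="ABCDX")
    print(redistribute("ABCD", profile, episode_return=1.0))
    print(redistribute("AXCD", profile, episode_return=1.0))
```

In this toy setting, an episode that matches the demonstrations receives an equal share of the return at every step, while a step that deviates from the profile (the `X` in `AXCD`) receives a negative share. In both cases the redistributed rewards sum to the original return, so the reward arrives at the steps where sub-tasks are completed rather than only at the end of the episode.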

