
Enough hype! Here's the research on AI-powered feedback


Lecture presentation
ISTELive Content

Session description

Enough AI hype! In this session, our team at the Stanford Graduate School of Education will share original research on AI and its role in providing feedback in K-12 classrooms. Walk away with guidance on how AI should, and should not, be used to provide feedback on student work.

Framework

Our research grows from what we call a "power-affirming" perspective on feedback. You can read more about that in our work here: https://dl.acm.org/doi/10.1145/3657604.3664680. This perspective builds on the work of other leading K-12 feedback researchers, including:

1. Camilla Mutoni Griffiths, Lisel Murdock-Perriera, and Jennifer L Eberhardt. 2023. “Can you tell me more about this?”: Agentic written feedback, teacher expectations, and student learning. Contemporary Educational Psychology 73 (2023), 102145.
2. John Hattie and Helen Timperley. 2007. The power of feedback. Review of educational research 77, 1 (2007), 81–112.
3. Jody S Underwood and Alyson P Tregidgo. 2006. Improving student writing through effective feedback: Best practices and recommendations. Journal of Teaching Writing 22, 2 (2006), 73–98.


Methods

Feedback has the power to support students’ development as agentive writers or to reinforce the authority of the teacher. The degree to which feedback is power-affirming—legitimizing students’ ideas and positioning students as authors in the writing process—is a critical yet challenging construct to implement and measure. As teachers increasingly adopt AI-powered feedback solutions, it is important to understand how such feedback positions students.
In this work, we collect a dataset of 1,012 in-line feedback comments on 7th-10th grade student essays: 612 written by 20 experienced English Language Arts (ELA) teachers and 400 generated by GPT-3. We adapt a framework of power-affirming feedback to annotate and characterize this data, and we train RoBERTa-based models to automatically classify feedback comments along key dimensions of the framework. We find that teacher-written comments are significantly more power-affirming than AI comments and that power-affirming feedback occurs infrequently in both teacher-written and AI-generated feedback. Our results indicate that RoBERTa-based models can effectively measure dimensions of power-affirming feedback, suggesting that computational measures can serve as a scalable method for identifying and facilitating power-affirming feedback.

3.1 Data Sources: To construct our dataset, we collect teacher-written and AI-generated feedback for a set of student essays. We include feedback from both sources to support the downstream applicability of our measures to both human and AI feedback contexts. We sample student essays from the publicly available ASAP-AES dataset [9], a resource frequently cited in automated essay scoring research [17]. The dataset includes persuasive, narrative, and literary analysis essays from middle and high school grade levels. We limit our selection to essays with at least 100 words and employ stratified random sampling based on their scores for a balanced representation of low- and high-scoring essays. We select 300 essays evenly distributed across genres and grade levels.

We recruit 20 experienced (M = 8.25 years) middle and high school English Language Arts (ELA) teachers from the alumni network of a selective professional development fellowship. We instruct teachers to treat the essays as first drafts due for revision and to provide feedback following their usual practices. In an ongoing process, teachers highlight excerpts from the student essays and provide inline feedback comments through a digital interface. We have currently collected 612 comments, with an anticipated total of 2,000 comments by the end of the school year.

We select a subset of 60 essays via simple random sampling and generate 400 inline feedback comments with associated highlighted excerpts using GPT-3-turbo [1] via the OpenAI API [12]. We employ two prompts (Figure 2: base and framework-specific prompting strategies for generating feedback with GPT-3-turbo) to elicit both general and framework-informed feedback strategies. The first prompt specifies the grade level, the ELA instructional context, and the goal of providing feedback to support revision, while the second prompt additionally specifies the goals and characteristics of power-affirming feedback.
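As a rough illustration of this two-prompt generation setup, the sketch below uses the current openai Python client; the model name, prompt wording, and helper function are stand-ins rather than the exact prompts and configuration used in the study.

```python
# Minimal sketch of the base vs. framework-informed prompting setup (illustrative only).
# Assumptions: the modern `openai` Python client; "gpt-3.5-turbo" as a placeholder model
# name; the prompt text paraphrases, rather than reproduces, the study's Figure 2 prompts.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

BASE_PROMPT = (
    "You are an English Language Arts teacher giving inline feedback on a "
    "grade {grade} student's first draft. Comment on specific excerpts to "
    "help the student revise."
)

FRAMEWORK_PROMPT = BASE_PROMPT + (
    " Make the feedback power-affirming: legitimize the student's ideas, "
    "position the student as the author, pose questions, and invite revision "
    "rather than dictating corrections."
)

def generate_inline_feedback(essay: str, grade: int, framework: bool = False) -> str:
    """Request inline feedback comments for one essay under one of the two prompts."""
    system_prompt = (FRAMEWORK_PROMPT if framework else BASE_PROMPT).format(grade=grade)
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": essay},
        ],
    )
    return response.choices[0].message.content
```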

3.2 Annotation: We develop the annotation scheme by adapting Pedersen's [14] framework for power-affirming feedback. We iteratively modify the framework through preliminary coding trials to improve codability on our feedback data. For our power-affirming dimension, we merge Pedersen's characteristics of power-preserving feedback into the power-concealing category to create a single, more robust contrast to power-affirming feedback. For our actionable dimension, we separate not easily actionable and vague (originally power-concealing characteristics) into independently codable inverse labels. For our dialogic & revision-oriented dimension, we specify the type of feedback addressed by the framework by defining inverse labels such as praise and direct grammatical corrections.

We recruit and train two doctoral students specializing in teacher education, both with experience teaching ELA at middle and high school grade levels, to apply our annotation scheme. We prepare an annotation interface, formatted as a spreadsheet, that displays feedback comments, their associated excerpts of student text, and checkboxes for each possible label. The interface additionally provides the full student essay, the essay prompt, and the grade level as context for each item. We assign each of the 1,012 feedback comments to both annotators and obtain inter-rater agreement of Cohen's κ = 0.750 for power-affirming (substantial), κ = 0.499 for actionable (moderate), and κ = 0.840 for dialogic & revision-oriented (near-perfect).
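The agreement check itself is a standard Cohen's kappa calculation per dimension; the sketch below assumes a hypothetical annotation file and column naming scheme.

```python
# Sketch of the inter-rater agreement calculation (file and column names are hypothetical).
import pandas as pd
from sklearn.metrics import cohen_kappa_score

annotations = pd.read_csv("annotations.csv")  # one row per feedback comment

for dimension in ["power_affirming", "actionable", "dialogic_revision_oriented"]:
    kappa = cohen_kappa_score(
        annotations[f"{dimension}_annotator1"],
        annotations[f"{dimension}_annotator2"],
    )
    print(f"{dimension}: Cohen's kappa = {kappa:.3f}")
```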

3.3 Modeling: We derive a single label for each coded dimension of each comment. To quantify the extent to which a feedback comment is power-affirming, we convert the ratings to numeric values on a continuous scale. We standardize these values within each annotator's set to account for between-rater variability in labeling tendencies and compute the average of the two annotators' scores for each comment. A comment is categorized as actionable if neither annotator labeled it vague or not easily actionable. A comment is categorized as dialogic & revision-oriented if neither rater labeled it as not dialogic or not revision-oriented.

We use our annotated data to fine-tune pre-trained language models [11]. For our power-affirming model, we fine-tune a RoBERTa-base regression model to predict, for each new feedback pair, a PA score on a continuous scale between 0 and 1, where larger values represent more power-affirming language. For our actionable and dialogic & revision-oriented models, we fine-tune RoBERTa-base binary classification models to predict whether a feedback pair meets the specification of the dimension. We train all models using the Simple Transformers library [15] for 5 epochs with a batch size of 8. We validate the accuracy of our models through 5-fold cross-validation.
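A minimal sketch of this fine-tuning setup with Simple Transformers follows. The hyperparameters (5 epochs, batch size 8) come from the description above; the input file and column names are illustrative assumptions, not the study's actual data layout.

```python
# Sketch of the RoBERTa fine-tuning setup with Simple Transformers.
# Assumptions: an annotated CSV with hypothetical columns text_a (highlighted excerpt),
# text_b (feedback comment), pa_score, and actionable.
import pandas as pd
from simpletransformers.classification import ClassificationModel, ClassificationArgs

data = pd.read_csv("annotated_feedback.csv")

# Power-affirming model: RoBERTa-base regression over (excerpt, comment) pairs.
reg_args = ClassificationArgs(num_train_epochs=5, train_batch_size=8, regression=True)
pa_train = data.rename(columns={"pa_score": "labels"})[["text_a", "text_b", "labels"]]
pa_model = ClassificationModel("roberta", "roberta-base", num_labels=1, args=reg_args)
pa_model.train_model(pa_train)

# Actionable model: RoBERTa-base binary classifier (0/1 labels); the dialogic &
# revision-oriented model is trained the same way on its own labels.
clf_args = ClassificationArgs(num_train_epochs=5, train_batch_size=8)
act_train = data.rename(columns={"actionable": "labels"})[["text_a", "text_b", "labels"]]
act_model = ClassificationModel("roberta", "roberta-base", args=clf_args)
act_model.train_model(act_train)
```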


Results

(Please note that, at the time of submission, we are actively conducting additional research that we may want to include in this session, given that it is almost a year away. That research will build on the work shared here.)

Power-affirming feedback occurs infrequently, even among experienced teachers. Across the annotated corpus, we find an average PA-score of 0.269, reflecting a strong inclination towards power-concealing feedback. With only 19% of feedback classified as more power-affirming (scoring above 0.5), the data support Pedersen's finding about the persistent norm of teacher authority [14], even when feedback is delivered in a friendly tone. For experienced teachers, we find an average PA-score of 0.346, with 34% of feedback classified as more power-affirming.

AI-generated feedback tends to be more revision-oriented and actionable but also far more power-concealing. We report summary statistics for each coded dimension in Table 2. We find that AI-generated feedback includes 16% more comments labeled as actionable and 47% more comments labeled as dialogic and revision-oriented. Teacher-written feedback may target additional objectives not centered by our measured dimensions, often including direct corrections, praise, and commentary unrelated to revision. However, teacher feedback is significantly more power-affirming than AI-generated feedback, with an average PA-score more than twice that of feedback generated with a base prompt. Although enhanced prompt engineering improves the average PA-score of AI feedback from 0.148 to 0.257, its upper bound remains stagnant, and only teacher-written feedback includes instances in the more power-affirming range. Despite reports that ChatGPT can approximate human feedback quality [3, 19], its positioning of student authority and agency is one measured dimension along which AI-generated feedback falters, with the potential for harm.
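Summary statistics of this kind reduce to a short per-source aggregation over the labeled comments; the sketch below assumes hypothetical file and column names rather than the study's actual Table 2 pipeline.

```python
# Sketch of per-source summary statistics over labeled feedback (hypothetical columns:
# source in {"teacher", "ai_base", "ai_framework"}, pa_score, actionable, dialogic_revision).
import pandas as pd

df = pd.read_csv("labeled_feedback.csv")

summary = df.groupby("source").agg(
    mean_pa_score=("pa_score", "mean"),
    share_power_affirming=("pa_score", lambda s: (s > 0.5).mean()),
    share_actionable=("actionable", "mean"),
    share_dialogic_revision=("dialogic_revision", "mean"),
)
print(summary.round(3))
```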

4.2 Modeling
We find that the RoBERTa-based models can effectively measure dimensions of power-affirming feedback, indicating that computational measures can serve as a scalable method for identifying power-affirming feedback. The comparable performance of the models across feedback sources supports their generalizability across human and generated feedback contexts. Our power-affirming model achieves a strong positive Spearman correlation 𝜌 of 0.815 (𝑝 < 0.001) with human expert labels. The dialogic & revision-oriented model achieves a high accuracy of 0.944 and a macro-F1 score of 0.878. In contrast, the actionable model shows weaker results with an accuracy of 0.722 and a macro-F1 of only 0.485. This result reflects the lower inter-rater agreement for the dimension and suggests that further specification is required to more effectively capture its complexities.
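The reported metrics correspond to standard scipy and scikit-learn calls; in the sketch below, the predictions and labels are placeholders that would come from the held-out folds of the 5-fold cross-validation described above.

```python
# Sketch of the evaluation metrics reported above.
from scipy.stats import spearmanr
from sklearn.metrics import accuracy_score, f1_score

def evaluate_pa_regression(predicted_scores, human_scores):
    """Spearman correlation between predicted PA scores and annotator-derived scores."""
    rho, p_value = spearmanr(predicted_scores, human_scores)
    return rho, p_value

def evaluate_binary_dimension(true_labels, predicted_labels):
    """Accuracy and macro-F1 for the actionable or dialogic & revision-oriented classifier."""
    return (
        accuracy_score(true_labels, predicted_labels),
        f1_score(true_labels, predicted_labels, average="macro"),
    )
```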


Importance

This work has direct implications for the quality of automated feedback tools and for facilitating teachers' feedback practices. Given the relative scarcity of power-affirming feedback, even from experienced teachers, our measures can be applied to improve the power-affirming quality of both AI-generated and teacher-written feedback. Our measures also present opportunities for continuous evaluation of large language models in education applications. By applying the measures as benchmarks to generated feedback data, practitioners can adapt prompting and tuning strategies to better align with educational goals. Further, to raise the upper bound of AI-generated feedback, we intend to use model-based reinforcement learning to train agents for feedback generation and rewrite tasks. We plan to leverage our models during training to reward power-affirming, actionable, dialogic, and revision-oriented feedback. Finally, we intend to partner with districts and teacher education programs to explore opportunities to share our research and tools in designing professional development focused on writing feedback.
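One way the trained measures could serve as a reward signal is sketched below. This is a speculative illustration of the idea, not the authors' implementation; the weights and function name are hypothetical.

```python
# Speculative sketch: combining the trained measures into a scalar reward for
# reinforcement-learning-based feedback generation. Weights and names are hypothetical.
def feedback_reward(pa_score: float, p_actionable: float, p_dialogic_revision: float,
                    weights: tuple = (0.5, 0.25, 0.25)) -> float:
    """Weighted combination of power-affirming, actionable, and dialogic/revision-oriented scores."""
    w_pa, w_act, w_dia = weights
    return w_pa * pa_score + w_act * p_actionable + w_dia * p_dialogic_revision
```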


References

[1] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems 33 (2020), 1877–1901.
[2] Anderson Pinheiro Cavalcanti, Arthur Diego, Rafael Ferreira Mello, Katerina Mangaroska, André Nascimento, Fred Freitas, and Dragan Gašević. 2020. How good is my feedback? A content analysis of written feedback. In Proceedings of the Tenth International Conference on Learning Analytics & Knowledge. 428–437.
[3] Wei Dai, Jionghao Lin, Hua Jin, Tongguang Li, Yi-Shan Tsai, Dragan Gašević, and Guanliang Chen. 2023. Can large language models provide feedback to students? A case study on ChatGPT. In 2023 IEEE International Conference on Advanced Learning Technologies (ICALT). IEEE, 323–325.
[4] Larry Ferlazzo. 2023. Opinion: How Teachers Are Using ChatGPT in Class. Education Week (July 2023). https://www.edweek.org/technology/opinion-how-teachers-are-using-chatgpt-in-class/2023/07 Accessed: 2024-03-02.
[5] Camilla Mutoni Griffiths, Lisel Murdock-Perriera, and Jennifer L Eberhardt. 2023. "Can you tell me more about this?": Agentic written feedback, teacher expectations, and student learning. Contemporary Educational Psychology 73 (2023), 102145.
[6] John Hattie and Helen Timperley. 2007. The power of feedback. Review of Educational Research 77, 1 (2007), 81–112.
[7] George Hillocks. 2002. The Testing Trap: How State Writing Assessments Control Learning. Teachers College Press.
[8] Stephen Hutt, Allison DePiro, Joann Wang, Sam Rhodes, Ryan S Baker, Grayson Hieb, Sheela Sethuraman, Jaclyn Ocumpaugh, and Caitlin Mills. 2024. Feedback on Feedback: Comparing Classic Natural Language Processing and Generative AI to Evaluate Peer Feedback. In Proceedings of the 14th Learning Analytics and Knowledge Conference. 55–65.
[9] Kaggle. 2013. ASAP Automated Student Assessment Prize - AES. https://www.kaggle.com/c/asap-aes/data Accessed: 2024-02-15.
[10] Rhea Kelly. 2023. Khan Academy Cuts District Price of Khanmigo AI Teaching Assistant, Adds Academic Essay Feature. THE Journal (16 November 2023). https://thejournal.com/Articles/2023/11/16/Khan-Academy-Cuts-District-Price-of-Khanmigo-AI-Teaching-Assistant.aspx Accessed: 2024-02-03.
[11] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv:1907.11692. https://doi.org/10.48550/arXiv.1907.11692
[12] OpenAI. 2020. GPT-3: OpenAI API. https://platform.openai.com/docs/api-reference Accessed: 2024-04-01.
[13] Ikenna Osakwe, Guanliang Chen, Alex Whitelock-Wainwright, Dragan Gašević, Anderson Pinheiro Cavalcanti, and Rafael Ferreira Mello. 2022. Towards automated content analysis of educational feedback: A multi-language study. Computers and Education: Artificial Intelligence 3 (2022), 100059.
[14] Joelle Pedersen. 2018. Revision as Dialogue: Exploring Question Posing in Writing Response. Journal of Adolescent & Adult Literacy 62, 2 (June 2018), 185–194. https://doi.org/10.1002/jaal.885
[15] TC Rajapakse. 2019. Simple Transformers. (2019). Retrieved August 25, 2022 from https://simpletransformers.
[16] Jonathan Rosa and Christa Burdick. 2017. Language ideologies. The Oxford Handbook of Language and Society (2017), 103–123.
[17] Mark D. Shermis. 2014. State-of-the-art automated essay scoring: Competition, results, and future directions from a United States demonstration. Assessing Writing 20 (April 2014), 53–76. https://doi.org/10.1016/j.asw.2013.04.001
[18] Natasha Singer. 2023. In Classrooms, Teachers Put A.I. Tutoring Bots to the Test. The New York Times (26 June 2023). https://www.nytimes.com/2023/06/26/technology/newark-schools-khan-tutoring-ai.html Accessed: 2024-03-02.
[19] Jacob Steiss, Tamara Tate, Steve Graham, Jazmin Cruz, Michael Hebert, Jiali Wang, Youngsun Moon, Waverly Tseng, Mark Warschauer, and Carol Booth Olson. 2024. Comparing the quality of human and ChatGPT feedback of students' writing. Learning and Instruction 91 (2024), 101894.
[20] Jody S Underwood and Alyson P Tregidgo. 2006. Improving student writing through effective feedback: Best practices and recommendations. Journal of Teaching Writing 22, 2 (2006), 73–98.
[21] Rose E Wang, Qingyang Zhang, Carly Robinson, Susanna Loeb, and Dorottya Demszky. 2023. Step-by-Step Remediation of Students' Mathematical Mistakes. arXiv preprint arXiv:2310.10648 (2023).
[22] Su-Youn Yoon, Eva Miszoglad, and Lisa R Pierce. 2023. Evaluation of ChatGPT Feedback on ELL Writers' Coherence and Cohesion. arXiv preprint arXiv:2310.06505 (2023).
[23] Haoran Zhang, Ahmed Magooda, Diane Litman, Richard Correnti, Elaine Wang, LC Matsumura, Emily Howe, and Rafael Quintana. 2019. eRevise: Using natural language processing to provide formative feedback on text evidence usage in student writing. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 9619–9625.
[24] Jiayi Zhang, Ryan S Baker, JM Andres, Stephen Hutt, and Sheela Sethuraman. 2023. Automated Multi-Dimensional Analysis of Peer Feedback in Middle School Mathematics. In Proceedings of the 16th International Conference on Computer-Supported Collaborative Learning (CSCL 2023). International Society of the Learning Sciences, 221–224.


Presenters

PhD Student
Stanford Graduate School of Education
Graduate student
Student
Stanford Graduate School of Education
Graduate student

Session specifications

Topic:

Artificial Intelligence

TLP:

Yes

Grade level:

PK-12

Audience:

Curriculum Designer/Director, Teacher, Technology Coach/Trainer

Attendee devices:

Devices useful

Attendee device specification:

Smartphone: Android, iOS, Windows
Laptop: PC, Chromebook, Mac
Tablet: Android, iOS, Windows

Subject area:

Interdisciplinary (STEM/STEAM), Not applicable

ISTE Standards:

For Educators:
Learner
  • Stay current with research that supports improved student learning outcomes, including findings from the learning sciences.

TLPs:

Develop Expertise

Influencer Disclosure:

This session includes a presenter who indicated a "material connection" to a brand, such as a personal, family, or employment relationship, or a financial relationship. See the individual speaker menu for disclosure information.