Towards Explainable Sign Spotting Systems: an Exploration of Approximative Linguistic Features and Evaluation Methods. Hollain, N. Master's thesis, Radboud University, 2023.
@mastersthesis{hollain:23c,
title = {Towards Explainable Sign Spotting Systems: an Exploration of Approximative Linguistic Features and Evaluation Methods},
author = {Hollain, Natalie},
school = {Radboud University},
year = {2023},
url = {https://www.signlab-amsterdam.nl/publications/Natalie-Hollain-thesis.pdf},
abstract = {This research project carried out an initial exploration into how sign spotting, the task of detecting when a target sign occurs in a given video, can be performed in a more explainable manner. Explainability demands that a system is correct, robust and interpretable to humans [1], [2]. Inspired by the use of domain knowledge to increase the explainability and interpretability of systems in other domains [3], [4], we investigate the possibility of using a knowledge-based approach for sign spotting.
One manner in which knowledge about sign language can be incorporated into sign language systems is through linguistic insights. Current sign spotting systems typically do not make use of such knowledge [5], which limits their interpretability. Similarly, evaluation methods for sign spotting do not draw on linguistic knowledge, resulting in a lack of explainability since they fail to robustly estimate model performance given an incomplete ground truth. Updates to the known ground truth, in particular the addition of challenging sign annotations, can significantly alter the estimated performance. Moreover, current evaluations do not reflect user expectations for sign spotting systems because spottings are allowed to occur after a relevant segment has already started. Users thus have to put in effort, such as rewinding the video, to watch the full relevant segment, which was found not to match user expectations [6].
The goal of this thesis is to address these limitations using a knowledge-based approach. We incorporate linguistic knowledge about sign language into a sign spotting system and evaluation method. We aim to enhance explainability by enabling a sign spotting analysis based on linguistic insights. Furthermore, we develop linguistic features to ensure our model uses knowledge-based inputs as the basis for its decision-making. In this way, we hope to increase the explainability of current methods.
To address the need for explainable sign spotting systems, we implemented features for a sign spotting model that approximate the four basic phonological properties of signs: handshape, orientation, location and movement [7], [8]. Our features are extracted from landmarks, keypoints on the body such as the fingertips and shoulders, which we detected using a landmark detection tool. As far as we are aware, we are the first to implement a sign spotting model that extracts such features from landmarks. By taking the four basic phonological properties into account, we aim to create explainable sign representations for our model to encode. As a result, the linguistic features facilitate a failure analysis of our model.
To address the need for explainable evaluation methods for sign spotting, we developed an evaluation rooted in the concept of tolerance to irrelevance (TTI) [9]. TTI builds on the assumption that users, given an entry point in a video or audio stream, keep watching or listening until their tolerance for irrelevant content is reached. In this way, our evaluation method reflects the effort it takes users to use a sign spotting system.
However, TTI, like existing sign spotting evaluations, relies on a full ground truth to reliably determine a model’s performance, which may not be available for a sign spotting dataset. We address this limitation with a novel approach that assesses model performance using only the most challenging known cases. We call these hardest cases distractors, defined as the signs most similar to the target sign according to a distance measure. In our work, we develop a novel linguistic distance measure to determine the similarity between signs. Through the use of these distractors, we estimate the performance on the full ground truth based solely on the hardest cases from the known ground truth, and assume that this makes our performance estimate robust to the addition of new annotations. We validated this assumption by investigating how updates to the annotations affect the performance estimates of our distractor-based evaluation compared to a baseline evaluation that uses random, as opposed to hard, cases. Our results show that the distractor-based evaluation provides a more conservative estimate of a model’s performance and is comparably robust to changes in the annotations relative to the baseline.
We validated our linguistic features using an empirical analysis, in which we compared the effectiveness of a non-linguistic baseline that uses landmarks directly with that of a model using our more explainable, linguistically motivated features extracted from landmarks. Moreover, we investigated whether a combination of linguistic and baseline landmark features would yield better performance. The conditions were compared using our distractor-based evaluation. We determined that the combination of features provided the best performance, at the cost of linguistic representativeness.}
}