Patrol Data Validation
Some of the questions I get about problems in SMART result from errors in patrol data: erroneous waypoints or invalid tracks. We talked about data validation a long time ago; we should consider it again. We could do a variety of things here, but a very simple check would be to ensure that all points fall within X km of the conservation area boundary.
Ideas for implementation:
-Somewhere (perhaps under conservation area properties), set either an AOI outside of which points would be flagged/deleted/ignored, or a maximum distance from the CA boundary.
-Set a maximum speed from track point to track point, above which points would be flagged/deleted/ignored.
-Implement something akin to a “data cleaning query” interface, where a user can run a query and then select points for deletion/editing.
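The AOI/boundary idea above could be sketched as a simple point-in-polygon test. This is an illustrative, self-contained example: the boundary polygon and waypoints are made-up, and a real implementation would read the CA boundary layer from the SMART database rather than hard-coding coordinates.

```python
# Hypothetical sketch of the "points outside an AOI" check.
# Polygon and waypoints here are illustrative only.

def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: True if (lon, lat) falls inside `polygon`,
    given as a list of (lon, lat) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does the horizontal ray from the point cross this edge?
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

def flag_outside(points, boundary):
    """Return the waypoints that fall outside the boundary polygon."""
    return [p for p in points if not point_in_polygon(p[0], p[1], boundary)]

# Example: a square CA boundary and three waypoints, one clearly bad.
ca_boundary = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
waypoints = [(0.5, 0.5), (0.2, 0.8), (5.0, 5.0)]
print(flag_outside(waypoints, ca_boundary))  # → [(5.0, 5.0)]
```

Supporting a "maximum distance from the CA boundary" instead would mean testing against a buffered polygon rather than the boundary itself.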
I think we'd also want to flag patrols with 0 km distance (i.e. patrols where no tracklog was present).
I would see this as a 'patrol data validation' feature in the patrol window interface (perhaps as an additional tab), as it would primarily be used to validate data just after patrol import from CT or manual patrol data entry.
From Emma's email of Oct 2, 2016, additional ideas for automated data verification:
"i) Checks for patrols with no coordinates and no data (typically errors when using the CT plug-in).
ii) Check for duplications (..similar to) when you add a new employee"
Re: ii) we do check for patrol duplicates now, i.e. if you load the same patrol again, it should warn you that it looks like a duplicate. Maybe this refers to duplicate observations. That could be tricky: you might see many more false duplicates than actual duplicates and just bog down the data-loading process.
RRI Discussion:
This is open-ended and depends on what we implement.
The steering committee identified the speed check as the best indicator for finding bad points: for ground and water patrols, anything that moves faster than 150 km/h is a bad point and should be reported to the importing user as probable bad data. For air patrols, the threshold increases to 1000 km/h.
Ideas:
-Possibly make this all configurable: a maximum speed to check per transport type?
-Check for points outside the conservation area boundary?
-Might need a UI to manually run this validation, and/or a saved configuration for running it automatically[RAB1].
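The speed check discussed above can be sketched quite directly: compute the great-circle distance between consecutive track points, divide by elapsed time, and compare against a per-transport-type threshold. The thresholds below are the steering committee's numbers; the data layout and function names are illustrative assumptions.

```python
import math

# Thresholds from the discussion above; everything else is illustrative.
MAX_SPEED_KMH = {"ground": 150.0, "water": 150.0, "air": 1000.0}

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in km."""
    r = 6371.0  # mean Earth radius
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def flag_fast_segments(track, transport):
    """track: time-ordered list of (lat, lon, epoch_seconds) tuples.
    Returns indices of segments whose implied speed exceeds the threshold."""
    limit = MAX_SPEED_KMH[transport]
    bad = []
    for i in range(len(track) - 1):
        lat1, lon1, t1 = track[i]
        lat2, lon2, t2 = track[i + 1]
        hours = (t2 - t1) / 3600.0
        if hours <= 0:
            bad.append(i)  # non-increasing timestamps are suspect too
            continue
        if haversine_km(lat1, lon1, lat2, lon2) / hours > limit:
            bad.append(i)
    return bad

# Example: the first segment jumps ~111 km in one minute -> flagged.
track = [(0.0, 0.0, 0), (1.0, 0.0, 60), (1.001, 0.0, 120)]
print(flag_fast_segments(track, "ground"))  # → [0]
```

Making `MAX_SPEED_KMH` user-editable per transport type would cover the configurability point above.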
RRI Update: This needs a lot of thought. We would like to know the precise list of checks required, and what should happen when a bad point is found. [In February 2015 Refractions proposed a comprehensive solution – please see the Appendix “Data Validation” of this document. We would want to revisit this and design it properly.]
[RAB1]Yes, I think this needs to be configurable. Speeds will be different for foot, vehicle, air, etc. Also, it would be good to be able to manually assign the layer that serves as the boundary for valid points, rather than having it just be the CA boundary.
From an email dated 3 February 2015:
“we propose making a plug-in that provides a reusable QA framework that includes roughly the following:
-A GUI that provides a list of all available "QA processes" that have been added
-Some tools to select and run these algorithms on a selected set of data (patrols, incidents, missions), using filters for things like transport type and dates to let the user select a subset of the data
-A framework to show the results of these algorithms as a table
-All the plug-in packaging and code necessary to allow additional algorithms to be easily added
-A QA algorithm that allows users to select any number of the built-in shapefile layers (CA boundary, buffer, administrative areas, etc.) and show all points that fall outside of all the selected layers (the default would be the CA boundary + the CA buffer, I imagine). This process would list all the points in the table, and the user could select any number of them and delete them from the SMART database.
-Another QA process users can run that checks the speed of patrols based on time and distance between each point, then highlights any point pairings where the speed was over the user-entered maximum threshold[RAB1].
-The ability to double-click a returned point and have it open the correct patrol and leg, to allow users to inspect the point in more detail and quickly determine whether it is valid[RAB2].
We've estimated the above tasks at about 3 weeks of effort. This would then allow us to quite easily add other algorithms that use the above framework, without any new UI components, which are often time-consuming.
Mapping Option:
We could also include a mapping window in the above framework to allow users to quickly and easily select and highlight points they wish to inspect, and see them on a map in relation to the CA's 5 layers and all the other points found during the QA process. This would add about 1 week of effort. We think there is a reasonable chance this would be a useful tool for any QA process where users need to make a judgment call on validity and want as much information as possible.
Additional QA algorithm:
The other QA process we've discussed in the past is cleaning patrol track data by removing numerous points that are all in the same area, mostly due to a GPS device that was left on while the patrol had stopped to rest or sleep. This algorithm could be added to the above framework, but we are not yet sure of the exact method for detecting and fixing this issue. We estimate about 5 days to detail and implement an approach and add it as an available tool in the QA framework plug-in.”[RAB3]
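The plug-in framework the email sketches — a registry of named "QA processes", each run over selected data and returning rows for a results table — could look roughly like the following. All names and the patrol data layout here are hypothetical; the real SMART plug-in would be built on its Eclipse-based architecture, not on a dictionary.

```python
# Illustrative sketch of a pluggable QA framework: processes register
# themselves under a display name and return result rows for a table.
QA_PROCESSES = {}

def qa_process(name):
    """Decorator that registers a QA check under a display name."""
    def register(fn):
        QA_PROCESSES[name] = fn
        return fn
    return register

@qa_process("Zero-distance patrols")
def zero_distance(patrols):
    # Each returned dict would become one row in the results table.
    return [{"patrol": p["id"], "issue": "no tracklog / 0 km"}
            for p in patrols if p.get("distance_km", 0) == 0]

def run_all(patrols):
    """Run every registered process; collect results keyed by process name."""
    return {name: fn(patrols) for name, fn in QA_PROCESSES.items()}

patrols = [{"id": "P-001", "distance_km": 12.4},
           {"id": "P-002", "distance_km": 0}]
print(run_all(patrols))
```

New checks (boundary, speed, duplicates) would then just be additional decorated functions, which matches the email's point that adding algorithms should require no new UI work.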
[RAB1]I think this is the most important check. To me the main issue is bad GPS points, rather than patrols conducted in the wrong place.
[RAB2]Overall I think this is a good approach. There needs to be significant customizability by transport type (i.e. different thresholds assigned to different speeds). There also needs to be a manual (as described here) and a fully automated option (where QA/outlier detection is performed automatically and outliers are interpolated to something more reasonable).
[RAB3]Yes, a smoothing feature like this would be good. Possibly based on averaging all points within a certain radius to a single point/calculating a centroid.
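The centroid-based smoothing suggested in the comment above might look like this: collapse runs of consecutive points that all sit within a small radius (e.g. a GPS left on overnight at camp) into a single averaged point. The radius, distance approximation, and data shape are all illustrative assumptions, not a settled design.

```python
import math

def collapse_clusters(track, radius_km=0.05):
    """track: list of (lat, lon) points. Consecutive points within
    `radius_km` of the current cluster's first point are averaged
    into a single centroid point."""
    if not track:
        return []
    out = []
    cluster = [track[0]]

    def centroid(pts):
        return (sum(p[0] for p in pts) / len(pts),
                sum(p[1] for p in pts) / len(pts))

    for pt in track[1:]:
        lat0, lon0 = cluster[0]
        # Rough equirectangular distance; adequate at these small scales.
        dlat = math.radians(pt[0] - lat0)
        dlon = math.radians(pt[1] - lon0) * math.cos(math.radians(lat0))
        if 6371.0 * math.hypot(dlat, dlon) <= radius_km:
            cluster.append(pt)
        else:
            out.append(centroid(cluster))
            cluster = [pt]
    out.append(centroid(cluster))
    return out

# Example: three jittered points at a rest stop collapse to one centroid,
# followed by the next genuine track point.
track = [(0.0, 0.0), (0.0001, 0.0001), (0.0002, 0.0), (1.0, 1.0)]
print(collapse_clusters(track))
```

A real implementation would likely also want a minimum dwell time before collapsing, so that a slow but genuine patrol segment is not smoothed away.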