39. Computer Vision Nearby Injection: adding known Top 10 species from the surrounding area

By Kueda
Note the Nearby Injection and recent forum posts

Apparently the Top 100 is used
https://forum.inaturalist.org/t/identification-quality-on-inaturalist/7507

I gave a talk on data quality on iNaturalist at the Southern California Botanists 2019 symposium recently, and I figured some of the slides and findings I summarized would be interesting to everyone, so here goes.

Accuracy of Identifications in Research Grade Observations

Some of you may recall we performed a relatively ad hoc experiment to determine how accurate identifications really are. Scott posted some of his findings from that experiment in blog posts (here and here), but I wanted to summarize them for myself, with a focus on how accurate “RG” observations are, which here I’m defining as obs that had a species-level Community Taxon when the expert encountered them. Here’s my slide summarizing the experiment:

And yes, https://github.com/kueda/inaturalist-identification-quality-experiment/blob/master/identification-quality-experiment.ipynb does contain my code and data in case anyone wants to check my work or ask more questions of this dataset.

So again, looking only at expert identifications where the observation already had a community opinion about a species-level taxon, here’s how accuracy breaks down for everything and by iconic taxon:



Some definitions
  • accurate: identifications where the taxon the expert suggested was the same as the existing observation taxon or a descendant of it
  • inaccurate: identifications where the taxon the expert suggested was not the same as the existing observation taxon and was also not a descendant or ancestor of that taxon
  • too specific: identifications where the taxon the expert suggested was an ancestor of the observation taxon
  • imprecise: identifications where the taxon the expert suggested was a descendant of the observation taxon
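
The definitions above can be sketched as a small classifier. This is a minimal illustration, not the code from the notebook: it assumes each taxon is represented by a hypothetical root-to-taxon list of IDs, and it returns a set of labels because, as worded, an expert ID on a descendant of the observation taxon satisfies both "accurate" and "imprecise".

```python
def classify_identification(expert_ancestry, observation_ancestry):
    """Label an expert ID relative to the existing observation taxon.

    Both arguments are hypothetical root-to-taxon lists of taxon IDs:
    the last element is the taxon itself, earlier elements are ancestors.
    """
    expert_taxon = expert_ancestry[-1]
    observation_taxon = observation_ancestry[-1]

    if expert_taxon == observation_taxon:
        return {"accurate"}
    if observation_taxon in expert_ancestry:
        # The expert's taxon is a descendant of the observation taxon.
        return {"accurate", "imprecise"}
    if expert_taxon in observation_ancestry:
        # The expert's taxon is an ancestor of the observation taxon.
        return {"too specific"}
    # Neither the same taxon, nor an ancestor, nor a descendant.
    return {"inaccurate"}


# Made-up IDs: 1 = family, 10 = genus, 100 and 101 = two species in that genus.
same = classify_identification([1, 10, 100], [1, 10, 100])
finer = classify_identification([1, 10, 100], [1, 10])
coarser = classify_identification([1, 10], [1, 10, 100])
sibling = classify_identification([1, 10, 101], [1, 10, 100])
```

With these inputs, `same` is labelled accurate, `finer` is both accurate and imprecise, `coarser` is too specific, and `sibling` is inaccurate.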

Close readers may already notice a problem here: my filter for “RG” observations is based on whether or not we think the observation had a Community Taxon at species level at the time of the identifications, while my definitions of accuracy are based on the observation taxon. Unfortunately, while we do record what the observation taxon was at the time an identification gets added, we don’t record what the community taxon was, so we can’t really differentiate between RG obs and obs that would be RG if the observer hadn’t opted out of the Community Taxon. I’m assuming those cases are relatively rare in this analysis.

Anyway, my main conclusions here are that

  • about 85% of Research Grade observations were accurately identified in this experiment
  • accuracy varies considerably by taxon, from 91% accurate in birds to 65% accurate in insects

In addition to the issues I already raised, there were some serious problems here:



Since I was presenting to a bunch of Southern California botanists, I figured I’d try repeating the analysis assuming some folks in the audience were infallible experts, so I exported identifications by jrebman, naomibot, and keirmorse (all SoCal botanists I trust) and made the same chart:

jrebman has WAY more IDs in this dataset than either of the other two botanists, and he’s added way more identifications than were present in the 2017 Identification Quality Experiment. I’m not sure if he’s infallible, but he’s a well-established systematic botanist at the San Diego Natural History Museum, so he’s probably as close to an infallible identifier as we can get.

Anyway, note that we’re a good 8-9 percentage points more accurate here. Maybe this is due to a bigger sample, maybe this is due to Jon’s relatively unbiased approach to identifying (he’s not looking for Needs ID records or incorrectly identified records, he just IDs all plants within his regions of interest, namely San Diego County and the Baja peninsula), maybe this pool of observations has more accurate identifiers than observations as a whole, maybe people are more interested in observing easy-to-identify plants in this set of parameters (doubtful). Anyway, I find it interesting.

That’s it for identification accuracy. If you know of papers on this or other analyses, please include links in the comments!

Accuracy of Automated Suggestions

I also wanted to address what we know about how accurate our automated suggestions are (aka vision results, aka “the AI”). First, it helps to know some basics about where these suggestions come from. Here’s a schematic:

The model is a statistical model that accepts a photo as input and outputs a ranked list of iNaturalist taxa. We train the model on photos and taxa from iNaturalist observations, so the way it ranks that list of output taxa is based on what it’s learned about what visual attributes are present in images labeled as different taxa. That’s a gross over-simplification, of course, but hopefully adequate for now.

The suggestions you see, however, are actually a combination of vision model results and nearby observation frequencies. To get those nearby observations, we try to find a common ancestor among the top N model results (N varies with each new model, but in this figure N = 3). Then we look up observations of that common ancestor within 100km of the photo being tested. If there are observations of taxa in those results that weren’t in the vision results, we inject them into the final results. We also re-order suggestions based on their taxon frequencies.
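
The combination step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not iNaturalist's implementation: `common_ancestor`, `suggest`, `ancestry_of`, and `nearby_observations` are hypothetical names, the 100km lookup is replaced by a pre-built list of nearby observed taxa, and the re-ordering rule (nearby frequency first, original vision rank as tie-breaker) is a guess at what "re-order based on taxon frequencies" could mean.

```python
from collections import Counter


def common_ancestor(ancestries):
    """Return the deepest taxon shared by all root-to-taxon lists, or None."""
    shared = None
    for level in zip(*ancestries):
        if len(set(level)) != 1:
            break
        shared = level[0]
    return shared


def suggest(vision_results, ancestry_of, nearby_observations, top_n=3):
    """Combine vision results with nearby observation frequencies.

    vision_results: list of (taxon, score), best score first.
    ancestry_of: dict mapping taxon -> root-to-taxon list (assumed lookup).
    nearby_observations: taxa observed near the photo, one entry per
        observation (stand-in for the real 100km query).
    """
    vision_taxa = [taxon for taxon, _ in vision_results]
    ancestor = common_ancestor([ancestry_of[t] for t in vision_taxa[:top_n]])

    # Count only nearby observations that fall under the common ancestor.
    frequency = Counter(
        taxon for taxon in nearby_observations
        if ancestor is not None and ancestor in ancestry_of[taxon]
    )

    # Inject nearby taxa that the vision model did not return.
    injected = [taxon for taxon in frequency if taxon not in vision_taxa]
    candidates = vision_taxa + injected

    # Assumed re-ordering: most frequent nearby first, vision rank breaks ties.
    rank = {taxon: i for i, taxon in enumerate(candidates)}
    return sorted(candidates, key=lambda t: (-frequency[t], rank[t]))


ancestry_of = {
    "Quercus lobata": ["Plantae", "Quercus", "Quercus lobata"],
    "Quercus agrifolia": ["Plantae", "Quercus", "Quercus agrifolia"],
    "Quercus kelloggii": ["Plantae", "Quercus", "Quercus kelloggii"],
    "Quercus berberidifolia": ["Plantae", "Quercus", "Quercus berberidifolia"],
    "Pinus sabiniana": ["Plantae", "Pinus", "Pinus sabiniana"],
}
vision_results = [
    ("Quercus lobata", 0.5),
    ("Quercus agrifolia", 0.3),
    ("Quercus kelloggii", 0.1),
]
nearby_observations = (
    ["Quercus agrifolia"] * 5
    + ["Quercus berberidifolia"] * 2
    + ["Pinus sabiniana"] * 9
)

ancestor = common_ancestor([ancestry_of[t] for t, _ in vision_results])
suggestions = suggest(vision_results, ancestry_of, nearby_observations)
```

In this made-up example the common ancestor is Quercus, so the nearby pine is ignored, Quercus berberidifolia is injected even though the vision model never returned it, and Quercus agrifolia moves to the top because it is the most frequently observed nearby oak.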

So with that summary in mind, here’s some data on how accurate we think different parts of this process are.

Model Accuracy (Vision only)



There are a lot of ways to test this, but here the inputs are photos of taxa the model was trained on, exported at the time of training but held out of that training, and “accuracy” is how often the model recommends the right taxon for those photos as the top result. We’ve broken that down by iconic taxon and by number of training images. I believe the actual data points here are taxa and not photos, but Alex can correct me on that if I’m wrong.

So main conclusions here are

  1. Median accuracy is between 70 and 85% for taxa the model knows about
  2. Accuracy varies widely within iconic taxa, and somewhat between iconic taxa
  3. Number of training images makes a difference (generally more the better, with diminishing returns)
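
If the data points really are taxa rather than photos, the median figure would come from something like the following. This is a sketch under that assumption, with made-up held-out examples and a hypothetical helper name; it is not the evaluation code behind the chart.

```python
from collections import defaultdict
from statistics import median


def per_taxon_top1_accuracy(examples):
    """Top-1 accuracy for each taxon.

    examples: iterable of (true_taxon, ranked_predictions) pairs, where
    ranked_predictions is the model output with the best guess first.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for true_taxon, ranked in examples:
        totals[true_taxon] += 1
        if ranked and ranked[0] == true_taxon:
            hits[true_taxon] += 1
    return {taxon: hits[taxon] / totals[taxon] for taxon in totals}


# Made-up held-out photos for three taxa the model was trained on.
held_out = [
    ("A", ["A", "B"]),
    ("A", ["B", "A"]),
    ("B", ["B", "A"]),
    ("B", ["B", "C"]),
    ("C", ["A", "C"]),
]

accuracy_by_taxon = per_taxon_top1_accuracy(held_out)
median_accuracy = median(accuracy_by_taxon.values())
```

Taking the median over taxa rather than over photos keeps heavily photographed taxa from dominating the summary.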

Overall Accuracy (Vision + Nearby Obs)



This chart takes some time to understand, but it’s the results of tests we perform on the whole system, varying by method of defining accuracy (top1, top10, etc) and common ancestor calculation parameters (what top YY results are we looking at for determining a common ancestor, what combined vision score threshold do we accept for a common ancestor).

My main conclusions here are

  1. The common ancestor, i.e. what you see as “We’re pretty sure it’s in this genus,” is very accurate, like in the 95% range
  2. Top1 accuracy is only about 64% when we include taxa the model doesn’t know about. That surprised me b/c anecdotally it seems higher, but keep in mind this test set includes photos of taxa the model doesn’t know about (i.e. it cannot recommend the right taxon for those photos), and I’m biased toward seeing common stuff the model knows about in California
  3. Nearby observation injection helps a lot, like 10 percentage points in general
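
To make the top1 versus top10 distinction concrete, here is a minimal sketch of top-k accuracy on a made-up test set (with k = 3 so the toy lists stay short). The function name and data are assumptions for illustration. The last photo is of a taxon "Z" that the model never outputs, which shows how unknown taxa pull the overall number down.

```python
def top_k_accuracy(examples, k):
    """Fraction of examples whose true taxon is among the first k suggestions."""
    examples = list(examples)
    hits = sum(1 for true_taxon, ranked in examples if true_taxon in ranked[:k])
    return hits / len(examples)


# (true taxon, ranked suggestions); "Z" is unknown to the model.
test_set = [
    ("A", ["A", "B", "C"]),
    ("B", ["A", "B", "C"]),
    ("C", ["C", "A", "B"]),
    ("A", ["B", "C", "A"]),
    ("Z", ["A", "B", "C"]),
]
known_only = [example for example in test_set if example[0] != "Z"]

top1_all = top_k_accuracy(test_set, k=1)
top3_all = top_k_accuracy(test_set, k=3)
top1_known = top_k_accuracy(known_only, k=1)
```

Here top-1 accuracy is lower on the full set than on the known-taxa subset, because the photo of "Z" can never be scored as correct.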

Conclusions

  1. Accuracy is complicated and difficult to measure
  2. What little we know suggests iNat RG observations are correctly identified at least 85% of the time
  3. Vision suggestions are 60-80% accurate, depending on how you define “accurate,” but more like 95% if you only accept the “we’re pretty sure” suggestions

Hope that was interesting! Another conclusion was that I’m a crappy data scientist and I need to get more practice using iPython notebooks and the whole Python data science stack.

https://forum.inaturalist.org/t/identification-quality-on-inaturalist/7507

  • AI Model 7 . July 2021

    The number of taxa included in the model went from almost 25,000 to over 38,000. That’s an increase of 13,000 taxa compared to the last model, which, to put in perspective, is more than the total number of bird species worldwide. The number of training photos increased from 12 million to nearly 21 million.
    Accuracy
    Accuracy outside of North America has improved noticeably in this model. We suspect this is largely due to the nearly doubling of the data driving this model in addition to recent international growth in the iNaturalist community. We’re continuing to work on developing a better framework for evaluating changes in model accuracy, especially given tradeoffs among global and regional accuracy and accuracy for specific groups of taxa.

    The recent changes removing non-nearby taxa from suggestions by default have helped reduce this global-regional accuracy tradeoff, but there’s still more work to do to improve how computer vision predictions are incorporating geographic information.
    https://www.inaturalist.org/blog/54236-new-computer-vision-model
    Participate in the annual iNaturalist challenges: Our collaborators Grant Van Horn and Oisin Mac Aodha continue to run machine learning challenges with iNaturalist data as part of the annual Computer Vision and Pattern Recognition conference. By participating you can help us all learn new techniques for improving these models.

    Start building your own model with the iNaturalist data now: If you can’t wait for the next CVPR conference, thanks to the Amazon Open Data Program you can start downloading iNaturalist data to train your own models now. Please share with us what you’ve learned by contributing to iNaturalist on Github.

  • BIODIV Next
    A presentation on how a BE-NL (Belgian-Dutch) model was put together: https://observation.org/download/Biodiv%20Next%20-%20Dutch_Belgian%20species%20ID%20.pptx
    A hierarchical model ensemble is more accurate than a single model, possibly because with 16,000 species too many choices have to be made (Inception-v1, Inception-v3, Inception-v4, ResNet-18, ResNet-34, ResNet-101, GoogLeNet, BN-NIN, GG-10)
    Performance vs. occurrence

  • https://www.inaturalist.org/blog/archives/2022/05
    https://www.inaturalist.org/blog/66531-we-ve-passed-100-000-000-verifiable-observations-on-inaturalist
    https://www.inaturalist.org/blog/63931-the-latest-computer-vision-model-updates
  • AI Model 8 . May 2022

    In 2017 the number of recognised species was 20,000 and now it is still... 20,000?
    https://www.inaturalist.org/pages/help#cv-taxa
    FWIW, there's also discussion and some additional charts at https://forum.inaturalist.org/t/psst-new-vision-model-released/10854/11
    AI Model 5 . July 2019 included 16,000 taxa and 12 million training photos.
    AI Model 6 . July 2020 included 25,000 taxa and xx million training photos.
    AI Model 7 . July 2021 included 38,000 taxa and 21 million training photos.
    AI Model 8 . May 2022: in the training job started in October 2021, we planned to train a model on 47,000 taxa and 25 million training images but finished with over 55,000 taxa and over 27 million training images.
    March 2020
    https://www.inaturalist.org/blog/31806-a-new-vision-model
    July 2021, Model 7
    https://www.inaturalist.org/posts/54236-new-computer-vision-model
    2022
    https://www.inaturalist.org/blog/63931-the-latest-computer-vision-model-updates
    https://stackoverflow.com/questions/44860563/can-vgg-19-fine-tuned-model-outperform-inception-v3-fine-tuned-model
    Sept 2022, Model 9
    https://www.inaturalist.org/blog/69958-a-new-computer-vision-model-including-4-717-new-taxa
    Oct 2022, Model 10
    https://www.inaturalist.org/blog/71290-a-new-computer-vision-model-including-1-368-new-taxa-in-37-days

    Model v1.3 (Nr 11), October 2022, has 66,214 taxa, up from 64,884. https://www.inaturalist.org/blog/71290-a-new-computer-vision-model-including-1-368-new-taxa-in-37-days

    This new model (v1.3) is the second we’ve trained in about a month using the new faster approach, but it’s the first with a narrow ~1 month interval between the export of the data it was trained on and the export of the data the model it is replacing (v1.2) was trained on. The previous model (v1.2) was replacing a model (v1.1) trained on data exported in April so there was a 4 month interval between these data exports (interval between A and B in the figure below). This 4 month interval is why model 1.2 added ~5,000 new taxa to the model. The new model (v1.3) was trained on data exported just 37 days after the data used to train model 1.2 (interval between B and C in the figure below) and added 1,368 new taxa.



    https://nofreehunch.org/2023/08/09/image-classification-in-the-real-wild/
    https://forum.inaturalist.org/t/what-i-learned-after-training-my-own-computer-vision-model-on-inats-data/44052
    Looks like the web app shut down some time back. I have restarted it and updated the links.
    Just in case, this is the address: http://35.224.94.168:8080/ (the IP address should not change)
    This app visualizes model predictions for a Computer Vision model trained on iNaturalist data.

    You can read more about how this model was trained here

    Here is a rough guide to using this app:

    Look at the predictions on a random image from the validation set
    Look at the Accuracy Summary for different taxonomic groups
    For example, to look at the summary by Kingdom
    I personally find the summary by Order most useful
    Look at the errors at different levels in the taxonomic hierarchy.
    For example, to look at errors where the model got the Kingdom wrong!
    For example, to look at errors where the model got the Species wrong
    This is a personal project by Satyajit Gupte
    http://35.224.94.168:8080/about
    I would be happy to hear anything you have to say. You can reach me at gupte.satyajit@gmail.com or on iNat
    https://nofreehunch.org/2023/07/24/make-the-most-of-your-gpu/
    https://nofreehunch.org/2023/03/22/ads-auction/
    https://nofreehunch.org/about-me/

    https://forum.inaturalist.org/t/better-use-of-location-in-computer-vision-suggestions/915/41

    Google provides three models that have been trained with iNaturalist data - classification models for plants, birds, and insects. These Google models can be downloaded and used with Google's TensorFlow and TensorFlow Lite tools.
    https://techbrij.com/setup-tensorflow-jupyter-notebook-vscode-deep-learning

    Posted by optilete 9 months ago

    Further Reading

    1. The Recipe from PyTorch
    2. A nice paper on tuning hyper-parameters. The same author also came up with cyclical learning rates.
    3. Trivial Auto Augment
    4. How label smoothing helps
    5. CutMix, another clever augmentation strategy, which I did not try out.
    6. Geo Prior Model that encodes location and time
    7. How biologists think about classification! This is a very good read.

  • Posted on December 1, 2020 06:19 PM by ahospers

    Comments

    We're currently training a new model based on an export in September that had ~18 million images of 35k+ taxa. It's running with the same setup that we've used on previous models, but with a lot more data, so it will probably take ~210 days and be done some time next Spring. We're simultaneously experimenting with an updated system (TensorFlow 2, Xception vs Inception) that seems to be much faster, e.g. it seems like it might do the same job in 40-60 days, so if it seems like the new system performs about the same as the old one in terms of accuracy, inference speed, etc., we might just switch over to that and have a new model deployed in January or February 2021.

    FWIW, COVID has kind of put a hitch in our goal of training 2 models a year. We actually ordered some new hardware right before our local shelter in place orders were issued, and we didn't feel the benefit of the new hardware outweighed the COVID risk of spending extended time inside at the office to assemble everything and get it running. Uncertainty about when it would be safe to do so was part of why we didn't start training a new model in the spring (that and the general insanity of the pandemic), but eventually we realized things weren't likely to get much better any time soon so we just started a new training job on the old system.

    The Academy is actually open to the public again now, with fairly stringent admission protocols for both the public and staff, so we might decide to go in and build out that new machine, but right now we're still continuing with this training job and experimenting with newer, faster software at home.

    https://www.inaturalist.org/posts/59122-new-vision-model-training-started

    Posted by ahospers more than 3 years ago

    From 38,000 taxa (July 2021) to 47,000 taxa (Oct-Dec 2021), and from 21 million (July 2021) to 25 million (Oct-Dec 2021) training photos.
    https://www.inaturalist.org/posts/59122-new-vision-model-training-started

    Posted by ahospers almost 3 years ago

