AFreeCA: Annotation-Free Counting for All

Adriano D'Alessandro, Ali Mahdavi-Amiri, Ghassan Hamarneh

Simon Fraser University

ECCV 2024

Abstract

Object counting methods typically rely on manually annotated datasets. The cost of creating such datasets has restricted the versatility of these networks to counting objects from specific classes (such as humans or penguins), and counting objects from diverse categories remains a challenge. The availability of robust text-to-image latent diffusion models (LDMs) raises the question of whether these models can be used to generate counting datasets. However, while LDMs struggle to create images with an exact number of objects based solely on text prompts, they can provide a dependable sorting signal by adding and removing objects within an image. Leveraging this data, we first introduce an unsupervised sorting methodology to learn object-related features, which are subsequently refined and anchored for counting purposes using counting data generated by LDMs. We further present a density classifier-guided method for dividing an image into patches containing objects that can be reliably counted. Consequently, we can generate counting data for any type of object and count it in an unsupervised manner. AFreeCA outperforms other unsupervised and few-shot alternatives and is not restricted to specific object classes for which counting data is available.

The Challenge of Generating Counting Data with Latent Diffusion Models

Traditional object counting methods rely heavily on manually annotated datasets, which are costly and time-consuming to create. Text-to-image latent diffusion models (LDMs) like Stable Diffusion offer a promising alternative by generating synthetic images from text prompts, potentially reducing the need for manual annotations. However, these models struggle to produce images with an exact number of objects as specified, leading to inconsistencies between the intended and actual counts. Figure 1 illustrates this issue: when a prompt specifies a certain object count (e.g., 20), Stable Diffusion often generates images with a similar but incorrect number of objects. Moreover, as the prompt count increases, the relative error between the true count and the prompt count also increases, indicating that synthetic data becomes less reliable for higher object counts. This highlights the challenge of using synthetic data from LDMs for accurate object counting.
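To make this failure mode concrete, the following minimal sketch shows how such prompt-count images can be generated with the Hugging Face diffusers library. The model id, prompt template, and object category are illustrative assumptions, not the paper's exact setup:

import torch
from diffusers import StableDiffusionPipeline

# Illustrative model choice; any Stable Diffusion checkpoint works similarly.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

def generate_counting_image(category: str, prompt_count: int):
    """Ask the LDM for an image containing `prompt_count` objects.

    In practice the returned image often contains a similar but
    incorrect number of objects, and the relative error grows as
    `prompt_count` increases (see Figure 1).
    """
    prompt = f"a photo of {prompt_count} {category}"  # assumed template
    return pipe(prompt).images[0]

image = generate_counting_image("apples", 20)
image.save("apples_20.png")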


Figure 1. Left: When given a prompt count of 20, Stable Diffusion outputs images with a similar but often incorrect object count. Right: As the prompt count increases, the relative error between the true underlying count and the prompt count also increases.

Methodology

Our methodology leverages synthetic data generated by latent diffusion models (LDMs) to enable object counting across diverse categories. We first generate reliable synthetic sorting data by adding and removing objects in images, creating image triplets ordered by object count. These triplets are used to train a sorting network that learns robust count-related features. We then use LDMs to generate noisier synthetic counting data that fine-tunes this network, anchoring the learned sorting features to actual count values (a sketch of both steps follows). To remain accurate on dense images, we employ a density classifier that partitions images into manageable patches. Figure 2 illustrates our workflow: starting from simple prompts, we create synthetic training data and train a sorting model, a density classifier, and a count anchoring network. During inference, the density classifier guides the partitioning of dense images, allowing precise counting even in regions with high object density. This approach yields accurate counts across varied object categories without any manual annotations.
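The sorting-then-anchoring idea can be sketched in PyTorch. Everything below (the backbone choice, margin, pairwise loss formulation, and L1 anchoring loss) is an assumption made for illustration, not the authors' released implementation:

import torch
import torch.nn as nn
from torchvision.models import resnet18

class SortingNet(nn.Module):
    """Maps an image to a scalar score that should increase with object count."""
    def __init__(self):
        super().__init__()
        backbone = resnet18(weights=None)
        backbone.fc = nn.Linear(backbone.fc.in_features, 1)  # scalar count score
        self.backbone = backbone

    def forward(self, x):
        return self.backbone(x).squeeze(-1)

model = SortingNet()
rank_loss = nn.MarginRankingLoss(margin=0.1)  # margin is an assumed hyperparameter
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

def sorting_step(x_less, x_mid, x_more):
    """One step on a triplet ordered by count: score(less) < score(mid) < score(more)."""
    s1, s2, s3 = model(x_less), model(x_mid), model(x_more)
    target = torch.ones_like(s1)  # first argument of rank_loss should score higher
    loss = rank_loss(s2, s1, target) + rank_loss(s3, s2, target)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

def anchoring_step(x, noisy_prompt_count):
    """Fine-tune so the score matches the (noisy) count from the LDM prompt."""
    loss = nn.functional.l1_loss(model(x), noisy_prompt_count)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()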


Figure 2. Our strategy involves three distinct steps supported by a synthetic training signal extracted from Stable Diffusion.
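At inference time, the density-classifier-guided partitioning can be pictured as a recursive quadrant split. In this sketch, density_clf (returning True when a patch is sparse enough to count reliably) and count_net are hypothetical stand-ins for the trained classifier and the anchored counting network:

def count_with_partitioning(image, density_clf, count_net, min_size=64):
    """Recursively split dense patches into quadrants, then sum the counts.

    `image` is a tensor shaped [..., H, W]; `min_size` (an assumed
    threshold) stops the recursion on very small patches.
    """
    h, w = image.shape[-2:]
    if density_clf(image) or min(h, w) <= min_size:
        return count_net(image)  # patch is countable as-is (or too small to split)
    hs, ws = h // 2, w // 2
    quadrants = [
        image[..., :hs, :ws], image[..., :hs, ws:],
        image[..., hs:, :ws], image[..., hs:, ws:],
    ]
    return sum(
        count_with_partitioning(q, density_clf, count_net, min_size)
        for q in quadrants
    )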

BibTeX

@article{d2024afreeca,
  title={AFreeCA: Annotation-Free Counting for All},
  author={D'Alessandro, Adriano and Mahdavi-Amiri, Ali and Hamarneh, Ghassan},
  journal={arXiv preprint arXiv:2403.04943},
  year={2024}
}