Counting in the Age of Crowds

Learn how to count large crowds in milliseconds with Artificial Intelligence

Daniel Carter
26 min read · Mar 2, 2021

Introduction

The impact of crowds on history is undeniable. Throughout time, the world has seen the rise of religions, the fall of empires, and the birth of nations, all marked by the rising activity of crowds. Today, we are witnessing a resurgence of crowds, driven by a variety of socioeconomic and political factors. While it’s impossible to predict the future, it’s clear that we are in the midst of an era of heightened crowd activity. Only time will reveal what lies ahead.

Recent advancements in Artificial Intelligence and Computer Vision have given us new ways to view ourselves, in all of our glory and imperfection. In this article I’ll go over how I used AI to quickly and accurately view one of the most important features of a crowd: its size. This technology, along with other tech, may someday give us a clearer picture of how crowd activity influences the world.

If you just want to see cool crowd counting pictures of recent events, scroll down to the Counting Recent Events section. If you want to know how this is actually done, stay tuned.

Quick Overview

The goal of this project was to quickly and accurately count crowds in images and videos. To accomplish this goal, I trained an encoder-decoder neural network that takes a crowd image as an input and returns a density map of the crowd.

Figure 1: Crowd image and its corresponding density map

The density map gives us a useful visualization of head location, head size and overall crowd density. The integral of the density map is the crowd count, but more on this in a minute.

Now that you know the basic concept, let’s get into the details.

Making Sense of the Data

Before training the model, it is essential that we preprocess the data in a way that gives the model the best chance at making the correct prediction.

For training data, I used the popular ShanghaiTech crowd dataset. You can download the data here. This dataset is separated into parts A and B. The A dataset contains larger crowd sizes and the B dataset contains smaller crowd sizes.

The ShanghaiTech dataset gives us everything we need to get started — over a thousand images and their ground truth annotations (i.e. the coordinates of every human head in the image).

Figure 2: ShanghaiTech image with ground truth points

Each ground truth point x_i in the image can be represented by a Dirac delta function:

\delta(x - x_i)

Thus an image with N heads labeled can be represented by:

H(x) = \sum_{i=1}^{N} \delta(x - x_i)
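
To make this concrete, here is a minimal Python sketch of turning the annotated head coordinates into a discrete delta map. It is a hypothetical helper (not the dataset's own tooling) and assumes the annotations have already been loaded as (x, y) pairs.

```python
import numpy as np

def heads_to_delta_map(points, height, width):
    """Place a unit impulse (discrete Dirac delta) at each annotated head.

    points: iterable of (x, y) head coordinates (column, row), assumed to
            come from the dataset's ground truth annotations.
    """
    delta_map = np.zeros((height, width), dtype=np.float32)
    for x, y in points:
        col, row = int(round(x)), int(round(y))
        if 0 <= row < height and 0 <= col < width:
            delta_map[row, col] += 1.0  # one head -> one unit of "mass"
    return delta_map

# The sum of the delta map is already the head count:
# delta_map.sum() == number of annotated heads inside the frame
```

Summing this array already gives the head count; the Gaussian Kernels described next spread each unit of mass over a realistic head-sized area.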

Gaussian Kernels

The building blocks for this project are Gaussian Kernels. These kernels are based on the famous Gaussian function, which describes a normal distribution in statistics.

Figure 3: Gaussian Function

Interestingly, if you take a vector of outputs from the Gaussian function and multiply it by the transpose of the same output vector, you get a Gaussian Kernel matrix.

Figure 4: Gaussian Kernel

The Gaussian Kernel is most often used to smooth or blur images through convolution. However, for a crowd counting project, Gaussian Kernels are useful for two reasons: 1) they are roughly circular, like the shape of a human head, and 2) the entries of a normalized Gaussian Kernel sum to one.

Let’s talk about the second point in a bit more detail. The academic literature refers to the crowd count as the integral of the density map. In calculus, integrals are most commonly thought of as the area/volume under a curve/surface, but it’s weird to think in those terms when we are looking at a flat image.

It makes a lot more sense when we give each pixel in the Gaussian Kernel a height. Now we can clearly see the Gaussian Kernel shape. The volume under this surface equals one — a great way to count exactly one head in an image.

Figure 5: Gaussian Kernel Surface

The Gaussian Kernel can be represented by the function:

G_\sigma(x, y) = \frac{1}{2\pi\sigma^2} \exp\!\left(-\frac{x^2 + y^2}{2\sigma^2}\right)
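
As a quick sketch, here is one way to build such a kernel in NumPy: take the outer product of a 1D Gaussian with itself, then normalize so the entries sum to exactly one. The function name and the three-sigma truncation are my own conventions, not anything prescribed by the article.

```python
import numpy as np

def gaussian_kernel(sigma, radius=None):
    """Build a 2D Gaussian kernel as the outer product of a 1D Gaussian
    with itself, then normalize so the entries sum to exactly one."""
    if radius is None:
        radius = int(3 * sigma)          # common truncation at ~3 sigma
    x = np.arange(-radius, radius + 1)
    g = np.exp(-(x ** 2) / (2 * sigma ** 2))   # 1D Gaussian samples
    kernel = np.outer(g, g)                    # vector times its transpose
    return kernel / kernel.sum()               # volume under the surface = 1

k = gaussian_kernel(sigma=4)
print(k.shape, k.sum())   # (25, 25), sums to (essentially) 1.0
```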

Geometry-Adaptive Kernels

To give our model a better chance of predicting accurate crowd sizes, we need a way to correct for the perspective distortion that occurs from representing a 3D scene with a 2D image. In other words, depending on perspective, some heads will be small and some will be large, and we need a way to convey that information to the model.

The idea of geometry-adaptive kernels was first introduced in the 2016 paper Single-Image Crowd Counting via Multi-Column Convolutional Neural Network. The idea behind geometry-adaptive kernels is simple: heads appearing smaller in an image tend to be closer to nearest neighbors and heads appearing larger tend to be further from nearest neighbors.

Let’s revisit the image we saw above, except this time we’ll focus on only a few heads. Each ground truth point has red arrows drawn to its three nearest neighbors. The yellow circle represents the size of the Gaussian Kernel. You’ll notice that when the average distance to the nearest neighbors shrinks, so does the radius of the Gaussian Kernel.

Figure 6: Size of the Gaussian Kernel depends on average distance to nearest neighbors

Obviously this method is not perfect as there could be a small head in the background far from other heads. But it still adds some interesting context.

If we denote the distances from head x_i to its k nearest neighbors as:

\{d_1^i, d_2^i, \ldots, d_k^i\}

Then the average distance is:

\bar{d}^i = \frac{1}{k} \sum_{j=1}^{k} d_j^i

The average distance is then multiplied by a constant (β) to arrive at the spread (σ), and therefore the radius, of the Gaussian Kernel:

\sigma_i = \beta \, \bar{d}^i

The number of nearest neighbors and β can be tinkered with. For my preprocessing, I used 3 nearest neighbors and 0.5 for β.
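
Here is a minimal sketch of that calculation using scipy's KDTree. The function name is my own, and the defaults simply mirror the choices above (k = 3, β = 0.5).

```python
import numpy as np
from scipy.spatial import KDTree

def adaptive_sigmas(points, k=3, beta=0.5):
    """For each head, sigma = beta * (average distance to its k nearest
    neighbors), following the geometry-adaptive kernel idea."""
    points = np.asarray(points, dtype=np.float64)
    tree = KDTree(points)
    # query k+1 neighbors because the closest "neighbor" of a point is itself
    # (assumes the image contains at least k+1 annotated heads)
    distances, _ = tree.query(points, k=k + 1)
    avg_dist = distances[:, 1:].mean(axis=1)   # drop the self-distance
    return beta * avg_dist
```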

Density Maps

Now that we’ve learned about Gaussian Kernels and how to dynamically adjust their size, let’s take a look at how we can use them to build density maps. Density maps are what we want the neural network to reproduce. Density maps not only give us the crowd count but a useful visualization.

To build a density map, we first start with a 2D array (matrix) of zeros with the height and width of the original image. For each ground truth coordinate, we calculate the appropriate radius of the Gaussian Kernel based on its nearest neighbors and then add that kernel to the array centered on the ground truth point.

When heads in the image overlap, so do the Gaussian Kernels. This means that places in the image that have a high density of heads will have higher pixel values. And, as you can see from the density map below, higher pixel values are represented by the dark red spots. Giving the model the context of these varying densities will help it count heads that are slightly obscured.

Figure 7: Image and ground truth density map

We can get the entire crowd count by taking the integral of the density map. Again, if we want to visualize why the integral works, we can give each pixel value in the density map a height. The crowd count is equal to the volume under the surface.

Figure 8: Density Map Surface

Bringing together everything that we have defined so far, the density map can be represented by:

F(x) = \sum_{i=1}^{N} \delta(x - x_i) * G_{\sigma_i}(x)

Every ground truth point (represented by the Dirac delta function) is convolved with a geometry-adaptive Gaussian Kernel. The academic literature refers to this as a convolution and, although it technically is, it can be confusing when reading a paper on Convolutional Neural Networks. Remember, this convolution is simply adding the geometry-adaptive Gaussian Kernel into a 2D array at the ground truth point. We’ll get into the other kind of convolutions in a minute.
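
Putting the pieces together, a simple (and deliberately unoptimized) density map builder might look like the sketch below. It assumes the per-head sigmas have already been computed (for example with the hypothetical adaptive_sigmas helper above) and leans on scipy's gaussian_filter to do the "convolution with a Gaussian Kernel" for each head.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def build_density_map(points, sigmas, height, width):
    """Sum of geometry-adaptive Gaussian kernels, one per annotated head.
    `sigmas` holds the per-head spreads (e.g. from the k-NN sketch above)."""
    density = np.zeros((height, width), dtype=np.float32)
    for (x, y), sigma in zip(points, sigmas):
        col, row = int(round(x)), int(round(y))
        if not (0 <= row < height and 0 <= col < width):
            continue
        impulse = np.zeros((height, width), dtype=np.float32)
        impulse[row, col] = 1.0
        # convolving a unit impulse with a normalized Gaussian spreads one
        # unit of "mass" around the head location
        density += gaussian_filter(impulse, sigma, mode='constant')
    return density

# density.sum() is (approximately) the number of heads in the image
```

In practice these ground truth maps are typically generated once and cached, since building them is the slowest part of preprocessing.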

Input Image Normalization

Before we get to the modeling, let’s quickly go over one more step in preprocessing. The density map is what we want the model to replicate, but we first have to input an image.

The pixel values of a color image typically range from 0 to 255. However, when dealing with neural networks, it’s better to use smaller numbers because of the exploding gradient problem. To get smaller numbers, we divide every pixel value in the image by 255. This gives us a pixel value range from 0 to 1.

Next we must consider the separate parts of a color image. A color image is actually three channels (matrices) of pixels, one channel for each primary color in the visible light spectrum — red, green and blue.

Figure 9: RGB channels of color image

To finish normalization, we subtract a mean from each pixel to center the data around 0, then we divide each pixel by a standard deviation to 'standardize' the spread of the data. The mean and standard deviation depend on which channel we are dealing with. It is common practice to use the RGB means and standard deviations of a very large dataset called ImageNet. If this concept is still a bit confusing, read this post, which discusses the idea in more detail.
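
In code, the whole normalization step is only a few lines. The mean and standard deviation values below are the widely used ImageNet statistics (the same ones shipped with torchvision); the function is an illustrative sketch that assumes an H x W x 3 RGB image stored as uint8.

```python
import numpy as np

# Standard ImageNet channel statistics (RGB order), as used by most
# pretrained vision models.
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def normalize_image(image_uint8):
    """Scale pixels to [0, 1], then standardize each RGB channel."""
    image = image_uint8.astype(np.float32) / 255.0        # 0-255 -> 0-1
    return (image - IMAGENET_MEAN) / IMAGENET_STD          # per-channel
```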

Building a Model to Predict Crowd Counts

In this section I will be going over the specifics of the neural network I used for this project. I will describe the model at a high level for people who are already familiar with Artificial Intelligence, but also delve into some of the basics for newcomers.

The model I used for this project came from the 2018 paper Scale Aggregation Network for Accurate and Efficient Crowd Counting. The SANet is an encoder-decoder Neural Network. During the model’s training, the encoder takes in an image of a crowd and learns the features of the image. After the encoder learns the image features, that information is passed to the decoder and the decoder attempts to reproduce the ground truth density map.

Architecture

The encoder (top half of Figure 10) is based on the Inception Network, which is a popular Convolutional Neural Network architecture. The encoder takes the color image as input. As you can see, each convolutional layer is broken down into four parts. In each part's label, the first number is the kernel size and the second number is the depth of the channels. For example, "Conv3_16" means the kernel size is 3x3 and there are 16 channels/feature maps.

Figure 10: SANet Architecture (source)

The Inception architecture was designed to cut down on computational cost, but in this case it has another benefit. In crowd images, humans appear in multiple sizes and positions. With four different kernel sizes, the model can get a multi-scale perspective of the input image because, essentially, there are multiple sized windows scanning for features in the image.

The convolutions use “same” padding so the dimension of the output is not reduced during the operation. The data from every convolution is passed through an activation function. The channels for each of the four convolutional parts are combined with a concatenation layer, then the information contained in the feature maps is summarized in a max pooling layer.

The decoder (bottom half of Figure 10) is what produces the density map and is a bit simpler in structure. There are five standard convolutional layers (not Inception-like) and three transposed convolutional layers. Each convolutional layer, both standard and transposed, is followed by an activation function.

If things are still fuzzy, don’t worry. I will try to clear things up by going into greater detail below.

Convolutional Layers

Above I spoke a lot about convolutions, but what does that mean in this context? Convolutional layers are where the model learns the details (features) of an image.

At a low level, the concept behind convolutional layers is quite simple. Take a minute to study the movements of the GIF in Figure 11.

Think of the large blue matrix on the left as an image. The numbers inside the image are the pixel values. The shaded window moving across the image is a 3x3 kernel. The numbers inside the kernel are called weights. The green matrix on the right is the output, or feature map.

Figure 11: Convolution operation in a convolutional layer (source)

The math to produce the output is easy — it’s element-wise multiplication and then addition. Consider the top-left position of the kernel on the image. If we multiply element-wise and then add everything together, we get:

(3x0) + (3x1) + (2x2) + (0x2) + (0x2) + (1x0) + (3x0) + (1x1) + (2x2) =

0 + 3 + 4 + 0 + 0 + 0 + 0 + 1 + 4 =

12

This number corresponds to the top-left pixel on the feature map.
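
You can verify the arithmetic yourself with the window and kernel values from the example above:

```python
import numpy as np

# The top-left 3x3 window of the image and the kernel from Figure 11
window = np.array([[3, 3, 2],
                   [0, 0, 1],
                   [3, 1, 2]])
kernel = np.array([[0, 1, 2],
                   [2, 2, 0],
                   [0, 1, 2]])

print((window * kernel).sum())   # element-wise multiply, then add -> 12
```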

So, why are we doing this again?

By adjusting the weights in the kernel we can uncover information in an image that will help the model make accurate predictions. Let’s look at a simple but real example to try to make more sense of this concept.

Consider this kernel and its weights:

Figure 12: Sobel Kernel for vertical edge detection

Instead of sliding this kernel across the blue matrix we saw above, we’ll slide it over this image:

Figure 13: Example input image for simple convolution

And here is the output (feature map):

Figure 14: Feature map showing vertical edges from input image

We can see that the convolution of the kernel and the image gives us an output that highlights the edges of objects in the image — in this case people. Also notice how the edges are more pronounced in the vertical direction.

Let’s change the weights in the kernel and do the convolution again. This time we’ll use the following weights:

Figure 15: Sobel Kernel for horizontal edge detection

And here is the output:

Figure 16: Feature map showing horizontal edges from input image

Again, edges are being highlighted, but this time the edges are more pronounced in the horizontal direction.

By changing the weights inside the kernel, we can retrieve useful information from the image. This is what helps the neural network determine what's in the image — for this project, how many heads are in an image and where those heads are located.
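
If you want to try this at home, the sketch below applies the two Sobel kernels from Figures 12 and 15 to a grayscale image using scipy. The weights are the standard Sobel values under one common sign convention, and the function name is my own.

```python
import numpy as np
from scipy.ndimage import convolve

# Sobel kernels: one emphasizes vertical edges, the other horizontal edges
sobel_vertical = np.array([[-1, 0, 1],
                           [-2, 0, 2],
                           [-1, 0, 1]], dtype=np.float32)
sobel_horizontal = sobel_vertical.T

def edge_maps(gray_image):
    """Convolve a grayscale image (2D float array) with both Sobel kernels."""
    v_edges = convolve(gray_image, sobel_vertical)
    h_edges = convolve(gray_image, sobel_horizontal)
    return v_edges, h_edges
```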

It’s important to keep in mind that the above is a very simple example of what happens in a convolutional layer. When a Convolutional Neural Network is training, what it’s actually doing is changing the kernel weights to improve model accuracy, and the weight combinations get very complex. The kernels could be detecting edges, shapes or patterns. It’s hard to say. In fact, understanding what happens inside of neural networks (interpretability) is still a major area of research.

Another thing to consider is that a convolution can have multiple kernels, which produces multiple outputs. These outputs/feature maps are the channels referred to in the Architecture section above. This is why we see depth to the output of the convolutional layer in Figure 17. These are feature maps stacked together. We'll talk about why there are four sections to the output in a minute.

Figure 17: Output from a convolutional layer showing channel depth

The entirety of the convolutional layer is a bit more complex than what is shown in Figure 10. If we zoom into a convolutional layer and rotate it 90 degrees, we see the following:

Figure 18: Detailed view of a convolutional layer in SANet

When data comes from the previous layer, it is first passed through four separate 1x1 convolutions. It may seem strange to have 1x1 kernels, but this is a great feature of the Inception-like architecture. I won’t go into great detail here, but the 1x1 convolutions are used to reduce the depth of channels which significantly reduces computational cost.

The outputs from the 1x1 convolutions are then passed through activation functions. After that, the data is passed to 3x3, 5x5 and 7x7 convolutions. These different kernel sizes are used to get information from an image on multiple scales. Again, people can appear in many different sizes in an image, so these different size kernels help identify people on multiple scales.

Finally the data is passed through more activation functions. Before the data reaches the next convolutional layer, it goes through a concatenation layer and a max pooling layer.
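
For readers who like code, here is a hedged PyTorch sketch of one such scale aggregation block. The structure follows the description above (1x1 reductions, then 3x3, 5x5 and 7x7 convolutions, a ReLU after every convolution, and concatenation at the end), but the class name and channel counts are illustrative rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ScaleAggregationBlock(nn.Module):
    """One Inception-like encoder block: a 1x1 branch plus 3x3, 5x5 and 7x7
    branches (each preceded by a 1x1 reduction), concatenated at the end."""

    def __init__(self, in_channels, branch_channels):
        super().__init__()
        def conv(cin, cout, k):
            # "same" padding keeps height and width unchanged
            return nn.Sequential(
                nn.Conv2d(cin, cout, kernel_size=k, padding=k // 2),
                nn.ReLU(inplace=True),
            )
        self.branch1 = conv(in_channels, branch_channels, 1)
        self.branch3 = nn.Sequential(conv(in_channels, branch_channels, 1),
                                     conv(branch_channels, branch_channels, 3))
        self.branch5 = nn.Sequential(conv(in_channels, branch_channels, 1),
                                     conv(branch_channels, branch_channels, 5))
        self.branch7 = nn.Sequential(conv(in_channels, branch_channels, 1),
                                     conv(branch_channels, branch_channels, 7))

    def forward(self, x):
        # concatenate the four multi-scale feature maps along the channel axis
        return torch.cat([self.branch1(x), self.branch3(x),
                          self.branch5(x), self.branch7(x)], dim=1)

# Illustrative encoder shape: blocks separated by 2x2 max pooling
encoder = nn.Sequential(
    ScaleAggregationBlock(3, 16), nn.MaxPool2d(2),    # -> 64 channels
    ScaleAggregationBlock(64, 32), nn.MaxPool2d(2),   # -> 128 channels
)
```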

Activation Function, Concatenation & Max Pooling

So we’ve gone over convolutions in detail, but some of the concepts within the convolutional layer may still be a bit fuzzy.

Activation functions determine the strength of the connections in a neural network. If the activation function outputs a relatively large number, the connection to the subsequent part of the network is strong. If the activation function outputs a relatively small number, the connection to the subsequent part of the network is weak.

This concept is easier to visualize in fully connected layers but the idea is the same. Take a look at the fully connected layer in Figure 19. Data is being passed from one set of neurons to the next. Some connections are strong (bold) and some are weak. The strength of the connection depends on the output of the activation function, which is calculated before the data is sent to the next layer of neurons.

Figure 19: Fully-connected layer showing weak and strong connections

You can take this concept and apply it to the convolutional layer in Figure 18. But instead of only thinking in terms of neurons, think in terms of pixels as well. This wiring, along with the kernel weights, is what determines the model’s accuracy. The activation function used throughout the model is ReLU.

After the convolutions and activation functions, we get to a concatenation layer. This layer takes all the feature maps/channels produced by the four multi-scale convolutions and combines them into one. This is why Figure 17 was broken into four sections.

Finally we arrive at the max pooling layer. In the max pooling layer, we slide a kernel along the input like we did for the convolutions. But instead of doing element-wise multiplication, we simply take the maximum value within the window. This does two things: 1) it helps us focus only on the most prominent features within an image and 2) it reduces the dimensionality of the data and thus reduces computational cost. We can see that over the span of the encoder, the height and width dimensions shrink due to max pooling.

Figure 20: Height and width of data shrinks due to max pooling layers
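
A quick PyTorch check shows the effect of a 2x2 max pooling layer on the tensor shape (the tensor here is random and purely illustrative):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 128, 128)         # (batch, channels, height, width)
pooled = nn.MaxPool2d(kernel_size=2)(x)  # keep the max in each 2x2 window
print(pooled.shape)                      # torch.Size([1, 64, 64, 64])
```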

Transposed Convolutions

We have now covered all the major concepts of the encoder part of the neural network. The decoder is simpler, but there is one more important concept we must consider: transposed convolutions.

Think about what happened to the shape of the data throughout the encoder. We reduced the width and height dimensions while learning the features of the image. However our goal isn’t merely to learn what’s in the image — it’s to produce a density map of the crowd. This is where transposed convolutions come into play.

Instead of reducing dimensionality, transposed convolutions increase dimensionality. This operation allows us to take the low resolution feature maps from the encoder and produce a high resolution density map.

To better understand this concept, let’s take a look at a simple example. In Figure 21 we can see an input and a kernel on the left. To start, take the blue (upper left) weight in the kernel and multiply it by every value in the input. That gives us all zeros in the upper left of the output. If we do the multiplication with the red (upper right) weight, we get a copy of the input in the upper right of the output because the red weight is one. If we do the same with the other weights and then do element-wise addition, we get the final output on the right.

Figure 21: Simple transposed convolution example
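
The sketch below implements the same stride-1 transposed convolution by hand, written the way the example describes it: each kernel weight scales a full copy of the input, and the shifted copies are summed element-wise. The input and kernel values are illustrative, not the exact numbers from Figure 21.

```python
import numpy as np

def transposed_conv2d(inp, kernel):
    """Stride-1 transposed convolution: each kernel weight scales a full
    copy of the input, and the shifted copies are summed element-wise."""
    ih, iw = inp.shape
    kh, kw = kernel.shape
    out = np.zeros((ih + kh - 1, iw + kw - 1))
    for a in range(kh):
        for b in range(kw):
            out[a:a + ih, b:b + iw] += kernel[a, b] * inp
    return out

inp = np.array([[1., 2.],
                [3., 4.]])
kernel = np.array([[0., 1.],     # illustrative weights, not Figure 21's
                   [1., 0.]])
print(transposed_conv2d(inp, kernel))   # 2x2 input -> 3x3 output
```

Inside the decoder, PyTorch's nn.ConvTranspose2d performs this same operation with learnable kernel weights, typically with a stride of 2 so that each layer doubles the height and width.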

The transposed convolutional layers are why we see the height and width of the data expand throughout the decoder. Also notice the channel depth is decreasing throughout the decoder as well.

Figure 22: Height and width increasing throughout decoder due to transposed convolutions

There are regular convolutional layers in the decoder, although these layers are not as complex as the convolutional layers in the encoder. The decoder convolutional layers are not multi-scale. There is no concatenation or max pooling either, but each layer, including the transposed convolutions, is followed by a ReLU activation function.

The kernel weights in the decoder are not used to find image features like they are in the encoder, but rather to manipulate the pixel values in a way that most accurately produces the density map.
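
A hedged sketch of what such a decoder could look like in PyTorch is below: five regular convolutions and three transposed convolutions, every one followed by a ReLU, with the channel depth shrinking until a single-channel density map comes out. The kernel sizes and channel counts are illustrative (assuming the encoder hands over 128 feature channels, as in the earlier encoder sketch), not the paper's exact values.

```python
import torch.nn as nn

decoder = nn.Sequential(
    nn.Conv2d(128, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.ConvTranspose2d(64, 64, kernel_size=2, stride=2), nn.ReLU(inplace=True),
    nn.Conv2d(64, 32, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.ConvTranspose2d(32, 32, kernel_size=2, stride=2), nn.ReLU(inplace=True),
    nn.Conv2d(32, 16, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.ConvTranspose2d(16, 16, kernel_size=2, stride=2), nn.ReLU(inplace=True),
    nn.Conv2d(16, 8, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(8, 1, kernel_size=1), nn.ReLU(inplace=True),  # density map out
)
```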

Loss Function

Once a density map is produced by the neural network, how do we know if it’s a good representation of the crowd image? We need a loss function to compare the human-created density maps and the neural-network-created density maps.

As I stated above, the sum of the pixels is the crowd count. So we could just take the sum of the ground truth density map and the sum of the predicted density map and compare the two, right? Think about the model spitting out a condensed blob of pixels onto a density map that just so happens to equal the actual crowd count. It wouldn’t actually represent the location of the heads or the density of the crowd but the count would technically be accurate.

To get a better idea of the problem we need to overcome, consider Figure 23. Each image has the same number of black and white pixels, but the images themselves are radically different. Instead of evaluating the image globally (e.g. summation of pixels), we need to evaluate local patterns.

Figure 23: The images have the same global statistics but are very different locally (source)

This section is going to get a bit “mathy” but I’ll do my best to describe what’s going on in plain English.

The loss function for this model is actually a combination of two loss functions. The first of the two is Euclidean Loss, which is defined as:

L_E = \frac{1}{N} \sum_{i=1}^{N} (x_i - y_i)^2
Euclidean Loss is essentially pixel-wise mean squared error. First we take a pixel value from the ground truth density map (x) and subtract the corresponding pixel value from the predicted density map (y). Then we square the difference to make the error positive and to make large errors result in very large losses. Finally we add all the pixel errors together and divide by the total number of pixels (N) in the density map.
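
In PyTorch this is a one-liner (equivalent to torch.nn.functional.mse_loss with the default mean reduction); the function name is my own.

```python
import torch

def euclidean_loss(pred, gt):
    """Pixel-wise mean squared error between the predicted and ground truth
    density maps (both shaped batch x 1 x H x W)."""
    return torch.mean((pred - gt) ** 2)
```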

The Euclidean Loss is good for evaluating error on a pixel level, but it would be nice to compare larger local areas too. That’s where the Structural Similarity Index (SSIM) comes into play.

The idea of SSIM was first published in the 2004 paper Image Quality Assessment: From Error Visibility to Structural Similarity. It was first developed to measure the difference between an original image and an image that had been distorted by processing, compression, storage, etc. The SSIM ranges from −1 to 1 (in practice usually between 0 and 1) — a value near zero meaning the two images are not similar at all and 1 meaning the two images are perfectly similar.

For the implementation of SSIM, we’ll use our old friends convolutions and Gaussian Kernels. Scanning a window (kernel) across an input — in this case the ground truth and predicted density maps — will allow us to calculate local statistics for each window position.

We can see another convolution example in the Figure 24 GIF. The blue matrix would be the density map (input), the shaded window moving across the density map would be the Gaussian Kernel, and the green matrix is the output. You’ll also notice a border around the input. This is referred to as padding and it ensures the output is the same dimension as the input. The padding was present in the convolutional layers as well.

Figure 24: Convolution with padding (source)

For every possible kernel location on both density maps, the local mean, the local variance and the local covariance are calculated. The formulas used to calculate these statistics may look unusual if you are reading them in the academic literature, but I'll go over them in plain English.

Here is the formula for calculating the local mean:

\mu_F(p) = \sum_{q} W(q) \, F(p + q)
Weird, huh? That’s not the mean formula we typically see. What this is actually doing is using the Gaussian Kernel weights (W) to convolve the predicted density map (F) at every possible kernel position (P). Again, the output is a matrix with the same dimensions as the density map. The above is the mean formula for the predicted density map but the same operation is done with the ground truth density map.

The local variance formula found in the academic literature is:

\sigma_F^2(p) = \sum_{q} W(q) \, \big(F(p + q) - \mu_F(p)\big)^2

However, across almost every statistical tool (Python, Matlab, etc.), this more computationally efficient version is used:

\sigma_F^2(p) = \sum_{q} W(q) \, F(p + q)^2 - \mu_F(p)^2

If you're wondering how this formula is derived, I recommend looking at the explanation here. So to get the local variance, we take the convolution of the Gaussian Kernel weights and the squared density map, then subtract the squared mean of the density map. The above variance formula is for the predicted density map but the same operation is done with the ground truth density map.

Next we calculate the local covariance (Y denotes the ground truth density map). The way it is presented in the academic literature is:

\sigma_{FY}(p) = \sum_{q} W(q) \, \big(F(p + q) - \mu_F(p)\big)\big(Y(p + q) - \mu_Y(p)\big)

But this version is used in the statistical tools:

\sigma_{FY}(p) = \sum_{q} W(q) \, F(p + q) \, Y(p + q) - \mu_F(p) \, \mu_Y(p)

To get local covariance, we first take the convolution of the Gaussian Kernel and the product of the predicted and ground truth density maps, then subtract the product of the predicted and ground truth density map means.

Next we calculate the SSIM value for each pixel by plugging the above statistics into this formula:

SSIM(p) = \frac{\big(2\mu_F\mu_Y + C_1\big)\big(2\sigma_{FY} + C_2\big)}{\big(\mu_F^2 + \mu_Y^2 + C_1\big)\big(\sigma_F^2 + \sigma_Y^2 + C_2\big)}

Although I won’t go into detail here, this formula is a combination of luminance, contrast and structure. The C1 and C2 variables are very small constants to prevent division by zero.

Once SSIM is calculated for each pixel, the per-pixel values are added together and the sum is divided by the total number of pixels (N). To get the loss, we simply subtract the final SSIM value from one. This loss is called Local Consistency Loss and it is defined as:

L_C = 1 - \frac{1}{N} \sum_{p} SSIM(p)

Now that we have Euclidean Loss and Local Consistency Loss, we combine the two to arrive at the final loss function:

L = L_E + \alpha \, L_C

The α is a weight to balance both losses. It was set to 0.001 for this project.
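
Below is a hedged PyTorch sketch of the whole loss: local statistics are computed by convolving both density maps with a normalized Gaussian window, plugged into the SSIM formula, and combined with the pixel-wise squared error. The window size (11), its sigma (1.5) and the C1/C2 constants follow common SSIM defaults rather than anything stated in this article, and the function names are my own.

```python
import torch
import torch.nn.functional as F

def gaussian_window(size=11, sigma=1.5):
    """Normalized 2D Gaussian window shaped (1, 1, size, size) for F.conv2d."""
    coords = torch.arange(size, dtype=torch.float32) - size // 2
    g = torch.exp(-(coords ** 2) / (2 * sigma ** 2))
    g = g / g.sum()
    return torch.outer(g, g).view(1, 1, size, size)

def local_consistency_loss(pred, gt, window, c1=0.01 ** 2, c2=0.03 ** 2):
    """1 - mean(SSIM), with local statistics taken under the Gaussian window.
    pred and gt are density maps shaped (batch, 1, H, W)."""
    pad = window.shape[-1] // 2
    mu_p = F.conv2d(pred, window, padding=pad)   # local means
    mu_g = F.conv2d(gt, window, padding=pad)
    # the "computationally efficient" variance and covariance forms from above
    var_p = F.conv2d(pred * pred, window, padding=pad) - mu_p ** 2
    var_g = F.conv2d(gt * gt, window, padding=pad) - mu_g ** 2
    cov = F.conv2d(pred * gt, window, padding=pad) - mu_p * mu_g
    ssim = ((2 * mu_p * mu_g + c1) * (2 * cov + c2)) / (
        (mu_p ** 2 + mu_g ** 2 + c1) * (var_p + var_g + c2))
    return 1 - ssim.mean()

def combined_loss(pred, gt, window, alpha=0.001):
    """Euclidean (pixel-wise MSE) loss plus alpha times the consistency loss."""
    euclidean = torch.mean((pred - gt) ** 2)
    return euclidean + alpha * local_consistency_loss(pred, gt, window)
```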

Now that I’ve beaten you over the head with the math, let’s relax and talk about what we’re actually doing with this loss function.

The loss function tells the model how well it's performing during training. The model will use the ground truth and predicted density maps to calculate loss, adjust the kernel weights, then calculate loss again to see if performance improved after the adjustment. This happens many, many times during training and is how the model eventually learns to count crowds.
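
In PyTorch, that whole cycle fits in a short training loop. The sketch below assumes a model, a data loader that yields (image, density map) batches, and a loss_fn(pred, gt) such as a wrapper around the combined loss from the previous sketch; the optimizer choice, learning rate and epoch count are illustrative, not the settings used for this project.

```python
import torch

def train(model, loader, loss_fn, epochs=100, lr=1e-4, device="cuda"):
    """Minimal training loop sketch: predict, measure loss, adjust weights,
    repeat. `loader` is assumed to yield (image, density_map) batches."""
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for images, gt_maps in loader:
            images, gt_maps = images.to(device), gt_maps.to(device)
            loss = loss_fn(model(images), gt_maps)   # how wrong are we?
            optimizer.zero_grad()
            loss.backward()    # gradients of the loss w.r.t. every kernel weight
            optimizer.step()   # nudge the weights toward a lower loss
    return model
```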

We can think of this process like a blind person walking along a landscape. The person takes a step in one direction and uses what they know about their last position to determine if their current position is higher or lower in elevation. The person continues to take steps until they believe they've found the lowest point on the landscape. Below is an image of the actual loss landscape of ResNet-56 (a popular neural network). The lowest point is where the model's weights produce the most accurate results.

Figure 25: Loss Landscape of ResNet-56 (source)

Evaluating the Model

The model was trained in a Google Colab Pro notebook using an NVIDIA Tesla P100 GPU and a "High-RAM" virtual machine. Colab Pro allows longer runtimes, but training would still time out overnight. In total, the training took about three days. After that long wait, I can now predict crowd counts in milliseconds.

The average error (i.e. difference between actual and predicted crowd count) was 86.4 for Part A of the test data (larger crowds). The average error for Part B of the test data (smaller crowds) was 16.9. These numbers are good but slightly behind some of the newer models. My goal over the next few months will be to use the newer techniques as well as train on a lot more data to improve accuracy.
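
For reference, the "average error" here is the mean absolute error between predicted and actual counts, which is straightforward to compute once you have the predicted density maps (the function name is my own):

```python
import numpy as np

def mean_absolute_error(pred_maps, gt_maps):
    """Average |predicted count - actual count|; each count is simply the
    sum of the pixels in the corresponding density map."""
    errors = [abs(float(p.sum()) - float(g.sum()))
              for p, g in zip(pred_maps, gt_maps)]
    return float(np.mean(errors))
```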

Let’s look at some real examples from the test data to see this model’s strengths and shortcomings.

We’ll start with the old black and white input image below. For such a large crowd size the model did well. It was able to detect the location of the crowd in the foreground and background. The model could also detect the denser areas of the crowd but had a little trouble with the smaller heads.

The model performs similarly with the image below. The model sees density but has a bit of trouble with the granularity of the crowd in the background and undercounts.

Remember how the Gaussian Kernel radius for the ground truth density maps is based on how close a head was to its nearest neighbors? As I mentioned before, this method is not perfect. Sometimes if there is a small head not close to any other heads, the Gaussian Kernel radius blows up to an unrealistic size. The radius becomes so large that the Gaussian Kernels become too faint to be seen on the ground truth density map. We can see from the example below that the model corrected for these discrepancies during training. The markings in the foreground of the predicted density map are much closer to the actual head sizes.

Below we can see an example of the model's biggest flaw. It finds the crowd but also misidentifies some inanimate objects as people, and thus overcounts. You will also notice the model incorrectly seeing heads in the background of the GIF at the very top of the article. There is an analogous phenomenon in psychology called pareidolia, in which people mistakenly see faces in inanimate objects. I believe this flaw may be overcome with semantic segmentation to block out background objects, although that may overcomplicate the model.

The model performs best when there is good lighting, few background objects and a high camera angle. Both examples below demonstrate the excellent accuracy under these conditions.

Counting Recent Events

Over the past few years we have witnessed an incredible rise in crowd activity. There are many socioeconomic and political topics we could discuss to figure out why this is the case, but let’s save that discussion for another time. Instead, let’s use this new and exciting technology to visualize these phenomena.

In recent years mass shootings have sparked a national debate on gun regulation and public safety. To make their voices heard, gun control advocates took to the streets in early 2018 during the nation-wide March for Our Lives demonstration. Below we can see a crowd of almost 2,700 people gathered in Tampa Bay, Florida. Larger crowds gathered in DC, but I chose this image because it has a good angle and captures a large portion of the entire crowd.

Original Photo Credit: Luis Santana | Tampa Bay Times

Large crowds have also made their way to political rallies in recent years. Below you can see about 2,000 people gathered for a Trump rally. This rally took place in Tampa a few months after the March for Our Lives rally.

Original Photo Credit: James Borchuck | Tampa Bay Times

Crowd activity has not only increased in the United States but all over the world. Below is an image of 3,700 people gathered in Hong Kong in 2019 to protest a law that would allow extradition to mainland China. This is my favorite image because of the granularity the model is able to capture in such a massive crowd. The model even sees the people on the walkway over the street.

Original Photo Credit: South China Morning Post

In late May of 2020, a disturbing video surfaced of a police officer killing an unarmed man named George Floyd. This video caused a nationwide uproar over police brutality and racial injustice. Millions of people across the country took to the streets in support of Floyd and the Black Lives Matter movement. One of the most active cities during these protests was Portland. Below is a picture of about 1,600 people gathered in Portland to support racial justice on June 2nd, 2020.

Original Photo Credit: Chad DeHart | KGW

Below is a top-down image of a crowd in Merrick, NY that gathered to support George Floyd and BLM on June 4th, 2020.

Original Photo Credit: Al Bello | Getty Images

Below is a crowd of a little over 400 people gathered in Atlanta to support George Floyd and BLM on June 6th, 2020. This crowd is smaller but, remember, these crowds were all throughout the United States.

Original Photo Credit: Elijah Nouvelage | Getty Images

Below is a crowd of almost 800 people gathered in Washington DC to support racial justice on June 6th, 2020.

Original Photo Credit: Chip Somodevilla | Getty Images

Below is a crowd of almost 1,300 people gathered in Hollywood to support racial justice on June 7th, 2020.

Original Photo Credit: David McNew | Getty Images

Below is a crowd of 1,200 people gathered in Seattle to support racial justice on July 7th, 2020.

Original Photo Credit: Ruth Fremson | The New York Times

On June 15th of 2020, Chinese and Indian troops clashed over a border dispute in Galwan Valley, leaving soldiers on both sides dead. A recent video released by the Chinese government shows a large group of Indian troops approaching the Chinese troops, so I decided to count them. The model saw 72 troops and the actual count is 64. I find this level of accuracy impressive because the model was never trained on images of people in military gear. Moreover, the colors of the military gear are meant to blend into the background, but the model is still able to spot even the smallest heads in the image. The model does mistakenly see some heads in the rocks, hence the overcounting, but I'm still blown away by the model's performance. This example shows the potential of this technology in Defense and National Security.

Original Photo Credit: Chinese Government

The most recent large crowd event in the United States was one of the most explosive political events in the country’s history. Tens of thousands of people gathered in Washington DC on January 6th, 2021 to protest Donald Trump’s election loss. Below is a portion of the crowd, about 2,500 people, that gathered to hear Trump speak. The model is picking up some background noise but, because it’s such a dense crowd, it may still be undercounting.

Original Photo Credit: Getty Images

After Trump’s speech, things got ugly. Below is an image of over 500 people storming the Capitol building.

Original Photo Credit: Michael Nigro | Pacific Press

Next Steps

You may have noticed that there is a natural limit when counting crowds in images. It’s difficult to capture an event’s total crowd size in a single image. And, if we zoom out to capture the entire crowd, the crowd becomes too dense for our model to count accurately.

There are two ways to overcome this problem. First, the model can and will be trained on denser crowds to improve accuracy on very large crowds. Second, the technology’s ideal platform is a drone. With a drone, I could quickly maneuver to the best vantage point and scan over a crowd that can’t fit in only one frame. This would allow me to count events in their entirety (e.g. the total number of people who attended the January 6th rally/riot).

These are the obvious next steps, but there are many other ideas I have in mind. The ultimate goal is to view humans in ways we’ve never been viewed before so that we get a better idea of where we’re going. And if we get a better idea of where we’re going, maybe we can alter the path.

Conclusion

Thanks for reading!

References

Zhang, Y., Zhou, D., Chen, S., Gao, S., & Ma, Y. (2016). Single-Image Crowd Counting via Multi-Column Convolutional Neural Network. CVPR.

Cao, X., Wang, Z., Zhao, Y., & Su, F. (2018). Scale Aggregation Network for Accurate and Efficient Crowd Counting. ECCV.

Wang, Z., Bovik, A. C., Sheikh, H. R., & Simoncelli, E. P. (2004). Image Quality Assessment: From Error Visibility to Structural Similarity. IEEE Transactions on Image Processing.