How to Use AI to Help Find Civilian Harm

Between February 2022 and September 2025, Bellingcat staff and volunteers collected, geolocated, and shared more than 2,500 incidents of civilian harm following Russia’s full-scale invasion of Ukraine.

As part of this effort, Bellingcat tested a new machine learning model meant to rank Telegram social media posts by their likelihood of containing incidents of civilian harm.

This novel method dramatically reduced the search and selection time required, freeing researchers to focus on verifying incidents of civilian harm, not just searching for them.

This piece documents our methodology, ethical considerations and lessons learned in the hope that others researching similar topics can benefit from our work.

Open source research into civilian harm is still a relatively new field and it presents many challenges. One of the biggest is organising and sorting through the huge amount of user-generated content being produced to find what is relevant.

Machine learning, a form of artificial intelligence that uses algorithms to identify patterns in large amounts of data and make predictions, can make this task more efficient.

With ongoing conflicts involving large amounts of civilian harm occurring in Sudan and much of the Middle East, this guide aims to offer those covering these conflicts an example of how machine learning can be used to help find and sort incidents. You can also access the code notebook for our model here.

We defined “civilian harm” not just as civilian deaths or injuries resulting from armed conflict, but also as the broader and delayed effects on civilians from psychological trauma, loss of livelihood, displacement, destruction of infrastructure and more. This definition was informed by the Protection of Civilians guide on civilian harm.

Initial Telegram Dataset

Each Telegram post containing civilian harm that had already been manually verified by researchers was used to build an initial dataset of confirmed cases of civilian harm, which data scientists call positive instances. We collected a total of 5,848 unique URLs for these Telegram posts. For our manual collection we reviewed posts on relevant Telegram channels, working through oldest to newest posts each day. If a given post made it to our geolocated incidents list, the researcher who flagged it had also looked at the posts that appeared before and after it on Telegram and had not flagged them. We therefore selected the ten posts surrounding each verified civilian harm post as our additional dataset of posts that did not contain civilian harm. After excluding any deleted or duplicate posts, we ended up with 48,545 non-civilian harm posts, our negative instances.
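The neighbour-selection step can be illustrated with a minimal sketch. It is not the code from the notebook: it assumes public post URLs of the form https://t.me/channel/id, assumes that "ten surrounding posts" means five on each side, and uses hypothetical function names and an invented channel.

```python
import re


def surrounding_post_urls(verified_url, window=5):
    """Return URLs for the posts just before and after a verified post.

    Assumption: post IDs within a channel increase by one per post, so the
    neighbours of post N are N-window .. N+window. Deleted posts leave gaps
    that would have to be filtered out after fetching.
    """
    match = re.fullmatch(r"https://t\.me/([^/]+)/(\d+)", verified_url)
    if match is None:
        raise ValueError(f"not a public Telegram post URL: {verified_url}")
    channel, post_id = match.group(1), int(match.group(2))
    neighbour_ids = [
        post_id + offset
        for offset in range(-window, window + 1)
        if offset != 0 and post_id + offset > 0
    ]
    return [f"https://t.me/{channel}/{n}" for n in neighbour_ids]


def build_negative_candidates(positive_urls, window=5):
    """Collect neighbours of every positive, dropping positives and duplicates."""
    positives = set(positive_urls)
    seen = set()
    negatives = []
    for url in positive_urls:
        for candidate in surrounding_post_urls(url, window):
            if candidate in positives or candidate in seen:
                continue
            seen.add(candidate)
            negatives.append(candidate)
    return negatives


# Hypothetical channel with two verified posts that sit close together.
positives = [
    "https://t.me/example_channel/100",
    "https://t.me/example_channel/103",
]
negatives = build_negative_candidates(positives)
```

Because the two example positives are neighbours of each other, their windows overlap, which is why the deduplication step matters.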

The choice to overrepresent negative instances aims to better reflect the real world and to increase the data available for model training.

We enriched each URL with metadata from the Telegram API, such as the time of publication, reactions or text content. As some of these posts had been deleted, we completed the missing data points with previously preserved versions from our Auto Archiver database, which were only available for the positive instances.

Feature Engineering

Training a machine learning model requires numerical data, as these models compute a prediction score based on mathematical operations.

We built these by converting raw data from our initial dataset, such as keywords signalling potential civilian harm, into numerical scores (or “features”) that the model could interpret, with the goal of increasing the model’s ability to identify patterns. This process, known as feature engineering, can significantly improve model results because it allows data scientists to supply explicit context information.

A full list of the features we used to train the model can be found in the code notebook accompanying this piece. Many features were directly inspired by researchers’ input from their experience of manually screening cases of civilian harm by sorting through a set number of Telegram channels and inspecting each post individually.

Several of the features used were built directly from the metadata contained in each Telegram post, including media_type and day_of_week, as well as binary ones: forwarded, edited and reply_to.

Other features included engagement information: views, forwards, total_reactions, and even individual features for the most used emojis, including reaction_crying_face to count 😭 emoji.
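As a rough illustration of how such metadata could be turned into features, the sketch below maps one post to a dictionary using the feature names mentioned above. The input field names (date, forwarded_from, edit_date, reply_to_id and so on) are hypothetical and are not taken from the Telegram API or the notebook.

```python
from datetime import datetime, timezone


def metadata_features(post):
    """Turn one post's metadata (a plain dict) into model-ready features.

    The keys read from `post` are illustrative assumptions; a real export
    would need its own mapping.
    """
    published = datetime.fromtimestamp(post["date"], tz=timezone.utc)
    reactions = post.get("reactions", {})
    return {
        # Categorical value: needs encoding (e.g. one-hot) before training.
        "media_type": post.get("media_type", "none"),
        "day_of_week": published.weekday(),  # Monday is 0, Sunday is 6
        # Binary flags: 1 if the property is present, otherwise 0.
        "forwarded": int(post.get("forwarded_from") is not None),
        "edited": int(post.get("edit_date") is not None),
        "reply_to": int(post.get("reply_to_id") is not None),
        # Engagement counts.
        "views": post.get("views", 0),
        "forwards": post.get("forwards", 0),
        "total_reactions": sum(reactions.values()),
        "reaction_crying_face": reactions.get("😭", 0),
    }


# Invented example post published on Monday 1 January 2024 (UTC).
example_post = {
    "date": datetime(2024, 1, 1, 12, tzinfo=timezone.utc).timestamp(),
    "media_type": "video",
    "forwarded_from": None,
    "edit_date": 1704110500,
    "views": 1200,
    "forwards": 30,
    "reactions": {"😭": 14, "❤": 3},
}
features = metadata_features(example_post)
```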

Converting Text to Numbers

To embed the experience from the manual collection process, researchers put together a list of keywords in both Ukrainian and Russian that, to them, signalled posts likely to show civilian harm. For instance, “Шахед” and “КАБ” translate to “Shahed” and “Guided aerial bomb” respectively. We created a numerical feature to count their frequency.
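A keyword-frequency feature of this kind could look like the sketch below. The two keywords come from the text; the matching rule (case-insensitive, matching the start of a word so that inflected forms also count) is an assumption, not the notebook's rule.

```python
import re

# Only the two keywords quoted in the article; the real list is longer.
KEYWORDS = ["Шахед", "КАБ"]


def keywords_in_text(text, keywords=KEYWORDS):
    """Count words in `text` that start with any keyword, ignoring case.

    Prefix matching is a simplification: it catches inflected forms but can
    also over-count, since a short acronym may be the start of an unrelated
    word.
    """
    prefixes = tuple(keyword.casefold() for keyword in keywords)
    tokens = re.findall(r"\w+", text.casefold())
    return sum(1 for token in tokens if token.startswith(prefixes))


sample = "Шахед влучив у будинок, ще один шахед збито. КАБ"
keyword_count = keywords_in_text(sample)
```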

In addition, we included several generic English-language keywords that meaningfully signalled potential civilian harm, such as “injured”, “school affected” and “hospital affected”. These were only used for generating semantic similarity scores.

A semantic similarity score is a calculation used to determine the proximity in meaning between different words and phrases. To get the semantic similarity between the post text and each of our keywords, we represented each as a list of numbers via a Sentence Transformer model, which converts words into numerical representations called vectors that a computer can work with.

We then calculated the level of similarity between each pair of vectors using cosine similarity, one of the most popular methods for measuring similarity between two pieces of text.

Due to how embeddings work, this calculation results in a figure on a scale from -1 (no semantic proximity) to 1 (same meaning). For example, the words “injury” and “injured” would have a high similarity score, while “residential” and “injured” would have a negative score since the words are not semantically similar.
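The cosine similarity calculation itself is short enough to write out. The sketch below uses tiny invented vectors in place of real Sentence Transformer embeddings, and the helper keyword_similarities is hypothetical. Geometrically, 1 means the vectors point the same way, 0 means they are orthogonal, and -1 means they point in opposite directions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors, in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def keyword_similarities(post_vector, keyword_vectors):
    """Score one post embedding against each keyword embedding."""
    return {
        keyword: cosine_similarity(post_vector, vector)
        for keyword, vector in keyword_vectors.items()
    }


# Toy 3-dimensional "embeddings", invented for illustration only.
post_vector = [0.9, 0.1, 0.0]
keyword_vectors = {
    "injured": [1.0, 0.0, 0.0],
    "hospital affected": [0.0, 1.0, 0.0],
}
similarities = keyword_similarities(post_vector, keyword_vectors)
```

In this toy example the post vector lies much closer to "injured" than to "hospital affected", so the first score is the higher of the two.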

Finally, to enable the model to identify the relevance of each post to civilian harm in Ukraine, we used a multilingual text transformer from the BERT family of language models to represent the entire post’s text as a vector of 768 numerical values. This model can efficiently represent text from many languages in a way that captures meaning: the same sentence in different languages will generate similar embeddings, and trained machine learning models can detect patterns in the embeddings.

It is important to note that for this initial prototype of a civilian harm detection model, we did not include any features derived from media content such as images and videos, although that would be a logical next step in trying to improve model performance.

Selecting, Training and Evaluating Models

With 54,393 rows of 893 numerical features each, we selected four machine learning algorithms to train our predictive models.

We chose Logistic Regression as a baseline algorithm due to its simplicity. We also selected three other “best in class” models: Random Forest, XGBoost, and LightGBM. These choices centred on the interpretability of the models and their ability to work on tabular data of this size. For example, we avoided neural networks due to a lack of interpretability and because these models work best with a larger dataset.

To genuinely assess the performance of the trained models, we split our dataset into three parts:

A training set: the data the models were trained on (60 percent of the full dataset’s rows)
A validation set: used for an intermediate evaluation when tuning model parameters (20 percent of all rows)
A test set: held back for the final performance assessment, so the models were evaluated on unseen data (the remaining 20 percent of rows)

We used a stratified split to divide the dataset instead of a random split. This method ensured the proportion of positive instances (i.e. confirmed cases of civilian harm) remained consistent across all three sets at about 11 percent.
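The 11 percent figure follows from the dataset counts: 5,848 positives out of 54,393 rows is roughly 10.75 percent. A stratified 60/20/20 split can be sketched as below; this is a plain-Python illustration on invented labels, not the splitting code from the notebook, and the seed is arbitrary.

```python
import random


def stratified_split(labels, seed=42):
    """Split row indices 60/20/20 while keeping each class's share constant.

    Each class is shuffled and cut separately, so the proportion of
    positives is the same in every subset (up to rounding).
    """
    rng = random.Random(seed)
    splits = {"train": [], "validation": [], "test": []}
    for label in sorted(set(labels)):
        indices = [i for i, y in enumerate(labels) if y == label]
        rng.shuffle(indices)
        # Integer arithmetic avoids floating-point rounding at the cut points.
        first_cut = len(indices) * 60 // 100
        second_cut = len(indices) * 80 // 100
        splits["train"] += indices[:first_cut]
        splits["validation"] += indices[first_cut:second_cut]
        splits["test"] += indices[second_cut:]
    return splits


# Toy dataset with the same 11 percent share of positives.
labels = [1] * 110 + [0] * 890
splits = stratified_split(labels)
positive_share = {
    name: sum(labels[i] for i in rows) / len(rows)
    for name, rows in splits.items()
}
```

A purely random split would give each subset roughly, but not exactly, the same share of positives; with a rare class that variation can distort the evaluation.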

To measure the performance of the machine learning models, we ran them on the test set and counted the number of correct and incorrect predictions. Models output a likelihood between 0 and 1 that each Telegram post contains civilian harm, and we tried to find a cut-off threshold that strikes a good balance between flagging almost every post (0.1) and flagging very few (0.9).

There are two main types of evaluation metrics to gauge a model’s predictive power. Recall measures what fraction of positive instances (i.e. known civilian harm posts) were correctly flagged as such. Precision measures the fraction of posts flagged as civilian harm that are indeed civilian harm posts.
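The trade-off between the two metrics is easy to see on a toy example. The labels and scores below are invented; the function is a minimal sketch of the standard definitions.

```python
def precision_recall(labels, scores, threshold):
    """Precision and recall when every score >= threshold is flagged."""
    flagged = [score >= threshold for score in scores]
    true_pos = sum(1 for y, f in zip(labels, flagged) if y == 1 and f)
    false_pos = sum(1 for y, f in zip(labels, flagged) if y == 0 and f)
    false_neg = sum(1 for y, f in zip(labels, flagged) if y == 1 and not f)
    flagged_total = true_pos + false_pos
    positive_total = true_pos + false_neg
    precision = true_pos / flagged_total if flagged_total else 0.0
    recall = true_pos / positive_total if positive_total else 0.0
    return precision, recall


# Five invented posts: 1 = civilian harm, 0 = not.
labels = [1, 1, 0, 0, 1]
scores = [0.9, 0.4, 0.6, 0.1, 0.8]

low_cutoff = precision_recall(labels, scores, 0.1)   # flags every post
mid_cutoff = precision_recall(labels, scores, 0.5)
high_cutoff = precision_recall(labels, scores, 0.9)  # flags a single post
```

With the low cut-off nothing is missed but two of the five flagged posts are irrelevant; with the high cut-off the single flagged post is correct but two real cases are skipped.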

During the training phase, we tuned the models to maximise average precision (PR-AUC), a metric that summarises precision across all recall levels. While this method also accounts for precision, it prioritises recall, which is preferable for this use case since it steers model selection towards reducing the number of civilian harm posts that are skipped.
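Average precision can be computed by walking down the ranked list and averaging the precision observed at each positive. The sketch below follows that common definition and assumes there are no tied scores; it is an illustration, not the evaluation code from the notebook. A ranker with no skill is expected to score close to the share of positives in the data, which would be about 0.11 here.

```python
def average_precision(labels, scores):
    """Mean of the precision measured at the rank of each positive instance.

    Assumes scores are distinct; with ties the result would depend on how
    tied items happen to be ordered.
    """
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("average precision needs at least one positive")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank  # precision among the top `rank` posts
    return precision_sum / total_positives


# Invented example: positives are ranked first and third.
ap_mixed = average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
# A perfect ranking puts every positive ahead of every negative.
ap_perfect = average_precision([1, 1, 0, 0], [0.9, 0.8, 0.7, 0.1])
```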

The following table sorts models from best to worst PR-AUC against a baseline of a coin-flip predictor. ROC-AUC and F1 are two other evaluation metrics included as sanity checks. Simply put, ROC-AUC measures the probability of ranking two instances, one negative and one positive, correctly; F1 balances precision and recall equally and is reported at its best cut-off threshold value.

Model test scores comparison: XGBoost stands out in every relevant metric evaluated.

From these results, we selected XGBoost as our final model since it had the best scores across all metrics.
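The two sanity-check metrics described above can also be written out directly. The sketch below computes ROC-AUC as the share of positive-negative pairs that are ranked correctly and searches the observed scores for the cut-off with the highest F1. It runs on invented data, and the pairwise loop is only practical for small samples.

```python
def roc_auc(labels, scores):
    """Share of (positive, negative) pairs where the positive scores higher.

    Ties count as half a correct pair. Quadratic in the number of rows.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC-AUC needs both classes")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def best_f1(labels, scores):
    """Try each observed score as a cut-off and keep the highest F1."""
    best_score, best_threshold = 0.0, None
    for threshold in sorted(set(scores)):
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
        fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
        fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
        denominator = 2 * tp + fp + fn
        f1 = 2 * tp / denominator if denominator else 0.0
        if f1 > best_score:
            best_score, best_threshold = f1, threshold
    return best_score, best_threshold


labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
auc = roc_auc(labels, scores)
f1_value, f1_threshold = best_f1(labels, scores)
```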

Interpreting the Model

Because these models are interpretable, we can understand which features are the most useful when predicting whether a post includes civilian harm. The above table shows the top 10 features that most strongly influence the XGBoost model’s decisions:

semantic_keywords_similarity: the semantic proximity between the post text and manually selected keywords “casualties”, “injury” and “civilian harm”
bert: the model was able to discern meaning from the text with the same strength as some of the other features in this list; there are three instances of this in the top 10
reaction_crying_face: reactions with crying face emojis on the post
group_of_messages: whether a post contains multiple media files
keywords_in_text: the number of custom Ukrainian or Russian keywords in the post

These results generally tally with what you might expect when selecting Telegram posts for instances of civilian harm: posts that generate a lot of emotional engagement and posts using keywords about civilian harm were among those most likely to contain content related to this topic. Not all models had the same top features as XGBoost. In fact, for the Random Forest model the most important feature was the number of crying face emojis present on a post, a soft pattern highlighted by researchers when this system was first imagined.

LLM Results and Comparison

Retroactively, we decided to run a sample of the same test dataset through different large language models (LLMs) to gauge their ability to make these same predictions.

We aimed to include an LLM-generated score as an extra feature for our trained models, which would be picked up as relevant if it correlated with the correct predictions.

To start, we selected two local models, the 1B and 4B variants of Gemma 3 from Google DeepMind, and two cloud-hosted models, Gemini 2.5 Flash and Gemini 3.5 Flash. With this selection, we hoped to compare results across a range of expected model performance.

We generated a 400-row stratified sample (preserving the same proportion of real civilian harm instances) from the test dataset used for the custom models. For each of the four LLMs, we ran two tests: one where only the Telegram post message was sent, and another including both the message and the engineered features (excluding the text embeddings, since the model had direct access to the text). In the prompt for each model, we asked for a score between 0 and 1. We then evaluated the results as we did for the custom models.
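The two test conditions can be pictured as two versions of the same prompt. The sketch below is purely illustrative: the prompt wording, the function names and the parsing rule are all assumptions, and no model is actually called.

```python
import re


def build_prompt(message, features=None):
    """Build a scoring prompt, with or without the engineered features.

    The wording is hypothetical, not the prompt used in the experiment.
    """
    lines = [
        "Rate how likely the following Telegram post is to describe civilian harm.",
        "Answer with a single number between 0 and 1.",
        "",
        "Post:",
        message,
    ]
    if features:
        lines += ["", "Engineered features:"]
        lines += [f"- {name}: {value}" for name, value in sorted(features.items())]
    return "\n".join(lines)


def parse_score(reply):
    """Extract the first number from a model reply and clamp it to [0, 1].

    Returns None when the reply contains no number at all.
    """
    match = re.search(r"\d+(?:\.\d+)?", reply)
    if match is None:
        return None
    return min(1.0, max(0.0, float(match.group())))


text_only_prompt = build_prompt("Example post text")
with_features_prompt = build_prompt(
    "Example post text",
    {"keywords_in_text": 2, "reaction_crying_face": 14},
)
```

Parsing defensively matters because a generative model may wrap the number in extra words; once parsed, the scores can be evaluated with the same metrics as the custom models.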

The above table shows that LLMs can indeed extract value from the engineered features. All four LLMs surpassed the baseline Logistic Regression model in our tests, yet none of them performed better than the other custom-trained models, and XGBoost remained the one with the best PR-AUC.

However, Gemini 2.5 Flash performed better than its newer version 3.5 and even achieved a slightly higher best F1 score than any other model. While this is a good result, for the flagging of civilian harm posts the PR-AUC remains the crucial metric, since it captures the model’s ability to identify infrequent instances of civilian harm while minimising false positives.

Ethical Considerations

Introducing an instrument of automated decision-making into a process of detecting civilian harm raises inherent ethical questions. These include automation bias, or how humans tend to place blind faith in machine-generated recommendations, and algorithmic bias, or how the results of these models echo the patterns present in the training data, including under- or over-representation of types of civilian harm.

The decision to test an automated method for this particular project came from the fact that there were limited resources for both steps in the process: the detection of potential civilian harm and its actual verification. Historically, we built up a vast backlog of unverified incidents because a lot of time had to be spent on monitoring the latest events so that potential evidence could be captured and preserved as soon as possible.

The automation of this process also reduced researchers’ exposure to a significant amount of distressing visual and text content, lowering the burden of exposure to traumatic material.

For this project, we tried to mitigate the ethical challenges with a number of strategies, including randomly flagging posts not captured by any model, monitoring which features models relied on to make decisions, and doing historical comparisons of patterns in the data.
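The first of these strategies, randomly surfacing posts that the model did not flag, can be sketched as follows. The threshold, the audit size and the function name are hypothetical choices made for illustration.

```python
import random


def triage_queue(scores, threshold=0.5, audit_size=2, seed=0):
    """Split posts into a ranked flagged list and a random audit sample.

    `scores` maps a post identifier to the model's score. Posts below the
    threshold are not shown by default, so a few are drawn at random for
    human review to check what the model might be missing.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    flagged = [post for post in ranked if scores[post] >= threshold]
    unflagged = [post for post in ranked if scores[post] < threshold]
    rng = random.Random(seed)
    audit = rng.sample(unflagged, min(audit_size, len(unflagged)))
    return flagged, audit


# Invented scores for five posts.
scores = {"a": 0.9, "b": 0.2, "c": 0.7, "d": 0.1, "e": 0.05}
flagged, audit = triage_queue(scores)
```

If reviewers regularly find real incidents in the audit sample, that is a sign the model or its threshold is skipping a type of civilian harm.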

Additionally, as stated above, for this initial prototype of a civilian harm detection model we did not include any features derived from the media content itself. Including the media from the posts would be a logical next step in trying to improve model performance, but using AI to review actual media comes with additional ethical challenges such as model bias.

Because of the opaque ownership of many LLM companies and the generative nature of their models, the use of LLMs for an extra feature presented additional ethical challenges, including privacy and safety concerns given the sensitive nature of the data. Our model did not rely on LLMs, though we retroactively ran a sample through them.

How the Model Fits into the Bigger Picture

After selecting this model, we created a user interface where researchers could view a list of Telegram posts sorted from most to least likely to contain indications of civilian harm. The user interface was designed for quick triage and integration: a positive confirmation from a researcher would directly send the post to the Auto Archiver (Bellingcat’s tool for preserving digital content) and then transfer it to ATLOS (our internal collaborative verification platform). Bellingcat staff and volunteers could then manually verify incidents. Researcher input was continuously saved so that this data could be used to improve the model in the future.

Initial feedback indicated that the AI model was useful. Not only were we able to reduce the time and harm involved in scouring dozens of conflict-reporting Telegram channels, researchers also reported that the stream of new posts being added to the verification backlog was capturing real and diverse cases of civilian harm.

Despite the focus on civilian harm and Telegram (highly popular in Ukraine and Russia), this pipeline is generic and can be adapted to other conflict monitoring tasks. How easily this can be done depends on how open the social media platform is and whether it is possible to scrape posts from it. Other than that, it is easy to incorporate new features and data, and cheap to automatically retrain, test and deploy models as the system receives more human input.

Looking ahead, sorting through overwhelming amounts of data in a conflict will continue to be challenging. Hopefully, this system can help newsrooms, conflict monitoring organisations, and others find the balance between ethical considerations and resources in order to carry out open source investigations into civilian harm and human rights violations.

Bellingcat is a non-profit and the ability to carry out our work is dependent on the kind support of individual donors.
