This Week In Media Measurement


Cold Open

Jenny When you search for an answer now, do you still feel like you are choosing what to read?

Davis Honestly, less than I want to admit, because if the box gives me a clean answer in ten seconds, my good intentions about checking three links get very fragile.

Jenny That's exactly what makes me uneasy, because one study this week found Google's AI answers on about one in seven trending searches, and about one in nine tiny checkable claims wasn't backed up by the pages it cited.

Davis So the middle ground is not panic and not trust, it's receipts, and today we're asking how you measure whether attention, authority, or action actually happened. Welcome to This Week In Media Measurement on paperboy.fm.

Stats Overview

Davis This week starts big, then tightens fast: 1,738 total hits, 106 qualified papers, and 308 unique authors across 27 countries.

Jenny And the split is the story. Total hits rose by 96, about 5.8 percent, but qualified papers fell by 11, about 9.4 percent. So the search caught more material, while fewer studies actually fit the measurement question. Is that noise, or are more papers talking about platforms without measuring what worked?
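
The week-over-week percentages quoted here can be reproduced from the current counts and the stated changes. This is a minimal sketch, assuming each percentage is taken relative to last week's count (current count minus the change). The function name is illustrative and does not come from the episode.

```python
def pct_change_from_delta(current: int, delta: int) -> float:
    """Percent change relative to the previous value (previous = current - delta)."""
    previous = current - delta
    if previous == 0:
        raise ValueError("previous value is zero; percent change is undefined")
    return 100.0 * delta / previous


# Counts and changes as stated in the episode.
hits_change = pct_change_from_delta(current=1738, delta=96)
qualified_change = pct_change_from_delta(current=106, delta=-11)

print(f"total hits: {hits_change:+.1f}%")
print(f"qualified papers: {qualified_change:+.1f}%")
```

Under that assumption, last week's baselines work out to 1,642 total hits and 117 qualified papers.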

Davis The topic mix points to platforms as the center of gravity. Social media led with 32 papers, then social media marketing had 5, while education, consumer behavior, digital marketing, and misinformation each had 4. That fits the through-line: attention, authority, and action are being packaged by platforms, but the outcomes are all over the map.

Jenny The methods make that even clearer. Qualitative work led with 32 papers, meaning interviews, observations, or close reading of cases. Survey was right behind at 31. Then quantitative work had 18, and content analysis had 11. So we’re getting a lot of what people say, see, and interpret, but fewer clean tests of cause and effect.

Davis The author pool narrowed too: 308 authors is down 41 from last week, and 27 countries is down 4. Indonesia led with 13 papers, India had 8, China had 5, and the U.S. and Russia had 4 each. So the week is global, but less spread out than the last episode.

Jenny One last texture check: 114 authors, or 37 percent, were first-time authors, meaning their first-ever paper in the metadata, not just new to our feed. Another 131, or 42.5 percent, were emerging authors, and 63, or 20.5 percent, were experienced. That’s a young author mix, which can bring fresh cases, but I’d want to see which findings survive a second study.
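
The author-mix percentages are simple shares of the 308 unique authors. A short sketch of that arithmetic follows; the helper name and the dictionary keys are illustrative labels, not terms from the underlying metadata.

```python
def shares(counts: dict[str, int]) -> dict[str, float]:
    """Return each category's share of the total, in percent, rounded to one decimal."""
    total = sum(counts.values())
    return {label: round(100.0 * n / total, 1) for label, n in counts.items()}


# Author counts as stated in the episode.
author_counts = {"first_time": 114, "emerging": 131, "experienced": 63}
author_mix = shares(author_counts)

print(sum(author_counts.values()), author_mix)
```

The three groups sum to the 308 authors reported earlier, so the shares cover the whole author pool.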

Paper Walkthrough

Paper 1 Measuring Google AI Overviews: Activation, Source Quality, Claim Fidelity, and Publisher Impact

Jenny Alright, let's get into the papers with Measuring Google AI Overviews. Hao Xu, Umar Iqbal, and Jacob Montgomery treat Google's generated answer box as something you can audit, not just something you notice, across fifty-five thousand three hundred ninety-three trending queries in nineteen topic areas.

Jenny The plain finding is that AI search isn't a thin wrapper on classic search. AI Overviews appeared on thirteen point seven percent of all queries, but on sixty-four point seven percent of question-shaped queries, and nearly thirty percent of the pages they cited weren't on the regular first page at all.

Davis If the cited sources look credible, how much should we still worry about unsupported claims? Like, for a person searching one question, is the risk the source list, or the sentence Google writes on top of it?

Jenny That's exactly where they press. They broke the answers into ninety-eight thousand twenty atomic claims, meaning tiny checkable statements, and eleven point zero percent weren't supported by the pages Google cited; their term is claim fidelity, which just means whether the answer actually follows from its sources. The big limit is that this is a forty-day snapshot of one fast-changing product, from March thirteenth to April twenty-first, twenty twenty-six, so it's strong measurement but not the last word.

Davis The publisher consequence is pretty concrete. Well over half of the pages cited in AI Overviews carried display ads, so if Google answers the question without the click, the publisher can lose the ad impression while sponsored ads still sit on Google's page. At this sample size, that's too big to shrug off, and it fits the Metrics Become Narratives thread: the answer box tells a story of authority, but publishers and advertisers need a separate audit because this citation system isn't the same as ranking.

Paper 2 Unsteady Metrics and Benchmarking Cultures of AI Model Builders

Davis That Google paper had ninety-eight thousand twenty tiny claims, and this next one asks a similar measurement question from the model-builder side. Stefan Baack, Christo Buschek, and Matyáš Boháček call it Unsteady Metrics and Benchmarking Cultures of AI Model Builders, and it's basically about how AI scorecards become part of the sales pitch.

Davis They looked at two hundred thirty-one benchmarks highlighted across one hundred thirty-nine model releases in twenty twenty-five, from eleven major AI builders. A benchmark is just a standard test you use to compare systems, but the punchline is awkward: sixty-three point two percent of the highlighted benchmarks were used by only one builder, so the scoreboard often isn't shared enough to be a scoreboard.

Jenny So what would make an AI benchmark feel like measurement instead of a sales slide? Is it repeat use across companies, or is the deeper problem that each company gets to decide which test counts as intelligence this week?

Davis The authors do both a dataset analysis and a qualitative read of how companies describe the tests. They also build a taxonomy, meaning a shared map of claimed skills, because the same kind of benchmark gets labeled as reasoning, knowledge, coding, or general ability depending on the builder's story. The limit is important: this is one year of highlighted public benchmarks, so it measures public benchmark culture, not every private evaluation happening inside the labs.

Jenny That makes the Metrics Become Narratives thread feel very literal. If a company says its model is best, I want to know whether best means better on a broadly used test like GPQA Diamond, LiveCodeBench, or AIME twenty twenty-five, or better on the slice of the exam it chose to put in the press release. That's not cynicism; it's just asking whether the ruler is shared before we argue about who measured taller.

Paper 3 Digital Information Cascades and Sustainable Visitor Flow Management: Evidence from GPS Trajectories and Social Media During an Urban Festival

Jenny That shared-ruler question carries right into Digital Information Cascades and Sustainable Visitor Flow Management, except the ruler here is not a model leaderboard, it's whether online chatter actually moved people around Bangkok during Songkran.

Jenny Wang and Xing link ninety-five thousand six hundred ninety-two taxi GPS trajectories with five thousand nine hundred ninety-five geotagged Twitter posts from the twenty nineteen festival, and the plain finding is pretty direct: districts with more recent social buzz became more likely taxi destinations.

Davis But how do they separate social buzz causing visits from popular districts simply getting more buzz because everyone is already there?

Jenny They lag the buzz measure, so posts come before the taxi choice. Then they use a conditional logit model, which is just a way to compare the district a visitor picked against the districts they could have picked. On top of that, they add placebo permutations plus a Bartik shift-share instrument, meaning an outside push used to test whether the relationship still looks causal rather than circular.
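
To make the conditional logit idea concrete, here is a minimal sketch of how such a model turns one district attribute into choice probabilities: each district gets a utility of beta times its lagged buzz, and the probability of picking a district is its exponentiated utility divided by the sum over all available districts. The district names, buzz values, and beta below are hypothetical, and the sketch uses a single attribute; it is not the authors' specification, which may include further controls.

```python
import math


def choice_probabilities(lagged_buzz: dict[str, float], beta: float) -> dict[str, float]:
    """Single-attribute conditional logit: P(j) = exp(beta * x_j) / sum_k exp(beta * x_k)."""
    utilities = {district: beta * x for district, x in lagged_buzz.items()}
    # Subtract the largest utility before exponentiating for numerical stability;
    # this leaves the resulting probabilities unchanged.
    max_u = max(utilities.values())
    weights = {district: math.exp(u - max_u) for district, u in utilities.items()}
    total = sum(weights.values())
    return {district: w / total for district, w in weights.items()}


# Hypothetical lagged post counts for three districts (not data from the paper).
buzz = {"district_a": 40.0, "district_b": 10.0, "district_c": 0.0}

# Hypothetical coefficient chosen only for illustration.
probs = choice_probabilities(buzz, beta=0.02)
print(probs)
```

With a positive beta, districts with more lagged buzz receive a higher choice probability, and with beta set to zero every district is equally likely, which is the "no buzz effect" baseline.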

Jenny The festival-period instrumental-variable estimate is beta equals plus zero point zero one nine, with p less than zero point zero zero one, and it's fifty-one percent larger than the ordinary within-period estimate of plus zero point zero one two, which they read as sparse Twitter data understating the real effect; the limit is that this is strong evidence for Bangkok's Songkran, not a guarantee that every festival, city, or tourist market behaves the same way.

Davis That's the From Exposure To Action thread in a very literal form: if you're running a location campaign, don't stop at counting posts or impressions, pair the social signal with movement data and a time lag before you claim the buzz drove foot traffic.
