Hard drives keep getting bigger, but we keep losing space. It seems that the size of files grows in lock-step with the size of hard drives (or is it the other way around?). I suspect there is a fair amount of file-size bloat (much like software bloat) plaguing the media we have today.

Let’s start by taking e621 and analyzing the average file size over time.

import db from './../../_code/database.mjs';
import { file_here } from './../../_code/file.mjs';
import { LineChart } from 'd3-charts';

const all_file_sizes = db.prepare(`
	with dates as (
		select
			-- Timestamps are stored as milliseconds of Unix time; bucket them into whole Julian days
			floor(strftime('%J', created_at / 1000, 'unixepoch')) as julian_day,
			file_size
		from posts_metadata
	)
	select
		unixepoch(julian_day) * 1000 as x,
		avg(file_size) / (1000*1000) as y
	from dates
	group by julian_day
	order by x asc; 
`).all();

const file_path = file_here(import.meta.url, 'daily_file_size.svg');
new LineChart({
	title: {text: `Daily average file size - ${db.most_recent_date}`},
	x_label: {text: 'Year'},
	y_label: {text: 'Average size of file (MB)'},
	x_scale: {type: 'time'},
	y_scale: {min: 0}
}).draw(all_file_sizes).save(file_path);

Whoa, that looks like it has some outliers. I cannot pin dates to the outliers that match results on e621, but they are almost assuredly caused by large Flash files posted on days with very few uploads. The most recent outlier came after the death of Flash, so it must be a batch of large videos uploaded on the same day.
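To sanity-check that guess, a query along these lines would surface the worst offending days together with how many posts each had (a sketch against the same posts_metadata table; the limit of 10 is arbitrary):

with dates as (
	select
		floor(strftime('%J', created_at / 1000, 'unixepoch')) as julian_day,
		file_size
	from posts_metadata
)
select
	date(julian_day) as day,
	avg(file_size) / (1000*1000) as average_mb,
	count(*) as post_count
from dates
group by julian_day
order by average_mb desc
limit 10;

If the theory holds, the top rows should pair a large average with a small post_count.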

Let’s do the same thing, but smooth the data by averaging over years instead of days, and separate it by file type.

import db from './../../_code/database.mjs';
import { file_here } from './../../_code/file.mjs';
import { LineChart } from 'd3-charts';

const all_file_sizes = db.prepare(`
	with dates as (
		select
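			-- Bucket Julian days into 365-day "years" (close enough for a yearly average)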
			floor(strftime('%J', created_at / 1000, 'unixepoch') / 365) as year,
			file_size,
			file_ext
		from posts_metadata
		where file_ext not like 'del.%'
	)
	select
		unixepoch(year * 365) * 1000 as x,
		avg(file_size) / (1000*1000) as y,
		file_ext as key
	from dates
	group by file_ext, year
	order by x asc; 
`).all();

const file_path = file_here(import.meta.url, 'yearly_file_size.svg');
new LineChart({
	title: {text: `Yearly average file size - ${db.most_recent_date}`},
	x_label: {text: 'Year'},
	y_label: {text: 'Average size of file (MB)'},
	x_scale: {type: 'time'},
	y_scale: {min: 0}
}).draw(all_file_sizes).save(file_path);

So now we know that there is a visible increase in average file size over time on e621 for all file types, but that is not the whole story. If you were to do this analysis by going through your favorites and looking at file sizes, you might not come to the same conclusion. That is because there is a large variance in file size regardless of format.

with dates as (
	select
		floor(strftime('%J', created_at / 1000, 'unixepoch') / 365) as year,
		file_size as y,
		file_ext
	from posts_metadata
	where file_ext not like 'del.%'
)
select
	unixepoch(year * 365) * 1000 as x,
	avg(y) as mean,
	-- One-pass sample variance: (sum(y^2) - (sum(y))^2 / n) / (n - 1);
	-- take the square root for the standard deviation.
	sqrt( (sum(y*y) - sum(y)*sum(y)/count(*)) / (count(*) - 1)) as standard_deviation,
	file_ext as key
from dates
group by file_ext, year
order by x asc;

I am leaving out a graph because I am not sure how best to present this information and am wary it will be taken out of context. I can say that the standard deviation was the same as or larger than the mean for every file type over time, which means the distribution of file sizes is heavily skewed.
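One compact way to quantify that skew is the coefficient of variation (standard deviation divided by the mean); for a nonnegative quantity like file size, a value at or above 1 usually signals a long right tail, and an exponential distribution sits at exactly 1. A sketch reusing the formula from above, collapsed over all years:

with sizes as (
	select
		file_size as y,
		file_ext
	from posts_metadata
	where file_ext not like 'del.%'
)
select
	file_ext,
	avg(y) / (1000*1000) as mean_mb,
	-- Sample standard deviation divided by the mean
	sqrt( (sum(y*y) - sum(y)*sum(y)/count(*)) / (count(*) - 1)) / avg(y) as coefficient_of_variation
from sizes
group by file_ext;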

I could plot the distribution of file sizes for a specific file format, but it is not very interesting (I did plot it). It looks almost like an exponential distribution (almost, because 0 KB is not the mode).
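If you want to see that shape yourself, bucketing file sizes into fixed-width bins is enough (a sketch; 'png' and the 100 KB bin width are arbitrary choices):

select
	-- Lower edge of each 100 KB bin, in KB
	floor(file_size / 100000) * 100 as bin_kb,
	count(*) as post_count
from posts_metadata
where file_ext = 'png'
group by bin_kb
order by bin_kb asc;

The counts climb to a mode somewhere past 0 KB and then decay in a long tail, which matches the near-exponential shape described above.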