Time |
Nickname |
Message |
02:08
🔗
|
|
nertzy has quit IRC (Quit: This computer has gone to sleep) |
04:27
🔗
|
|
myself has joined #warrior |
04:28
🔗
|
yipdw |
so, detection like that would be useful; we don't currently have the capacity to implement it |
04:28
🔗
|
yipdw |
manpower that is |
04:28
🔗
|
yipdw |
I suspect something like it would be easiest to implement in a central checker, perhaps at the upload points |
04:29
🔗
|
yipdw |
not only to avoid dealing with sharing knowledge amongst many nodes but also because that's where you see the most |
04:29
🔗
|
myself |
Deliberately retrieve the same thing from multiple clients, compare results? |
04:29
🔗
|
yipdw |
possibly; we have code to retry / retrieve the same item multiple times but we usually do not use it |
04:29
🔗
|
yipdw |
the main reason is time constraints |
04:29
🔗
|
myself |
Yeah, the bigger any system gets, the more important it becomes to have resiliency, in general. I feel like that has to be a goal, but manpower is a sensible reason to not do it right now. |
04:29
🔗
|
aaaaaaaaa |
The problem with warcs is that the same content can still create different files |
04:29
🔗
|
yipdw |
that too |
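One way to read the point above about WARCs: two fetches of the same page can differ byte-for-byte (dates, cookies, record IDs) while the page body is identical, so comparing whole files says little. A minimal sketch, assuming you already have the raw HTTP response bytes for each fetch, is to hash only the body. The helper names here are hypothetical, and the sketch does not undo chunked transfer encoding or content compression. WARC records can also carry a `WARC-Payload-Digest` header that serves a similar purpose.

```python
import hashlib


def split_http_response(raw: bytes):
    """Split a raw HTTP response into (header block, body) at the first blank line."""
    head, sep, body = raw.partition(b"\r\n\r\n")
    if not sep:
        raise ValueError("no header/body separator found")
    return head, body


def body_digest(raw: bytes) -> str:
    """SHA-1 of the body only, so volatile headers such as Date are ignored."""
    _, body = split_http_response(raw)
    return hashlib.sha1(body).hexdigest()


def same_content(raw_a: bytes, raw_b: bytes) -> bool:
    """True when two responses carry byte-identical bodies."""
    return body_digest(raw_a) == body_digest(raw_b)


# Two fetches of the same page, taken a minute apart by different clients.
first = b"HTTP/1.1 200 OK\r\nDate: Mon, 01 Jan 2024 00:00:00 GMT\r\n\r\n<html>hi</html>"
second = b"HTTP/1.1 200 OK\r\nDate: Mon, 01 Jan 2024 00:01:00 GMT\r\n\r\n<html>hi</html>"
print(first == second, same_content(first, second))
```

Pages that embed timestamps or session tokens in the body will still hash differently, so this only catches the easy case.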
04:30
🔗
|
yipdw |
that said we don't have a good definition of "sufficiently different" or a way to detect it, except perhaps in gross cases like "oh we had 200s and now it's lots of 302s" |
04:31
🔗
|
yipdw |
fortunately that seems like it's what usually happens, probably because it's also the easiest thing for a sysadmin to do |
04:31
🔗
|
aaaaaaaaa |
that or, "wait a minute all the items are the exact same size" |
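The two "gross case" signals mentioned above (200s turning into 302s, and every item coming back the same size) can be sketched roughly as follows. This is an illustration under stated assumptions, not something the project is known to run: the function names and the thresholds (`0.5`, `20` items, `0.9` share) are made up for the example, and the inputs are assumed to be plain lists of status codes and item sizes in bytes.

```python
from collections import Counter


def status_shares(codes):
    """Return each status code's share of a batch as a fraction."""
    counts = Counter(codes)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {code: n / total for code, n in counts.items()}


def distribution_shift(baseline, current):
    """Total variation distance between two status-code mixes (0 = same, 1 = disjoint)."""
    a, b = status_shares(baseline), status_shares(current)
    return 0.5 * sum(abs(a.get(c, 0.0) - b.get(c, 0.0)) for c in set(a) | set(b))


def looks_blocked(baseline, current, threshold=0.5):
    """Flag a batch whose status mix moved a long way from the baseline."""
    return distribution_shift(baseline, current) >= threshold


def dominant_size_share(sizes):
    """Fraction of items whose byte size equals the most common size."""
    if not sizes:
        return 0.0
    _, count = Counter(sizes).most_common(1)[0]
    return count / len(sizes)


def suspiciously_uniform(sizes, min_items=20, max_share=0.9):
    """Flag a batch where nearly every item has exactly the same size."""
    return len(sizes) >= min_items and dominant_size_share(sizes) >= max_share


earlier = [200] * 95 + [404] * 5
later = [302] * 90 + [200] * 10
print(looks_blocked(earlier, later))

sizes = [5120] * 48 + [7301, 9944]
print(suspiciously_uniform(sizes))
```

Both checks only need per-item metadata, which fits the idea of doing this in a central checker rather than on every node.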
04:31
🔗
|
myself |
Compressibility is often a useful metric. Concatenate both samples and compress; if it squishes to nearly the same size as just one sample, they weren't very different! |
04:31
🔗
|
yipdw |
that runs into problems when you have a lot of spam profiles |
04:32
🔗
|
yipdw |
etc. |
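The concatenate-and-compress idea above is close to what is usually called normalized compression distance. A minimal sketch using Python's `zlib` follows; the function names are hypothetical. One known limit: DEFLATE only looks back 32 KiB, so samples much larger than that will look unrelated even when they are identical. As noted above, batches of near-identical spam profiles would also score as "the same" without being errors.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size of the zlib-compressed form at the highest compression level."""
    return len(zlib.compress(data, 9))


def compression_distance(a: bytes, b: bytes) -> float:
    """Normalized compression distance: near 0 for near-duplicates, near 1 for unrelated data."""
    size_a, size_b = compressed_size(a), compressed_size(b)
    size_both = compressed_size(a + b)
    return (size_both - min(size_a, size_b)) / max(size_a, size_b)


page = b"<html><body><h1>Profile of alice</h1><p>Likes cats and trains.</p></body></html>"
retry = b"<html><body><h1>Profile of alice</h1><p>Likes cats and trams.</p></body></html>"
error = b"HTTP proxy error: upstream connect failure, reset before headers were sent"

# A lower value means the second sample added little new information.
print(compression_distance(page, retry))
print(compression_distance(page, error))
```

On very short inputs the fixed overhead of the compressed format dominates, so the scores are more meaningful on page-sized samples than on the toy strings shown here.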
04:32
🔗
|
yipdw |
anyway, we do have a gigantic corpus of responses if you are interested in looking at this |
04:32
🔗
|
myself |
hmm, I was thinking on an individual-page level rather than a whole archive, but yeah.. |
04:32
🔗
|
yipdw |
like comb over the archiveteam collections |
04:33
🔗
|
myself |
Where would I start combing? I'm interested in noodling around to see if I can make any sense of it. Is there a gallery-of-bad-responses somewhere? Heh. |
04:33
🔗
|
yipdw |
there isn't one and that's something that we do need |
04:33
🔗
|
yipdw |
on a per-collection basis |
04:34
🔗
|
myself |
Developing a response-badness heuristic would be a start at noticing all sorts of problems as they happen. |
04:34
🔗
|
yipdw |
so https://archive.org/search.php?query=archiveteam will give you way too much data |
04:34
🔗
|
yipdw |
you may want to see perhaps, uh |
04:34
🔗
|
myself |
Yummy firehose.. |
04:34
🔗
|
yipdw |
maybe the Hyves and Splinder collections |
04:34
🔗
|
yipdw |
I name those two because we ended up breaking the sites a few times |
04:35
🔗
|
yipdw |
and I think they may have broken in creative ways |
04:35
🔗
|
yipdw |
but even that's several metric asstons of data |
04:35
🔗
|
myself |
what's that in Shitloads Avoirdupois? |
04:35
🔗
|
yipdw |
I really don't know where you'd start unless you had a heuristic, which is chicken-and-egging it |
04:36
🔗
|
yipdw |
OR you just dig through the data, One Terabyte of Kilobyte Age-style |
04:36
🔗
|
yipdw |
and from there you notice "oh this is all sorts of busted" |
04:36
🔗
|
yipdw |
do note that that sort of search takes years (and indeed the OToKA bit is ongoing) |
04:37
🔗
|
yipdw |
that said, if you want to just load up some WARCs (always a good start), try webarchiveplayer -> https://github.com/ikreymer/webarchiveplayer |
04:37
🔗
|
|
aaaaaaaaa has quit IRC (Leaving) |
04:38
🔗
|
yipdw |
if you're comfortable with administering python webapps try pywb -> https://github.com/ikreymer/pywb |
04:38
🔗
|
myself |
I'm not, but the first looks like a good place to start noodling around. |
04:38
🔗
|
yipdw |
you can also try IA's wayback -> https://github.com/internetarchive/wayback <- but in my experience pywb is much easier to get working and is more advanced wrt replay capabilities |
04:39
🔗
|
yipdw |
maybe there's some things IA's wayback handles that pywb doesn't, I haven't checked that deeply |
04:40
🔗
|
yipdw |
I was mostly impressed when I loaded up a YouTube grab done by ArchiveBot in pywb and saw the video playing |
04:40
🔗
|
yipdw |
that was a DAAAAMN moment |
04:40
🔗
|
myself |
I'm mostly just looking to use these to see if I can find error pages in Hyves or Splinder, right? and then look to see if I can detect those in a generic way. |
04:40
🔗
|
yipdw |
yeah |
04:41
🔗
|
yipdw |
you can't really find errors in WARCs if you can't load them up |
04:41
🔗
|
myself |
so the differences between players shouldn't matter much |
04:41
🔗
|
yipdw |
it can |
04:41
🔗
|
yipdw |
some players will choke on requests for resources that are actually in the WARC |
04:41
🔗
|
yipdw |
hyves/splinder are probably not using any features that would trigger that? but who knows |
04:41
🔗
|
yipdw |
I've seen the "choke" thing happen on infinite-scroll pages |
07:34
🔗
|
|
Atom__ has quit IRC (Read error: Connection reset by peer) |
07:36
🔗
|
|
Atom__ has joined #warrior |
13:17
🔗
|
|
Atom__ has quit IRC (Read error: Connection reset by peer) |
13:26
🔗
|
|
Atom__ has joined #warrior |
13:40
🔗
|
|
Atom__ has quit IRC (Ping timeout: 306 seconds) |
15:02
🔗
|
|
phuzion_ is now known as phuzion |
15:06
🔗
|
|
Fusl has quit IRC (hub.efnet.us irc.umich.edu) |
15:06
🔗
|
|
trs80 has quit IRC (hub.efnet.us irc.umich.edu) |
15:11
🔗
|
|
Fusl has joined #warrior |
16:00
🔗
|
|
trs80 has joined #warrior |
18:18
🔗
|
|
Start has quit IRC (Quit: Disconnected.) |
18:57
🔗
|
|
aaaaaaaaa has joined #warrior |
19:22
🔗
|
|
aaaaaaaa_ has joined #warrior |
19:23
🔗
|
|
aaaaaaaaa has quit IRC (Ping timeout: 600 seconds) |
19:24
🔗
|
|
aaaaaaaa_ is now known as aaaaaaaaa |
20:41
🔗
|
|
atlogbot has quit IRC (Remote host closed the connection) |
20:42
🔗
|
|
atlogbot has joined #warrior |
20:53
🔗
|
|
aaaaaaaa_ has joined #warrior |
20:53
🔗
|
|
aaaaaaaaa has quit IRC (Read error: Connection reset by peer) |
21:19
🔗
|
|
aaaaaaaa_ is now known as aaaaaaaaa |
21:48
🔗
|
|
aaaaaaaa_ has joined #warrior |
21:48
🔗
|
|
aaaaaaaaa has quit IRC (Read error: Connection reset by peer) |
21:50
🔗
|
|
aaaaaaaa_ is now known as aaaaaaaaa |
22:14
🔗
|
|
nertzy has joined #warrior |
23:02
🔗
|
|
nertzy has quit IRC (Quit: This computer has gone to sleep) |