[01:00] *** JesseW has joined #urlteam
[02:40] If urlteam data did go into the wayback, then all the users of our upcoming 404 browser integrations can enjoy urlteam's work.
[02:50] it would be easy to put it into wayback, wouldn't it? isn't it all WARCs with 301 records?
[02:52] it's in beacon link dump format
[02:53] https://gbv.github.io/beaconspec/beacon.html
[02:53] *** VADemon has quit IRC (Quit: left4dead)
[02:57] i still can't believe that the stupid ad-hoc format that i came up with for my initial archiveteam experiments has become an internet-draft
[02:57] makes me pretty sad
[03:01] no date info for WARC purposes
[03:01] yeah it's a terrible format and i would like to go back in time and start saving warcs
[03:06] oh shit it's not warcs?
[03:06] not that I had any actual reason to think it was, but I'm surprised somewhat
[03:07] it's a lot smaller than WARCs would be
[03:07] You could recrawl the known IDs into WARCs, it's only 4.7 billion urls
[03:08] which I appreciate because it means I can mirror all of it more easily
[03:08] Yeah, I think re-crawling the known ones would be a great project.
[03:08] I just haven't gotten around to doing it
[03:09] Well, do you want the 404 handler of the web to process these links? Or to minimize your personal disk space? You can always go warc -> beacon ;-)
[03:09] (well, I have actively avoided working on that -- but I'm glad to cheer on others who do)
[03:10] that is a sensible argument for keeping the full headers from the terroroftinytown results, yes
[03:12] * JesseW is encouraged by all this discussion to look over the contributions people have made recently and see what can be added to the tracker
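For reference, the BEACON link dump format linked above is essentially a handful of meta fields followed by one mapping per line. A minimal parsing sketch, assuming the two-field "source|target" line form and a #PREFIX meta field; the shortener and codes below are made up, and real urlteam dumps may carry extra meta fields or an annotation column:

```python
def parse_beacon(lines):
    """Yield (short_url, target_url) pairs from BEACON link dump lines.
    Sketch only: assumes "source|target" data lines and a #PREFIX meta field,
    per https://gbv.github.io/beaconspec/beacon.html."""
    prefix = ""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):  # meta field, e.g. "#PREFIX: ..."
            name, _, value = line[1:].partition(":")
            if name.strip().upper() == "PREFIX":
                prefix = value.strip()
            continue
        source, _, target = line.partition("|")
        yield prefix + source, target

# Hypothetical sample dump (not real urlteam data):
sample = [
    "#FORMAT: BEACON",
    "#PREFIX: http://example-shortener.invalid/",
    "abc12|http://example.org/some/long/page",
    "abc13|http://example.net/another/page",
]
for short_url, target in parse_beacon(sample):
    print(short_url, "->", target)
```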
[03:17] shortdoi.org seems feasible to scrape
[03:17] making a project now
[03:19] the first urlteam to WARC project?
[03:19] ha -- no, just another one using the currently lossy format
[03:19] just kidding
[03:20] heh. but if you keep poking, I might make a PR (although I'd much prefer if you or someone else did so)
[03:21] ok, shortdoi-org started
[03:22] note, it actually maps from dx.doi.org/10/ not shortdoi.org, as that is just an initial redirect
[03:24] ok, 681 found so far
[03:24] it seems to have missed some, though, which I'm confused by
[03:32] i've been working on aggregating the lists in my spare time, i was looking to do something similar to what you're discussing above (jessew: we were discussing it briefly a while back, a run through to make sure everything is in the wayback, archivebot-like thing to get them if not)
[03:33] wb jessew, btw :)
[03:33] yeah, I remembered you were working on that
[03:33] I think someone else was, too -- maybe Frogging or vitzli (not sure if I'm misspelling their nick)...
[03:34] sadly I haven't been working on anything
[03:34] i had started with luckcolor's list of dead shorteners though so we can't re-crawl them :\
[03:35] except $dayjob
[03:35] Frogging: ah, ok -- I think I got you confused with someone else
[03:35] $dayjob is useful and important. At a minimum, it enables you to pay for #archivebot pipelines (which is *GREATLY* appreciated)
[03:35] that is true :p
[03:37] grumble -- I screwed up the alphabet on shortdoi-org
[03:38] jessew: i think there are some 3 digit identifiers, i was going to update the wiki.. i forgot to add a note about doing a bit more research
[03:38] yeah, I knew about the 3 digit identifiers -- the error I made was that there are identifiers that *start* with "a", so "a" can't be the first character in the alphabet
[03:39] ah, cool
[03:40] we may have missed various other synchronous ones with a similar error
[03:40] someone (else) should probably go through them and check
[03:41] spot check, I mean -- then I can fix the alphabets and we can grab them
[03:43] ok, shortdoi-org queue up to 40 -- it should be done in a couple of hours
[03:48] flip-it is *still* going, since June 6th -- 21,867,543 found
[03:53] moar urls!
[03:54] ah, _i_ messed up the alphabet you mean.. :) i didn't see '0', oops
[03:55] no, there *isn't* a '0' -- I just need something at the beginning, because the first character in the alphabet doesn't get used as an initial letter in the generated shortcodes
[03:55] and 'a' *is* used that way in shortdoi-org
[03:55] ah, i gotcha now
[03:56] yeah, it's basically just a bug in terroroftinytown
[03:56] albeit one that (maybe) doesn't cause us to miss much (as it may be a similar bug in various of the shorteners we work over)
[03:57] ah
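The enumeration bug described here is easy to picture: with plain base-N encoding of sequential IDs, the first character of the alphabet acts as the zero digit, so it never appears as the leading character of a generated code. The alphabet below is hypothetical and terroroftinytown's actual generator may differ in detail; this only illustrates the class of error being discussed.

```python
def encode(n, alphabet):
    """Plain base-N encoding of a sequential ID. The first alphabet character
    is the zero digit, so it is never the leading character of a
    multi-character code."""
    if n == 0:
        return alphabet[0]
    base = len(alphabet)
    digits = []
    while n:
        n, rem = divmod(n, base)
        digits.append(alphabet[rem])
    return "".join(reversed(digits))

# Hypothetical alphabet with 'a' first, as in the shortdoi-org mistake.
alphabet = "abcdefghijklmnopqrstuvwxyz0123456789"
generated = {encode(n, alphabet) for n in range(500000)}
# No generated code longer than one character starts with 'a', so real
# identifiers that do start with 'a' would never be enumerated.
print(any(c.startswith("a") and len(c) > 1 for c in generated))  # False
```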
[04:14] *** Start_ has joined #urlteam
[04:14] *** Start has quit IRC (Ping timeout: 260 seconds)
[04:16] shortdoi-org is done
[05:00] *** JesseW has quit IRC (Ping timeout: 370 seconds)
[05:17] That was... short.
[05:18] (sorry)
[05:21] *drum hit*
[05:22] it gets through the sequential shorteners pretty quickly from what i've seen
[05:23] question from above: is there any way to massage the data we have for the dead shorteners into something that's usable for wayback/your 404 handler?
[05:23] s/from/re/
[06:12] *** JesseW has joined #urlteam
[06:38] The main issue is that we would like accurate crawl dates in WARC files.
[06:39] so if we create WARC from another format, we'd like that info to be available. And, it does not exist in beacon.
[06:39] they are dated approximately daily (although more recent ones are more like weekly)
[06:41] but re-crawling all the non-dead ones is certainly a good idea, because as a nice side-effect, it would let us grab the target pages, too (which we currently don't)
[06:41] So, if we're going to hand out affidavits to courts, as you can imagine "approximately" is not a good thing.
[06:41] and indeed, it would be interesting to also have the target pages.
[06:43] I'd hope that the question of "what exact minute did you check that this shortcode pointed to this address" wouldn't come up very often in affidavits -- but I can see the problem if there's no way to *mark* some entries as "circa"
[06:43] Just as an example, we have an 80-billion-page horizontal crawl for which we have yet to figure out accurate dates. This makes me very sad.
[06:43] *** dashcloud has quit IRC (Ping timeout: 244 seconds)
[06:43] And no, no existing "circa" system.
[06:43] what about the *very* early material? (I'm thinking of the BBC website from the very early 1990s, that they gave you, and got converted)
[06:43] I don't know about that one, yet.
[06:44] * JesseW goes to try and dig up a link
[06:44] Personally I'm eager to get some super-early Stanford crawls into the wayback, so that the initial Darwin Awards webpages are properly archived.
[06:44] awesome
[06:45] (the CS department asked Wendy to leave, because it was too popular. :-) )
[06:46] ha
[06:46] http://www.forbes.com/sites/kalevleetaru/2015/11/16/how-much-of-the-internet-does-the-wayback-machine-really-archive <- interesting
[06:48] Uh, yeah. I encourage anyone interested in that article to try out https://web-beta.archive.org/ because it provides a lot more info than was available when that article was written.
[06:48] heh
[06:48] *** dashcloud has joined #urlteam
[06:49] *** svchfoo3 sets mode: +o dashcloud
[06:51] Among other things, you can see how important ArchiveTeam is for many sites. I never appreciated you guys properly until I built that thing.
[06:51] Now I'm a fan!
[06:51] hehheheh
[06:51] Albeit not a fan of non-WARC stuff.
[06:52] yeah, urlteam is somewhat of a red-headed stepchild in some ways
[06:58] SketchCow mostly has you guys moving in the right direction, I'm just trying to fill in a few Wayback-specific details.
[06:58] it's very welcome.
[06:59] * JesseW is not finding the bbc collection I was thinking of -- I'll let you know if it pops up
[07:01] *** dashcloud has quit IRC (Read error: Operation timed out)
[07:04] it sounds like something that monitors urlteam exports, grabs them and generates WARCs might be worthwhile going forward?
[07:05] that would likely be a good workaround -- eventually, I think integrating full capture into terroroftinytown would be better
[07:06] but even before making something that monitors, just writing a Warrior project that goes through the existing urlteam exports and makes WARCs from them would be great
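As a rough sketch of that Warrior-project idea (re-crawl known short URLs, keep the full 301 responses with an accurate capture date), something like the following could work. warcio and requests are assumed tooling here, not what urlteam currently uses, and a real project would also need rate limiting, retries and error handling:

```python
import requests
from warcio.statusandheaders import StatusAndHeaders
from warcio.warc_writer import WARCWriter

def recrawl_to_warc(short_urls, warc_path):
    """Fetch each short URL without following the redirect and store the raw
    301 response as a WARC 'response' record. warcio stamps WARC-Date at
    write time, which is the accurate crawl date the Wayback Machine wants."""
    with open(warc_path, "wb") as out:
        writer = WARCWriter(out, gzip=True)
        for url in short_urls:
            resp = requests.get(url, allow_redirects=False, stream=True,
                                headers={"Accept-Encoding": "identity"})
            http_headers = StatusAndHeaders(
                "{} {}".format(resp.status_code, resp.reason),
                resp.raw.headers.items(), protocol="HTTP/1.1")
            record = writer.create_warc_record(
                url, "response", payload=resp.raw, http_headers=http_headers)
            writer.write_record(record)

# Hypothetical usage, with a made-up shortener and code:
# recrawl_to_warc(["http://example-shortener.invalid/abc12"],
#                 "urlteam-recrawl-00001.warc.gz")
```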
[07:09] *** dashcloud has joined #urlteam
[07:09] *** svchfoo3 sets mode: +o dashcloud
[07:14] *** JesseW has quit IRC (Ping timeout: 370 seconds)
[09:08] *** Fusl has quit IRC (Read error: Operation timed out)
[09:12] *** WinterFox has joined #urlteam
[11:40] JesseW i would suggest splitting the work
[11:40] i mean
[11:41] if we are going that route i suggest that we keep the current setup
[11:41] and then have some servers that receive work
[11:41] and print out warcs with only the 301 records
[11:42] and then in other warcs we do an archive-only crawl
[11:42] I mean i would prefer to have the data separated: beacon, beacon warc, archived urls warc
[11:59] *** Fusl has joined #urlteam
[12:28] *** dashcloud has quit IRC (Read error: Operation timed out)
[12:32] *** dashcloud has joined #urlteam
[12:32] *** svchfoo3 sets mode: +o dashcloud
[13:41] *** dashcloud has quit IRC (Read error: Operation timed out)
[13:45] *** dashcloud has joined #urlteam
[13:46] *** svchfoo3 sets mode: +o dashcloud
[14:18] *** luckcolor has quit IRC (Remote host closed the connection)
[14:19] *** luckcolor has joined #urlteam
[14:24] *** luckcolor has quit IRC (Read error: Connection reset by peer)
[14:25] *** luckcolor has joined #urlteam
[14:33] *** luckcolor has quit IRC (Read error: Connection reset by peer)
[14:33] *** luckcolor has joined #urlteam
[14:43] *** luckcolor has quit IRC (Read error: Connection reset by peer)
[14:55] *** luckcolor has joined #urlteam
[15:21] *** WinterFox has quit IRC (Remote host closed the connection)
[15:29] *** SilSte has quit IRC (Read error: Operation timed out)
[15:30] *** JesseW has joined #urlteam
[15:55] *** JesseW has quit IRC (Read error: Operation timed out)
[16:10] luckcolor: that does make sense — but it does add load. Why not just save the full headers in the initial pass? (I agree about grabbing the targets in a separate pass)
[16:10] *** Start_ is now known as Start
[16:11] well, in this manner we have the "simple to manage" text file with the list of urls (because that's what beacon basically is) and we also then have warc
[16:11] i mean if we have a script to easily extract beacon or just a url list from warc then yes we can omit the first part
[16:13] * luckcolor needs url lists for the resolve that he's making
[16:13] * luckcolor *resolver
[16:14] yeah, I certainly would *produce* both beacon and WARCs
[16:14] it's just a matter of whether the initial probing throws away the header info or not
[16:14] no ofc not
[16:15] well, it currently *does* — that's what I was thinking we should (eventually) fix
[16:15] yeah the change would be so that the crawler will generate mini warcs
[16:15] that we can collect
[16:15] :P
[16:16] yes
[16:16] mini warcs team NEW Urlteam technology
[16:16] *mini warcs! NEW Urlteam technology
[16:19] if we are going this rate i don't recommend using wpull
[16:20] as it would be a hassle to successfully ship it to the crawlers
[16:31] *rate > route
[17:09] yes! mini warcs!
[17:09] <3
[18:01] I know right
[18:01] So tiny and cute warcs :P
[18:02] like cocktail sausages
[18:02] * luckcolor goes to look up what cocktail sausages are
[18:03] ah i see what you mean XD
[18:06] *** VADemon has joined #urlteam
[18:07] *** SilSte has joined #urlteam
[18:33] so JesseW do you approve of the tiny warc technology (idea) for urlteam crawlers? :)
[18:34] actually mini warcs
[18:34] because mini is better
[19:02] sure, works for me
[20:58] *** JW_work has joined #urlteam
[20:59] *** JW_work1 has quit IRC (Read error: Operation timed out)
[22:17] *** SilSte has quit IRC (Read error: Operation timed out)
[22:26] *** SilSte has joined #urlteam
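For the other direction ("warc -> beacon", or the "script to easily extract beacon or just a url list from warc" mentioned above), a sketch along these lines could pull the mappings back out of such mini WARCs. warcio is again an assumed library choice, and the output is a BEACON-style source|target listing rather than a complete BEACON file with meta fields:

```python
from warcio.archiveiterator import ArchiveIterator

def warc_to_beacon(warc_path):
    """Print a BEACON-style source|target line for every 301 response
    captured in the given WARC file."""
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response" or record.http_headers is None:
                continue
            if record.http_headers.get_statuscode() != "301":
                continue
            short_url = record.rec_headers.get_header("WARC-Target-URI")
            target = record.http_headers.get_header("Location")
            if short_url and target:
                print("{}|{}".format(short_url, target))

# Hypothetical usage:
# warc_to_beacon("urlteam-recrawl-00001.warc.gz")
```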