[01:00] *** JesseW has joined #urlteam
[02:40] If urlteam data did go into the wayback, then all the users of our upcoming 404 browser integrations can enjoy urlteam's work.
[02:50] it would be easy to put it into wayback, wouldn't it? isn't it all WARCs with 301 records?
[02:52] it's in beacon link dump format
[02:53] https://gbv.github.io/beaconspec/beacon.html
[02:53] *** VADemon has quit IRC (Quit: left4dead)
[02:57] i still can't believe that the stupid ad-hoc format that i came up with for my initial archiveteam experiments has become an internet-draft
[02:57] makes me pretty sad
[03:01] no date info for WARC purposes
[03:01] yeah it's a terrible format and i would like to go back in time and start saving warcs
[03:06] oh shit it's not warcs?
[03:06] not that I had any actual reason to think it was, but I'm surprised somewhat
[03:07] it's a lot smaller than WARCs would be
[03:07] You could recrawl the known IDs into WARCs, it's only 4.7 billion urls
[03:08] which I appreciate because it means I can mirror all of it more easily
[03:08] Yeah, I think re-crawling the known ones would be a great project.
[03:08] I just haven't gotten around to doing it
[03:09] Well, do you want the 404 handler of the web to process these links? Or to minimize your personal disk space? You can always go warc -> beacon ;-)
[03:09] (well, I have actively avoided working on that -- but I'm glad to cheer on others who do)
[03:10] that is a sensible argument for keeping the full headers from the terroroftinytown results, yes
[03:12] * JesseW is encouraged by all this discussion to look over the contributions people have made recently and see what can be added to the tracker
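For reference, the BEACON link dump format linked above is essentially a handful of meta fields followed by one mapping per line. A minimal parsing sketch, assuming the two-field "source|target" line form and a #PREFIX meta field; the shortener and codes below are made up, and real urlteam dumps may carry extra meta fields or an annotation column:

```python
def parse_beacon(lines):
    """Yield (short_url, target_url) pairs from BEACON link dump lines.
    Sketch only: assumes "source|target" data lines and a #PREFIX meta field,
    per https://gbv.github.io/beaconspec/beacon.html."""
    prefix = ""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):  # meta field, e.g. "#PREFIX: ..."
            name, _, value = line[1:].partition(":")
            if name.strip().upper() == "PREFIX":
                prefix = value.strip()
            continue
        source, _, target = line.partition("|")
        yield prefix + source, target

# Hypothetical sample dump (not real urlteam data):
sample = [
    "#FORMAT: BEACON",
    "#PREFIX: http://example-shortener.invalid/",
    "abc12|http://example.org/some/long/page",
    "abc13|http://example.net/another/page",
]
for short_url, target in parse_beacon(sample):
    print(short_url, "->", target)
```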
[03:17] shortdoi.org seems feasible to scrape
[03:17] making a project now
[03:19] the first urlteam to WARC project?
[03:19] ha -- no, just another one using the currently lossy format
[03:19] just kidding
[03:20] heh. but if you keep poking, I might make a PR (although I'd much prefer if you or someone else did so)
[03:21] ok, shortdoi-org started
[03:22] note, it actually maps from dx.doi.org/10/ not shortdoi.org, as that is just an initial redirect
[03:24] ok, 681 found so far
[03:24] it seems to have missed some, though, which I'm confused by
[03:32] i've been working on aggregating the lists in my spare time, i was looking to do something similar to what you're discussing above (jessew: we were discussing it briefly a while back, a run through to make sure everything is in the wayback, archivebot-like thing to get them if not)
[03:33] wb jessew, btw :)
[03:33] yeah, I remembered you were working on that
[03:33] I think someone else was, too -- maybe Frogging or vitzli (not sure if I'm misspelling their nick)...
[03:34] sadly I haven't been working on anything
[03:34] i had started with luckcolor's list of dead shorteners though so we can't re-crawl them :\
[03:35] except $dayjob
[03:35] Frogging: ah, ok -- I think I got you confused with someone else
[03:35] $dayjob is useful and important. At a minimum, it enables you to pay for #archivebot pipelines (which is *GREATLY* appreciated)
[03:35] that is true :p
[03:37] grumble -- I screwed up the alphabet on shortdoi-org
[03:38] jessew: i think there are some 3 digit identifiers, i was going to update the wiki.. i forgot to add a note about doing a bit more research
[03:38] yeah, I knew about the 3 digit identifiers -- the error I made was that there are identifiers that *start* with "a", so "a" can't be the first character in the alphabet
[03:39] ah, cool
[03:40] we may have missed various other synchronous ones with a similar error
[03:40] someone (else) should probably go through them and check
[03:41] spot check, I mean -- then I can fix the alphabets and we can grab them
[03:43] ok, shortdoi-org queue up to 40 -- it should be done in a couple of hours
[03:48] flip-it is *still* going, since June 6th -- 21,867,543 found
[03:53] moar urls!
[03:54] ah, _i_ messed up the alphabet you mean.. :) i didn't see '0', oops
[03:55] no, there *isn't* a '0' -- I just need something at the beginning, because the first character in the alphabet doesn't get used as an initial letter in the generated shortcodes
[03:55] and 'a' *is* used that way in shortdoi-org
[03:55] ah, i gotcha now
[03:56] yeah, it's basically just a bug in terroroftinytown
[03:56] albeit one that (maybe) doesn't cause us to miss much (as it may be a similar bug in various of the shorteners we work over)
[03:57] ah
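The enumeration bug described here is easy to picture: with plain base-N encoding of sequential IDs, the first character of the alphabet acts as the zero digit, so it never appears as the leading character of a generated code. The alphabet below is hypothetical and terroroftinytown's actual generator may differ in detail; this only illustrates the class of error being discussed.

```python
def encode(n, alphabet):
    """Plain base-N encoding of a sequential ID. The first alphabet character
    is the zero digit, so it is never the leading character of a
    multi-character code."""
    if n == 0:
        return alphabet[0]
    base = len(alphabet)
    digits = []
    while n:
        n, rem = divmod(n, base)
        digits.append(alphabet[rem])
    return "".join(reversed(digits))

# Hypothetical alphabet with 'a' first, as in the shortdoi-org mistake.
alphabet = "abcdefghijklmnopqrstuvwxyz0123456789"
generated = {encode(n, alphabet) for n in range(500000)}
# No generated code longer than one character starts with 'a', so real
# identifiers that do start with 'a' would never be enumerated.
print(any(c.startswith("a") and len(c) > 1 for c in generated))  # False
```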
[04:14] *** Start_ has joined #urlteam
[04:14] *** Start has quit IRC (Ping timeout: 260 seconds)
[04:16] shortdoi-org is done
[05:00] *** JesseW has quit IRC (Ping timeout: 370 seconds)
[05:17] That was... short.
[05:18] (sorry)
[05:21] *drum hit*
[05:22] it gets through the sequential shorteners pretty quickly from what i've seen
[05:23] question from above: is there any way to massage the data we have for the dead shorteners into something that's usable for wayback/your 404 handler?
[05:23] s/from/re/
[06:12] *** JesseW has joined #urlteam
[06:38] The main issue is that we would like accurate crawl dates in WARC files.
[06:39] so if we create WARC from another format, we'd like that info to be available. And, it does not exist in beacon.
[06:39] they are dated approximately daily (although more recent ones are more like weekly)
[06:41] but re-crawling all the non-dead ones is certainly a good idea, because as a nice side-effect, it would let us grab the target pages, too (which we currently don't)
[06:41] So, if we're going to hand out affidavits to courts, as you can imagine "approximately" is not a good thing.
[06:41] and indeed, it would be interesting to also have the target pages.
[06:43] I'd hope that the question of "what exact minute did you check that this shortcode pointed to this address" wouldn't come up very often in affidavits -- but I can see the problem if there's no way to *mark* some entries as "circa"
[06:43] Just as an example, we have an 80-billion-page horizontal crawl for which we have yet to figure out accurate dates. This makes me very sad.
[06:43] *** dashcloud has quit IRC (Ping timeout: 244 seconds)
[06:43] And no, no existing "circa" system.
[06:43] what about the *very* early material? (I'm thinking of the BBC website from the very early 1990s, that they gave you, and got converted)
[06:43] I don't know about that one, yet.
[06:44] * JesseW goes to try and dig up a link
[06:44] Personally I'm eager to get some super-early Stanford crawls into the wayback, so that the initial Darwin Awards webpages are properly archived.
[06:44] awesome
[06:45] (the CS department asked Wendy to leave, because it was too popular. :-) )
[06:46] ha
[06:46] http://www.forbes.com/sites/kalevleetaru/2015/11/16/how-much-of-the-internet-does-the-wayback-machine-really-archive <- interesting
[06:48] Uh, yeah. I encourage anyone interested in that article to try out https://web-beta.archive.org/ because it provides a lot more info than was available when that article was written.
[06:48] heh
[06:48] *** dashcloud has joined #urlteam
[06:49] *** svchfoo3 sets mode: +o dashcloud
[06:51] Among other things, you can see how important ArchiveTeam is for many sites. I never appreciated you guys properly until I built that thing.
[06:51] Now I'm a fan!
[06:51] hehheheh
[06:51] Albeit not a fan of non-WARC stuff.
[06:52] yeah, urlteam is somewhat of a red-headed stepchild in some ways
[06:58] SketchCow mostly has you guys moving in the right direction, I'm just trying to fill in a few Wayback-specific details.
[06:58] it's very welcome.
[06:59] * JesseW is not finding the bbc collection I was thinking of -- I'll let you know if it pops up
[07:01] *** dashcloud has quit IRC (Read error: Operation timed out)
[07:04] it sounds like something that monitors urlteam exports, grabs them and generates WARCs might be worthwhile going forward?
[07:05] that would likely be a good workaround -- eventually, I think integrating full capture into terroroftinytown would be better
[07:06] but even before making something that monitors, just writing a Warrior project that goes through the existing urlteam exports and makes WARCs from them would be great
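As a rough sketch of that Warrior-project idea (re-crawl known short URLs, keep the full 301 responses with an accurate capture date), something like the following could work. warcio and requests are assumed tooling here, not what urlteam currently uses, and a real project would also need rate limiting, retries and error handling:

```python
import requests
from warcio.statusandheaders import StatusAndHeaders
from warcio.warc_writer import WARCWriter

def recrawl_to_warc(short_urls, warc_path):
    """Fetch each short URL without following the redirect and store the raw
    301 response as a WARC 'response' record. warcio stamps WARC-Date at
    write time, which is the accurate crawl date the Wayback Machine wants."""
    with open(warc_path, "wb") as out:
        writer = WARCWriter(out, gzip=True)
        for url in short_urls:
            resp = requests.get(url, allow_redirects=False, stream=True,
                                headers={"Accept-Encoding": "identity"})
            http_headers = StatusAndHeaders(
                "{} {}".format(resp.status_code, resp.reason),
                resp.raw.headers.items(), protocol="HTTP/1.1")
            record = writer.create_warc_record(
                url, "response", payload=resp.raw, http_headers=http_headers)
            writer.write_record(record)

# Hypothetical usage, with a made-up shortener and code:
# recrawl_to_warc(["http://example-shortener.invalid/abc12"],
#                 "urlteam-recrawl-00001.warc.gz")
```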
[07:09] *** dashcloud has joined #urlteam
[07:09] *** svchfoo3 sets mode: +o dashcloud
[07:14] *** JesseW has quit IRC (Ping timeout: 370 seconds)
[09:08] *** Fusl has quit IRC (Read error: Operation timed out)
[09:12] *** WinterFox has joined #urlteam
[11:40] JesseW i would suggest splitting the work
[11:40] i mean
[11:41] if we are going that route i suggest that we keep the current setup
[11:41] and then have some servers that receive work
[11:41] and print out warcs with only the 301 records
[11:42] and then in other warcs we do an archive-only crawl
[11:42] I mean i would prefer to have the data separated: beacon, beacon warc, archived urls warc
[11:59] *** Fusl has joined #urlteam
[12:28] *** dashcloud has quit IRC (Read error: Operation timed out)
[12:32] *** dashcloud has joined #urlteam
[12:32] *** svchfoo3 sets mode: +o dashcloud
[13:41] *** dashcloud has quit IRC (Read error: Operation timed out)
[13:45] *** dashcloud has joined #urlteam
[13:46] *** svchfoo3 sets mode: +o dashcloud
[14:18] *** luckcolor has quit IRC (Remote host closed the connection)
[14:19] *** luckcolor has joined #urlteam
[14:24] *** luckcolor has quit IRC (Read error: Connection reset by peer)
[14:25] *** luckcolor has joined #urlteam
[14:33] *** luckcolor has quit IRC (Read error: Connection reset by peer)
[14:33] *** luckcolor has joined #urlteam
[14:43] *** luckcolor has quit IRC (Read error: Connection reset by peer)
[14:55] *** luckcolor has joined #urlteam
[15:21] *** WinterFox has quit IRC (Remote host closed the connection)
[15:29] *** SilSte has quit IRC (Read error: Operation timed out)
[15:30] *** JesseW has joined #urlteam
[15:55] *** JesseW has quit IRC (Read error: Operation timed out)
[16:10] luckcolor: that does make sense — but it does add load. Why not just save the full headers in the initial pass? (I agree about grabbing the targets in a separate pass)
[16:10] *** Start_ is now known as Start
[16:11] well, in this manner we have the "simple to manage" text file with the list of urls (because that's what beacon basically is) and we also then have warc
[16:11] i mean if we have a script to easily extract beacon or just a url list from warc then yes we can omit the first part
[16:13] * luckcolor needs url lists for the resolve that he's making
[16:13] * luckcolor *resolver
[16:14] yeah, I certainly would *produce* both beacon and WARCs
[16:14] it's just a matter of whether the initial probing throws away the header info or not
[16:14] no ofc not
[16:15] well, it currently *does* — that's what I was thinking we should (eventually) fix
[16:15] yeah the change would be so that the crawler will generate mini warcs
[16:15] that we can collect
[16:15] :P
[16:16] yes
[16:16] mini warcs team NEW Urlteam technology
[16:16] *mini warcs! NEW Urlteam technology
[16:19] if we are going this rate i don't recommend using wpull
[16:20] as it would be a hassle to successfully ship it to the crawlers
[16:31] *rate > route
[17:09] yes! mini warcs!
[17:09] <3
[18:01] I know right
[18:01] So tiny and cute warcs :P
[18:02] like cocktail sausages
[18:02] * luckcolor goes to look up what cocktail sausages are
[18:03] ah i see what you mean XD
[18:06] *** VADemon has joined #urlteam
[18:07] *** SilSte has joined #urlteam
[18:33] so JesseW do you approve of the tiny warc technology (idea) for urlteam crawlers? :)
[18:34] actually mini warcs
[18:34] because mini is better
[19:02] sure, works for me
[20:58] *** JW_work has joined #urlteam
[20:59] *** JW_work1 has quit IRC (Read error: Operation timed out)
[22:17] *** SilSte has quit IRC (Read error: Operation timed out)
[22:26] *** SilSte has joined #urlteam
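For the other direction ("warc -> beacon", or the "script to easily extract beacon or just a url list from warc" mentioned above), a sketch along these lines could pull the mappings back out of such mini WARCs. warcio is again an assumed library choice, and the output is a BEACON-style source|target listing rather than a complete BEACON file with meta fields:

```python
from warcio.archiveiterator import ArchiveIterator

def warc_to_beacon(warc_path):
    """Print a BEACON-style source|target line for every 301 response
    captured in the given WARC file."""
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response" or record.http_headers is None:
                continue
            if record.http_headers.get_statuscode() != "301":
                continue
            short_url = record.rec_headers.get_header("WARC-Target-URI")
            target = record.http_headers.get_header("Location")
            if short_url and target:
                print("{}|{}".format(short_url, target))

# Hypothetical usage:
# warc_to_beacon("urlteam-recrawl-00001.warc.gz")
```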