#urlteam 2016-06-30,Thu


Time Nickname Message
01:00 🔗 JesseW has joined #urlteam
02:40 🔗 wumpus If urlteam data did go into the wayback, then all the users of our upcoming 404 browser integrations can enjoy urlteam's work.
02:50 🔗 Frogging it would be easy to put it into wayback, wouldn't it? isn't it all WARCs with 301 records?
02:52 🔗 bwn it's in beacon link dump format
02:53 🔗 bwn https://gbv.github.io/beaconspec/beacon.html
02:53 🔗 VADemon has quit IRC (Quit: left4dead)
02:57 🔗 xmc i still can't believe that the stupid ad-hoc format that i came up with for my initial archiveteam experiments has become an internet-draft
02:57 🔗 xmc makes me pretty sad
03:01 🔗 wumpus no date info for WARC purposes
03:01 🔗 xmc yeah it's a terrible format and i would like to go back in time and start saving warcs
03:06 🔗 Frogging oh shit it's not warcs?
03:06 🔗 Frogging not that I had any actual reason to think it was, but I'm surprised somewhat
03:07 🔗 JesseW it's a lot smaller than WARCs would be
03:07 🔗 wumpus You could recrawl the known IDs into WARCs, it's only 4.7 billion urls
03:08 🔗 JesseW which I appreciate because it means I can mirror all of it more easily
03:08 🔗 JesseW Yeah, I think re-crawling the known ones would be a great project.
03:08 🔗 JesseW I just haven't gotten around to doing it
03:09 🔗 wumpus Well, do you want the 404 handler of the web to process these links? Or to minimize your personal disk space? You can always go warc -> beacon ;-)
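A minimal sketch of the warc -> beacon direction wumpus mentions, assuming a crawl WARC that recorded the 301 responses; it uses the warcio library, and the file name is hypothetical:

```python
# Extract BEACON-style source|target lines from recorded redirects.
from warcio.archiveiterator import ArchiveIterator

with open('shortener-crawl.warc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != 'response' or record.http_headers is None:
            continue
        if record.http_headers.get_statuscode() != '301':
            continue  # only keep the redirect records
        source = record.rec_headers.get_header('WARC-Target-URI')
        target = record.http_headers.get_header('Location')
        if source and target:
            print(source + '|' + target)
```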
03:09 🔗 JesseW (well, I have actively avoided working on that -- but I'm glad to cheer on others who do)
03:10 🔗 JesseW that is a sensible argument for keeping the full headers from the terroroftinytown results, yes
03:12 🔗 * JesseW is encouraged by all this discussion to look over the contributions people have made recently and see what can be added to the tracker
03:17 🔗 JesseW shortdoi.org seems feasible to scrape
03:17 🔗 JesseW making a project now
03:19 🔗 wumpus the first urlteam-to-WARC project?
03:19 🔗 JesseW ha -- no, just another one using the currently lossy format
03:19 🔗 wumpus just kidding
03:20 🔗 JesseW heh. but if you keep poking, I might make a PR (although I'd much prefer if you or someone else did so)
03:21 🔗 JesseW ok, shortdoi-org started
03:22 🔗 JesseW note, it actually maps from dx.doi.org/10/ not shortdoi.org, as that is just an initial redirect
03:24 🔗 JesseW ok, 681 found so far
03:24 🔗 JesseW it seems to have missed some, though, which I'm confused by
03:32 🔗 bwn i've been working on aggregating the lists in my spare time, and was looking to do something similar to what you're discussing above (jessew: we discussed it briefly a while back: a run-through to make sure everything is in the wayback, plus an archivebot-like thing to grab anything that isn't)
03:33 🔗 bwn wb jessew, btw :)
03:33 🔗 JesseW yeah, I remembered you were working on that
03:33 🔗 JesseW I think someone else was, too -- maybe Frogging or vitzli (not sure if I'm misspelling their nick)...
03:34 🔗 Frogging sadly I haven't been working on anything
03:34 🔗 bwn i had started with luckcolor's list of dead shorteners, though, so we can't re-crawl them :\
03:35 🔗 Frogging except $dayjob
03:35 🔗 JesseW Frogging: ah, ok -- I think I got you confused with someone else
03:35 🔗 JesseW $dayjob is useful and important. At a minimum, it enables you to pay for #archivebot pipelines (which is *GREATLY* appreciated)
03:35 🔗 Frogging that is true :p
03:37 🔗 JesseW grumble -- I screwed up the alphabet on shortdoi-org
03:38 🔗 bwn jessew: i think there are some 3 digit identifiers, i was going to update the wiki.. i forgot to add a note about doing a bit more research
03:38 🔗 JesseW yeah, I knew about the 3 digit identifiers -- the error I made was that there are identifiers that *start* with "a", so "a" can't be the first character in the alphabet
03:39 🔗 bwn ah, cool
03:40 🔗 JesseW we may have missed various other sequential ones with a similar error
03:40 🔗 JesseW someone (else) should probably go through them and check
03:41 🔗 JesseW spot check, I mean -- then I can fix the alphabets and we can grab them
03:43 🔗 JesseW ok, shortdoi-org queue up to 40 -- it should be done in a couple of hours
03:48 🔗 JesseW flip-it is *still* going, since June 6th -- 21,867,543 found
03:53 🔗 bwn moar urls!
03:54 🔗 bwn ah, _i_ messed up the alphabet you mean.. :) i didn't see '0', oops
03:55 🔗 JesseW no, there *isn't* an '0' -- I just need something at the beginning, because the first character in the alphabet doesn't get used as an initial letter in the generated shortcodes
03:55 🔗 JesseW and 'a' *is* used that way in shortdoi-org
03:55 🔗 bwn ah, i gotcha now
03:56 🔗 JesseW yeah, it's basically just a bug in terroroftinytown
03:56 🔗 JesseW albeit one that (maybe) doesn't cause us to miss much, as a similar quirk may exist in various of the shorteners we work over
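A sketch of where the bug comes from: sequential shorteners (and terroroftinytown's enumeration of them) typically turn an integer ID into a shortcode by base conversion, so the character at index 0 of the alphabet acts as the digit zero and never appears in the leading position. The alphabet here is illustrative:

```python
# Integer-to-shortcode conversion as used by typical sequential
# shorteners; alphabet[0] plays the role of zero.
ALPHABET = '0abcdefghijklmnopqrstuvwxyz'

def int_to_code(n, alphabet=ALPHABET):
    base = len(alphabet)
    if n == 0:
        return alphabet[0]
    chars = []
    while n:
        n, rem = divmod(n, base)
        chars.append(alphabet[rem])
    return ''.join(reversed(chars))

# Just as no decimal number is written with a leading zero, no code
# longer than one character ever starts with alphabet[0] -- so if a
# shortener really does issue codes starting with 'a', then 'a'
# cannot be the first character of the alphabet.
```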
03:57 🔗 bwn ah
04:14 🔗 Start_ has joined #urlteam
04:14 🔗 Start has quit IRC (Ping timeout: 260 seconds)
04:16 🔗 JesseW shortdoi-org is done
05:00 🔗 JesseW has quit IRC (Ping timeout: 370 seconds)
05:17 🔗 wumpus That was... short.
05:18 🔗 wumpus (sorry)
05:21 🔗 bwn *drum hit*
05:22 🔗 bwn it gets through the sequential shorteners pretty quickly from what i've seen
05:23 🔗 bwn question re above: is there any way to massage the data we have for the dead shorteners into something that's usable for wayback/your 404 handler?
06:12 🔗 JesseW has joined #urlteam
06:38 🔗 wumpus The main issue is that we would like accurate crawl dates in WARC files.
06:39 🔗 wumpus so if we create WARC from another format, we'd like that info to be available. And, it does not exist in beacon.
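For comparison, each WARC record carries its own capture time in a WARC-Date header; a sketch of a response record's header block (values hypothetical):

```
WARC/1.0
WARC-Type: response
WARC-Target-URI: http://tinyurl.com/abc123
WARC-Date: 2016-06-30T03:24:00Z
Content-Type: application/http; msgtype=response
```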
06:39 🔗 JesseW they are dated approximately daily (although more recent ones are more like weekly)
06:41 🔗 JesseW but re-crawling all the non-dead ones is certainly a good idea, because as a nice side-effect, it would let us grab the target pages, too (which we currently don't)
06:41 🔗 wumpus So, if we're going to hand out affidavits to courts, as you can imagine "approximately" is not a good thing.
06:41 🔗 wumpus and indeed, it would be interesting to also have the target pages.
06:43 🔗 JesseW I'd hope that the question of "what exact minute did you check that this shortcode pointed to this address" wouldn't come up very often in affidavits -- but I can see the problem if there's no way to *mark* some entries as "circa"
06:43 🔗 wumpus Just as an example, we have an 80-billion-page horizontal crawl for which we have yet to figure out accurate dates. This makes me very sad.
06:43 🔗 dashcloud has quit IRC (Ping timeout: 244 seconds)
06:43 🔗 wumpus And no, no existing "circa" system.
06:43 🔗 JesseW what about the *very* early material? (I'm thinking of the BBC website from the very early 1990s that they gave you, which got converted)
06:43 🔗 wumpus I don't know about that one, yet.
06:44 🔗 * JesseW goes to try and dig up a link
06:44 🔗 wumpus Personally I'm eager to get some super-early Stanford crawls into the wayback, so that the initial Darwin Awards webpages are properly archived.
06:44 🔗 JesseW awesome
06:45 🔗 wumpus (the CS department asked Wendy to leave, because it was too popular. :-) )
06:46 🔗 JesseW ha
06:46 🔗 JesseW http://www.forbes.com/sites/kalevleetaru/2015/11/16/how-much-of-the-internet-does-the-wayback-machine-really-archive <- interesting
06:48 🔗 wumpus Uh, yeah. I encourage anyone interested in that article to try out https://web-beta.archive.org/ because it provides a lot more info than was available when that article was written.
06:48 🔗 JesseW heh
06:48 🔗 dashcloud has joined #urlteam
06:49 🔗 svchfoo3 sets mode: +o dashcloud
06:51 🔗 wumpus Among other things, you can see how important ArchiveTeam is for many sites. I never appreciated you guys properly until I built that thing.
06:51 🔗 wumpus Now I'm a fan!
06:51 🔗 JesseW hehheheh
06:51 🔗 wumpus Albeit not a fan of non-WARC stuff.
06:51 🔗 JesseW yeah, urlteam is somewhat of a red-headed stepchild in some ways
06:58 🔗 wumpus SketchCow mostly has you guys moving in the right direction, I'm just trying to fill in a few Wayback-specific details.
06:58 🔗 JesseW it's very welcome.
06:59 🔗 * JesseW is not finding the bbc collection I was thinking of -- I'll let you know if it pops up
07:01 🔗 dashcloud has quit IRC (Read error: Operation timed out)
07:04 🔗 bwn it sounds like something that monitors urlteam exports, grabs them and generates WARCs might be worthwhile going forward?
07:05 🔗 JesseW that would likely be a good workaround -- eventually, I think integrating full capture into terroroftinytown would be better
07:06 🔗 JesseW but even before making something that monitors, just writing a Warrior project that goes through the existing urlteam exports and makes WARCs from them would be great
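A rough sketch of that Warrior-project idea under a few assumptions: the export is a BEACON-style file of shortcode|target lines, the warcio library is available, and the shortener prefix and file names are made up. Redirects are fetched without being followed so the 301 itself gets recorded:

```python
# Re-crawl shortcodes from a urlteam export into a WARC.
from warcio.capture_http import capture_http
import requests  # per warcio docs, import after capture_http

with capture_http('urlteam-recrawl.warc.gz'):
    with open('tinyurl-export.txt') as beacon:
        for line in beacon:
            line = line.rstrip('\n')
            if not line or line.startswith('#'):
                continue  # skip blanks and BEACON meta fields
            shortcode, _known_target = line.split('|', 1)
            requests.get('http://tinyurl.com/' + shortcode,
                         allow_redirects=False)
```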
07:09 🔗 dashcloud has joined #urlteam
07:09 🔗 svchfoo3 sets mode: +o dashcloud
07:14 🔗 JesseW has quit IRC (Ping timeout: 370 seconds)
09:08 🔗 Fusl has quit IRC (Read error: Operation timed out)
09:12 🔗 WinterFox has joined #urlteam
11:40 🔗 luckcolor JesseW i would suggest splitting the work
11:40 🔗 luckcolor i mean
11:41 🔗 luckcolor if we are going that route i suggest that we keep the current setup
11:41 🔗 luckcolor and then have some servers that receive work
11:41 🔗 luckcolor and print out warcs with only the 301 records
11:42 🔗 luckcolor and then in other warcs we do an archive-only style crawl
11:42 🔗 luckcolor I mean i would prefer to have the data separated: beacon, beacon warc, archived-urls warc
11:59 🔗 Fusl has joined #urlteam
12:28 🔗 dashcloud has quit IRC (Read error: Operation timed out)
12:32 🔗 dashcloud has joined #urlteam
12:32 🔗 svchfoo3 sets mode: +o dashcloud
13:41 🔗 dashcloud has quit IRC (Read error: Operation timed out)
13:45 🔗 dashcloud has joined #urlteam
13:46 🔗 svchfoo3 sets mode: +o dashcloud
14:18 🔗 luckcolor has quit IRC (Remote host closed the connection)
14:19 🔗 luckcolor has joined #urlteam
14:24 🔗 luckcolor has quit IRC (Read error: Connection reset by peer)
14:25 🔗 luckcolor has joined #urlteam
14:33 🔗 luckcolor has quit IRC (Read error: Connection reset by peer)
14:33 🔗 luckcolor has joined #urlteam
14:43 🔗 luckcolor has quit IRC (Read error: Connection reset by peer)
14:55 🔗 luckcolor has joined #urlteam
15:21 🔗 WinterFox has quit IRC (Remote host closed the connection)
15:29 🔗 SilSte has quit IRC (Read error: Operation timed out)
15:30 🔗 JesseW has joined #urlteam
15:55 🔗 JesseW has quit IRC (Read error: Operation timed out)
16:10 🔗 JW_work1 luckcolor: that does make sense — but it does add load. Why not just save the full headers in the initial pass? (I agree about grabbing the targets in a separate pass)
16:10 🔗 Start_ is now known as Start
16:11 🔗 luckcolor well, in this manner we have the "simple to manage" text file with the list of urls (because that's what beacon basically is) and we also then have warc
16:11 🔗 luckcolor i mean, if we have a script to easily extract beacon or just a url list from warc, then yes, we can omit the first part
16:13 🔗 * luckcolor needs url lists for the resolver that he's making
16:14 🔗 JW_work1 yeah, I certainly would *produce* both beacon and WARCs
16:14 🔗 JW_work1 it's just a matter of whether the initial probing throws away the header info or not
16:14 🔗 luckcolor no ofc not
16:15 🔗 JW_work1 well, it currently *does* — that's what I was thinking we should (eventually) fix
16:15 🔗 luckcolor yeah the change would be so that the crawler will generate mini warcs
16:15 🔗 luckcolor that we can collect
16:15 🔗 luckcolor :P
16:16 🔗 JW_work1 yes
16:16 🔗 luckcolor mini warcs! NEW Urlteam technology
16:19 🔗 luckcolor if we are going this route i don't recommend using wpull
16:20 🔗 luckcolor as it would be a hassle to successfully ship it to the crawlers
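A sketch of what such a mini warc could look like from the crawler side, assuming the warcio library rather than wpull; the function and file names are illustrative, and the redirect is reconstructed from headers the crawler kept:

```python
# Write a single recorded 301 redirect as a tiny standalone WARC.
from io import BytesIO
from warcio.warcwriter import WARCWriter
from warcio.statusandheaders import StatusAndHeaders

def write_redirect_record(fileobj, short_url, location):
    writer = WARCWriter(fileobj, gzip=True)
    http_headers = StatusAndHeaders(
        '301 Moved Permanently',
        [('Location', location), ('Content-Length', '0')],
        protocol='HTTP/1.1')
    record = writer.create_warc_record(
        short_url, 'response',
        payload=BytesIO(b''),
        http_headers=http_headers)
    writer.write_record(record)

with open('mini.warc.gz', 'wb') as out:
    write_redirect_record(out, 'http://tinyurl.com/abc123',
                          'http://example.com/target')
```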
17:09 🔗 xmc yes! mini warcs!
17:09 🔗 xmc <3
18:01 🔗 luckcolor I know right
18:01 🔗 luckcolor So tiny and cute warcs :P
18:02 🔗 xmc like cocktail sausages
18:02 🔗 * luckcolor goes to look up what cocktail sausages are
18:03 🔗 luckcolor ah i see what you mean XD
18:06 🔗 VADemon has joined #urlteam
18:07 🔗 SilSte has joined #urlteam
18:33 🔗 luckcolor so JesseW do you approve of the tiny warc technology (idea) for urlteam crawlers? :)
18:34 🔗 luckcolor actually mini warcs
18:34 🔗 luckcolor because mini is better
19:02 🔗 JW_work1 sure, works for me
20:58 🔗 JW_work has joined #urlteam
20:59 🔗 JW_work1 has quit IRC (Read error: Operation timed out)
22:17 🔗 SilSte has quit IRC (Read error: Operation timed out)
22:26 🔗 SilSte has joined #urlteam
