[01:00] *** JesseW has joined #urlteam
[02:40] <wumpus> If urlteam data did go into the wayback, then all the users of our upcoming 404 browser integrations can enjoy urlteam's work.
[02:50] <Frogging> it would be easy to put it into wayback, wouldn't it? isn't it all WARCs with 301 records?
[02:52] <bwn> it's in beacon link dump format
[02:53] <bwn> https://gbv.github.io/beaconspec/beacon.html
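(For reference, a BEACON link dump is just plain text: a few #-prefixed meta lines, then one source|target mapping per line. An illustrative urlteam-style dump, with made-up shortcodes and targets, might look like this:)

    #FORMAT: BEACON
    #PREFIX: http://bit.ly/
    1a2b3c|http://example.com/some/long/page
    1a2b3d|http://example.org/another/page

(Note that nothing in the format records *when* each redirect was observed or what headers came back, which is the limitation discussed below.)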
[02:53] *** VADemon has quit IRC (Quit: left4dead)
[02:57] <xmc> i still can't believe that the stupid ad-hoc format that i came up with for my initial archiveteam experiments has become an internet-draft
[02:57] <xmc> makes me pretty sad
[03:01] <wumpus> no date info for WARC purposes
[03:01] <xmc> yeah it's a terrible format and i would like to go back in time and start saving warcs
[03:06] <Frogging> oh shit it's not warcs?
[03:06] <Frogging> not that I had any actual reason to think it was, but I'm surprised somewhat
[03:07] <JesseW> it's a lot smaller than WARCs would be
[03:07] <wumpus> You could recrawl the known IDs into WARCs, it's only 4.7 billion urls
[03:08] <JesseW> which I appreciate because it means I can mirror all of it more easily
[03:08] <JesseW> Yeah, I think re-crawling the known ones would be a great project.
[03:08] <JesseW> I just haven't gotten around to doing it
[03:09] <wumpus> Well, do you want the 404 handler of the web to process these links? Or to minimize your personal disk space? You can always go warc -> beacon ;-)
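(Going warc -> beacon is indeed the mechanical direction. A minimal sketch using the warcio library; the filename and the 301/302-only filter are assumptions about how such a dump would be made:)

    # Sketch: derive a beacon-style source|target list from a WARC of
    # redirect captures. Requires warcio (pip install warcio).
    from warcio.archiveiterator import ArchiveIterator

    with open('shortener-captures.warc.gz', 'rb') as stream:  # hypothetical file
        for record in ArchiveIterator(stream):
            if record.rec_type != 'response':
                continue
            if record.http_headers.get_statuscode() not in ('301', '302'):
                continue
            source = record.rec_headers.get_header('WARC-Target-URI')
            target = record.http_headers.get_header('Location')
            print('%s|%s' % (source, target))

(The reverse direction is the hard one, precisely because the beacon dumps never kept the headers or crawl dates.)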
[03:09] <JesseW> (well, I have actively avoided working on that -- but I'm glad to cheer on others who do)
[03:10] <JesseW> that is a sensible argument for keeping the full headers from the terroroftinytown results, yes
[03:12] * JesseW is encouraged by all this discussion to look over the contributions people have made recently and see what can be added to the tracker
[03:17] <JesseW> shortdoi.org seems feasible to scrape
[03:17] <JesseW> making a project now
[03:19] <wumpus> the first urlteam to WARC project?
[03:19] <JesseW> ha -- no, just another one using the currently lossy format
[03:19] <wumpus> just kidding
[03:20] <JesseW> heh. but if you keep poking, I might make a PR (although I'd much prefer if you or someone else did so)
[03:21] <JesseW> ok, shortdoi-org started
[03:22] <JesseW> note, it actually maps from dx.doi.org/10/ not shortdoi.org, as that is just an initial redirect
[03:24] <JesseW> ok, 681 found so far
[03:24] <JesseW> it seems to have missed some, though, which I'm confused by
[03:32] <bwn> i've been working on aggregating the lists in my spare time, i was looking to do something similar to what you're discussing above (jessew: we were discussing it briefly a while back, a run through to make sure everything is in the wayback, archivebot-like thing to get them if not)
[03:33] <bwn> wb jessew, btw :)
[03:33] <JesseW> yeah, I remembered you were working on that
[03:34] <JesseW> I think someone else was, too -- maybe Frogging or vitzli (not sure if I'm misspelling their nick)...
[03:34] <Frogging> sadly I haven't been working on anything
[03:34] <bwn> i had started with luckcolor's list of dead shorteners though so we can't re-crawl them :\
[03:35] <Frogging> except $dayjob
[03:35] <JesseW> Frogging: ah, ok -- I think I got you confused with someone else
[03:35] <JesseW> $dayjob is useful and important. At a minimum, it enables you to pay for #archivebot pipelines (which is *GREATLY* appreciated)
[03:35] <Frogging> that is true :p
[03:37] <JesseW> grumble -- I screwed up the alphabet on shortdoi-org
[03:38] <bwn> jessew: i think there are some 3 digit identifiers, i was going to update the wiki.. i forgot to add a note about doing a bit more research
[03:38] <JesseW> yeah, I knew about the 3 digit identifiers -- the error I made was that there are identifiers that *start* with "a", so "a" can't be the first character in the alphabet
[03:39] <bwn> ah, cool
[03:40] <JesseW> we may have missed various other sequential ones with a similar error
[03:40] <JesseW> someone (else) should probably go through them and check
[03:41] <JesseW> spot check, I mean -- then I can fix the alphabets and we can grab them
[03:43] <JesseW> ok, shortdoi-org queue up to 40 -- it should be done in a couple of hours
[03:48] <JesseW> flip-it is *still* going, since June 6th -- 21,867,543 found
[03:53] <bwn> moar urls!
[03:54] <bwn> ah, _i_ messed up the alphabet you mean.. :) i didn't see '0', oops
[03:55] <JesseW> no, there *isn't* an '0' -- I just need something at the beginning, because the first character in the alphabet doesn't get used as an initial letter in the generated shortcodes
[03:55] <JesseW> and 'a' *is* used that way in shortdoi-org
[03:55] <bwn> ah, i gotcha now
[03:56] <JesseW> yeah, it's basically just a bug in terroroftinytown
[03:56] <JesseW> albeit one that (maybe) doesn't cause us to miss much (as it may be a similar bug in various of the shorteners we work over)
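(The bug described here is plain positional notation: terroroftinytown's sequential projects count through shortcodes in base len(alphabet), and the alphabet's first character acts as the zero digit, so it can never lead a multi-character code. A minimal illustration, not the actual terroroftinytown code:)

    def int_to_shortcode(n, alphabet):
        # Standard base-N conversion; alphabet[0] plays the role of zero.
        base = len(alphabet)
        code = alphabet[n % base]
        n //= base
        while n:
            code = alphabet[n % base] + code
            n //= base
        return code

    # With alphabet 'abcdefghijklmnopqrstuvwxyz', counting 0, 1, 2, ...
    # yields a, b, ..., z, ba, bb, ... -- no code longer than one character
    # ever starts with 'a'. If the shortener really issues codes like 'axy',
    # they get skipped; hence the unused placeholder character at position zero.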
[03:57] <bwn> ah
[04:14] *** Start_ has joined #urlteam
[04:14] *** Start has quit IRC (Ping timeout: 260 seconds)
[04:16] <JesseW> shortdoi-org is done
[05:00] *** JesseW has quit IRC (Ping timeout: 370 seconds)
[05:17] <wumpus> That was... short.
[05:18] <wumpus> (sorry)
[05:21] <bwn> *drum hit*
[05:22] <bwn> it gets through the sequential shorteners pretty quickly from what i've seen
[05:23] <bwn> question from above: is there any way to massage the data we have for the dead shorteners into something that's usable for wayback/your 404 handler?
[05:23] <bwn> s/from/re/
[06:12] *** JesseW has joined #urlteam
[06:38] <wumpus> The main issue is that we would like accurate crawl dates in WARC files.
[06:39] <wumpus> so if we create WARC from another format, we'd like that info to be available. And, it does not exist in beacon.
[06:39] <JesseW> they are dated approximately daily (although more recent ones are more like weekly)
[06:41] <JesseW> but re-crawling all the non-dead ones is certainly a good idea, because as a nice side-effect, it would let us grab the target pages, too (which we currently don't)
[06:41] <wumpus> So, if we're going to hand out affidavits to courts, as you can imagine "approximately" is not a good thing.
[06:41] <wumpus> and indeed, it would be interesting to also have the target pages.
[06:43] <JesseW> I'd hope that the question of "what exact minute did you check that this shortcode pointed to this address" wouldn't come up very often in affidavits -- but I can see the problem if there's no way to *mark* some entries as "circa"
[06:43] <wumpus> Just as an example, we have an 80-billion-page horizontal crawl for which we have yet to figure out accurate dates. This makes me very sad.
[06:43] *** dashcloud has quit IRC (Ping timeout: 244 seconds)
[06:43] <wumpus> And no, no existing "circa" system.
[06:43] <JesseW> what about for the *very* early material (I'm thinking of the BBC website from the very early 1990s, that they gave you, and got converted)
[06:44] <wumpus> I don't know about that one, yet.
[06:44] * JesseW goes to try and dig up a link
[06:44] <wumpus> Personally I'm eager to get some super-early Stanford crawls into the wayback, so that the initial Darwin Awards webpages are properly archived.
[06:45] <JesseW> awesome
[06:46] <wumpus> (the CS department asked Wendy to leave, because it was too popular. :-) )
[06:46] <JesseW> ha
[06:46] <JesseW> http://www.forbes.com/sites/kalevleetaru/2015/11/16/how-much-of-the-internet-does-the-wayback-machine-really-archive <- interesting
[06:48] <wumpus> Uh, yeah. I encourage anyone interested in that article to try out https://web-beta.archive.org/ because it provides a lot more info than was available when that article was written.
[06:48] <JesseW> heh
[06:48] *** dashcloud has joined #urlteam
[06:49] *** svchfoo3 sets mode: +o dashcloud
[06:51] <wumpus> Among other things, you can see how important ArchiveTeam is for many sites. I never appreciated you guys properly until I built that thing.
[06:51] <wumpus> Now I'm a fan!
[06:51] <JesseW> hehheheh
[06:51] <wumpus> Albeit not a fan of non-WARC stuff.
[06:52] <JesseW> yeah, urlteam is somewhat of a red-headed stepchild in some ways
[06:58] <wumpus> SketchCow mostly has you guys moving in the right direction, I'm just trying to fill in a few Wayback-specific details.
[06:58] <JesseW> it's very welcome.
[06:59] * JesseW is not finding the bbc collection I was thinking of -- I'll let you know if it pops up
[07:01] *** dashcloud has quit IRC (Read error: Operation timed out)
[07:04] <bwn> it sounds like something that monitors urlteam exports, grabs them and generates WARCs might be worthwhile going forward?
[07:05] <JesseW> that would likely be a good workaround -- eventually, I think integrating full capture into terroroftinytown would be better
[07:06] <JesseW> but even before making something that monitors, just writing a Warrior project that goes through the existing urlteam exports and makes WARCs from them would be great
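(A sketch of what such a project could emit: synthesizing minimal 301 response records from an existing export with the warcio library. The helper, filenames, and URLs here are hypothetical, and WARC-Date can only be as accurate as the export's date stamp, i.e. "circa", which is exactly the caveat raised earlier:)

    from io import BytesIO
    from warcio.warcwriter import WARCWriter
    from warcio.statusandheaders import StatusAndHeaders

    def write_redirect(writer, short_url, target_url, export_date):
        # Build a synthetic HTTP/1.1 301 response for one known mapping.
        http_headers = StatusAndHeaders(
            '301 Moved Permanently',
            [('Location', target_url), ('Content-Length', '0')],
            protocol='HTTP/1.1')
        record = writer.create_warc_record(
            short_url, 'response',
            payload=BytesIO(b''),
            http_headers=http_headers,
            # Approximate: the export's date, not a true crawl time.
            warc_headers_dict={'WARC-Date': export_date})
        writer.write_record(record)

    with open('urlteam-export.warc.gz', 'wb') as out:  # hypothetical name
        writer = WARCWriter(out, gzip=True)
        write_redirect(writer, 'http://bit.ly/1a2b3c',
                       'http://example.com/page', '2016-06-06T00:00:00Z')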
[07:09] *** dashcloud has joined #urlteam
[07:09] *** svchfoo3 sets mode: +o dashcloud
[07:14] *** JesseW has quit IRC (Ping timeout: 370 seconds)
[09:08] *** Fusl has quit IRC (Read error: Operation timed out)
[09:12] *** WinterFox has joined #urlteam
[11:40] <luckcolor> JesseW i would suggest splitting the work
[11:40] <luckcolor> i mean
[11:41] <luckcolor> if we are going that route i suggest that we keep the current setup
[11:41] <luckcolor> and then have some servers that receive work
[11:41] <luckcolor> and print out warcs with only the 301 records
[11:42] <luckcolor> and then in other warcs we do an archive-only-style crawl
[11:42] <luckcolor> I mean i would prefer to have the data separated: beacon, beacon warc, archived urls warc
[11:59] *** Fusl has joined #urlteam
[12:28] *** dashcloud has quit IRC (Read error: Operation timed out)
[12:32] *** dashcloud has joined #urlteam
[12:32] *** svchfoo3 sets mode: +o dashcloud
[13:41] *** dashcloud has quit IRC (Read error: Operation timed out)
[13:45] *** dashcloud has joined #urlteam
[13:46] *** svchfoo3 sets mode: +o dashcloud
[14:18] *** luckcolor has quit IRC (Remote host closed the connection)
[14:19] *** luckcolor has joined #urlteam
[14:24] *** luckcolor has quit IRC (Read error: Connection reset by peer)
[14:25] *** luckcolor has joined #urlteam
[14:33] *** luckcolor has quit IRC (Read error: Connection reset by peer)
[14:33] *** luckcolor has joined #urlteam
[14:43] *** luckcolor has quit IRC (Read error: Connection reset by peer)
[14:55] *** luckcolor has joined #urlteam
[15:21] *** WinterFox has quit IRC (Remote host closed the connection)
[15:29] *** SilSte has quit IRC (Read error: Operation timed out)
[15:30] *** JesseW has joined #urlteam
[15:55] *** JesseW has quit IRC (Read error: Operation timed out)
[16:10] <JW_work1> luckcolor: that does make sense -- but it does add load. Why not just save the full headers in the initial pass? (I agree about grabbing the targets in a separate pass)
[16:10] *** Start_ is now known as Start
[16:11] <luckcolor> well in this manner we have the "simple to manage" text file with the list of urls because that's what beacon basically is and we also then have warc
[16:11] <luckcolor> i mean if we have a script to easily extract beacon or just a url list from warc then yes we can omit the first part
[16:13] * luckcolor needs url lists for the resolve that he's making
[16:13] * luckcolor *resolver
[16:14] <JW_work1> yeah, I certainly would *produce* both beacon and WARCs
[16:14] <JW_work1> it's just a matter of whether the initial probing throws away the header info or not
[16:15] <luckcolor> no ofc not
[16:15] <JW_work1> well, it currently *does* -- that's what I was thinking we should (eventually) fix
[16:15] <luckcolor> yeah the change would be so that the crawler will generate mini warcs
[16:15] <luckcolor> that we can collect
[16:16] <luckcolor> :P
[16:16] <JW_work1> yes
[16:16] <luckcolor> mini warcs team NEW Urlteam technology
[16:19] <luckcolor> *mini warcs! NEW Urlteam technology
[16:19] <luckcolor> if we are going this rate i don't recommend using wpull
[16:20] <luckcolor> as it would be a hassle to successfully ship it to the crawlers
[16:31] <luckcolor> *rate > route
[17:09] <xmc> yes! mini warcs!
[17:09] <xmc> <N3
[17:09] <xmc> <3
[18:01] <luckcolor> I know right
[18:01] <luckcolor> So tiny and cute warcs :P
[18:02] <xmc> like cocktail sausages
[18:02] * luckcolor goes to look up what cocktail sausages are
[18:03] <luckcolor> ah i see what you mean XD
[18:06] *** VADemon has joined #urlteam
[18:07] *** SilSte has joined #urlteam
[18:33] <luckcolor> so JesseW do you approve of the tiny warc technology (idea) for urlteam crawlers? :)
[18:34] <luckcolor> actually mini warcs
[18:34] <luckcolor> because mini is better
[19:02] <JW_work1> sure, works for me
[20:58] *** JW_work has joined #urlteam
[20:59] *** JW_work1 has quit IRC (Read error: Operation timed out)
[22:17] *** SilSte has quit IRC (Read error: Operation timed out)
[22:26] *** SilSte has joined #urlteam