00:17  *** Mayeau is now known as Mayonaise
00:36  *** kristian_ has quit IRC (Quit: Leaving)
01:23  *** dashcloud has quit IRC (Read error: Operation timed out)
01:26  *** dashcloud has joined #archiveteam-bs
01:31  *** krazedkat has quit IRC (Quit: Leaving)
02:19  *** VADemon has quit IRC (Quit: left4dead)
03:42  *** jrwr has quit IRC (Remote host closed the connection)
04:30  *** ravetcofx has quit IRC (Ping timeout: 1208 seconds)
04:40  *** ravetcofx has joined #archiveteam-bs
04:55  <alembic> is bayimg all backed up?
05:28  *** Sk1d has quit IRC (Ping timeout: 194 seconds)
05:34  *** Sk1d has joined #archiveteam-bs
06:09  *** ravetcofx has quit IRC (Read error: Operation timed out)
06:50  *** Somebody1 has quit IRC (Read error: Operation timed out)
07:22  *** Somebody1 has joined #archiveteam-bs
08:06  <Medowar> alembic: yes
08:11  <PurpleSym> Short braindump: The URL discovery problem is going to be one of our main concerns when archiving things (cf. Dropbox), so can we build our own URL database/site explorer?
08:11  <PurpleSym> We already got a lot of data from urlteam. And we have a lot of WARCs we could extract URLs from that were considered "off site" at the time of the crawl.
08:14  *** Start has quit IRC (Quit: Disconnected.)
08:15  *** BlueMaxim has quit IRC (Quit: Leaving)
08:21  *** schbirid has joined #archiveteam-bs
08:33  <Somebody1> PurpleSym: I'd be delighted if someone took the initiative to convert the URLTeam and ArchiveBot collections into an indexed database of all their URLs (and, even better, kept it up to date). It'd be non-trivial in size, but not *too* big, I think.
08:34  <PurpleSym> What size are we talking about here, Somebody1? GB? TB?
08:37  <yipdw> you can get an idea by downloading all the archivebot CDXes
08:38  <yipdw> (incidentally that's also a great start to an indexed database)
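A minimal sketch of what such a CDX-backed index could look like, assuming the common space-separated CDX format whose header line names its columns (e.g. " CDX N b a m s k r M S V g", where the "a" column holds the original URL); the file names are placeholders:

    # Build a SQLite URL index from an ArchiveBot CDX file.
    import sqlite3

    def index_cdx(cdx_path, db_path="urls.sqlite"):
        db = sqlite3.connect(db_path)
        db.execute("CREATE TABLE IF NOT EXISTS urls (url TEXT PRIMARY KEY)")
        with open(cdx_path, encoding="utf-8", errors="replace") as f:
            header = f.readline().split()    # ['CDX', 'N', 'b', 'a', ...]
            url_col = header.index("a") - 1  # data rows lack the 'CDX' marker
            for line in f:
                fields = line.split()
                if len(fields) > url_col:
                    db.execute("INSERT OR IGNORE INTO urls VALUES (?)",
                               (fields[url_col],))
        db.commit()
        db.close()

    index_cdx("example-job.cdx")

Deduplicating through the PRIMARY KEY keeps the table at one row per unique URL, which is what makes both a size estimate and later lookups cheap.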
08:42  *** Somebody1 has quit IRC (Ping timeout: 370 seconds)
08:43  <PurpleSym> That's a good idea. I'll try to get a size estimate from the CDXes.
08:46  *** wm_ has quit IRC (Ping timeout: 260 seconds)
08:46  *** midas1 has quit IRC (Ping timeout: 260 seconds)
08:47  *** Kenshin has quit IRC (Read error: Connection reset by peer)
08:47  *** Kenshin has joined #archiveteam-bs
08:48  *** midas1 has joined #archiveteam-bs
08:51  *** Somebody1 has joined #archiveteam-bs
08:52  <Somebody1> PurpleSym: IDK -- probably in the couple of GB range, I think...?
08:52  <PurpleSym> That sounds manageable.
08:54  *** wm_ has joined #archiveteam-bs
09:05  <Somebody1> so the (compressed) URLTeam data is a total of 200G
09:05  <Somebody1> but the format is silly, so storing it in a more sensible way might bring the size down
09:06  <Somebody1> IDK how big the ArchiveBot collection is
09:25  <HCross> PS: Well done for coming into something that's been running for years, and going "HERE'S A BETTER WAY.... IT'S BETTER..." There is method in the madness
09:31  <yipdw> it's the same data we've always had
09:34  *** DiscantX has joined #archiveteam-bs
09:34  <PurpleSym> URLteam, yes, but we do not extract *all* URLs from WARCs, do we? The CDXes only contain URLs archivebot actually retrieved.
09:35  <yipdw> a cdx for a WARC is a full index of that WARC so uh
09:35  <yipdw> I'm not sure what the question is
09:36  <PurpleSym> The HTML inside the WARC may contain links to sites that were not crawled due to exceptions or recursion limits.
09:37  <PurpleSym> These are not in the CDX.
09:37  <yipdw> right
09:38  <PurpleSym> These are the URLs I was talking about.
09:38  <yipdw> but that's fine because that URL doesn't exist in the WARC
09:39  <yipdw> oh ok
09:40  <yipdw> I mean, there's ways of varying accuracy to get those URLs
09:47  <Somebody1> HCross: which topic was your comment in reference to?
09:48  *** GE has joined #archiveteam-bs
09:49  <Somebody1> yipdw: I was thinking something as simple and stupid as a regex: https?://[^"]+
09:49  <yipdw> you'll want scheme-relative URLs and relative URLs
09:49  <yipdw> lots of webapps generate those
09:50  <Somebody1> ah, very good point -- and those are somewhat more of a PITA (although still not that bad)
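A sketch of covering those cases with Python's standard html.parser and urllib.parse (the example URLs are placeholders): urljoin resolves absolute, scheme-relative, and relative references uniformly against the page's own URL, so none of the three forms needs special handling.

    # Extract outlinks from HTML, including //host/path and ../relative forms.
    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class LinkExtractor(HTMLParser):
        def __init__(self, page_url):
            super().__init__()
            self.page_url = page_url
            self.links = set()

        def handle_starttag(self, tag, attrs):
            for name, value in attrs:
                if name in ("href", "src") and value:
                    # urljoin handles absolute, scheme-relative, and relative URLs
                    self.links.add(urljoin(self.page_url, value))

    p = LinkExtractor("http://example.com/a/b.html")
    p.feed('<a href="//cdn.example.net/x">1</a><img src="../img.png">')
    print(p.links)  # {'http://cdn.example.net/x', 'http://example.com/img.png'}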
09:51  <yipdw> that's more like covering a case where you might have a partial failure on a single site but I guess if you don't trust the CDXes you can't assume that either
09:52  *** krazedkat has joined #archiveteam-bs
09:53  <yipdw> come to think of it, that's not uncommon
09:53  <Somebody1> Hm, it doesn't look like the ArchiveBot JSON records whether or not offsite-links were grabbed: https://ia801200.us.archive.org/14/items/archiveteam_archivebot_go_20160802210001/0cch.com-inf-20160801-163213-5t2w6.json
09:53  <yipdw> those don't, the warcinfo record does
09:54  <HCross> because all we get is ShortURL = LongURL (at least I think)
09:54  <Somebody1> HCross: ?
09:56  <Somebody1> URLteam only stores mappings from short to long URLs, yes -- but I'm still confused about how this relates...?
09:58  <yipdw> specifically, if you see --span-hosts-allow page-requisites,linked-pages in the Wpull-Argv field of the warcinfo record, --no-offsite-links was off
09:58  *** Chii has joined #archiveteam-bs
09:58  <ranma> anybody have a fave SC story (such as https://youtu.be/6r65WwO1-uA?t=462 )?
09:58  <Chii> So You Want To Murder a Software Patent Jason Scott - [51m15s] 2014-09-27 - Adrian Crenshaw - 33,605 views
09:58  <yipdw> the JSON file is kind of useless for now
09:58  <ranma> ;part
09:59  *** Chii has left
09:58  <ranma> car crash / hotel anecdote
10:00  * ranma ducks
10:02  <Somebody1> yipdw: good to know about the Wpull-Argv field though
10:02  <yipdw> in a lot of cases the warcinfo record is the best place to go first
10:03  <yipdw> it's the first record of each wpull warc, so you can grab it by asking for e.g. the first 100 kB of the WARC
10:03  <yipdw> there's probably a more refined way to get just the first record
10:03  <yipdw> I think the only thing you *can't* get from that is the pipeline and job initiator nicks
10:03  <yipdw> but really who cares about those
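One more refined way, sketched with the warcio library (an assumption; any streaming WARC reader would do): its ArchiveIterator reads lazily, so stopping after the first record touches only the start of the file, and the Wpull-Argv test follows the flag mentioned above. The file name is a placeholder.

    # Read only the leading warcinfo record of a (gzipped) WARC and
    # check whether offsite links were enabled for the job.
    from warcio.archiveiterator import ArchiveIterator

    def offsite_links_enabled(warc_path):
        with open(warc_path, "rb") as f:
            for record in ArchiveIterator(f):
                if record.rec_type != "warcinfo":
                    return None  # first record isn't warcinfo: not a wpull-style WARC
                info = record.content_stream().read().decode("utf-8", "replace")
                # per the discussion above, this flag in Wpull-Argv means
                # --no-offsite-links was off
                return "--span-hosts-allow" in info and "linked-pages" in info
        return None

    print(offsite_links_enabled("example-job.warc.gz"))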
10:23  *** Somebody1 has quit IRC (Ping timeout: 370 seconds)
10:37  *** GE has quit IRC (Remote host closed the connection)
11:12  *** BartoCH has quit IRC (Ping timeout: 260 seconds)
11:15  *** BartoCH has joined #archiveteam-bs
11:48  *** Igloo has quit IRC (Ping timeout: 250 seconds)
11:49  *** Igloo has joined #archiveteam-bs
12:15  *** GE has joined #archiveteam-bs
12:19  *** GE has quit IRC (Remote host closed the connection)
12:39  *** DiscantX has quit IRC (Ping timeout: 244 seconds)
13:04  *** DiscantX has joined #archiveteam-bs
13:10  *** VADemon has joined #archiveteam-bs
13:29  *** Start has joined #archiveteam-bs
13:38  *** Start has quit IRC (Ping timeout: 506 seconds)
14:14  *** Kaz| has joined #archiveteam-bs
14:14  *** Kaz| has quit IRC (Client Quit)
14:33  *** Kaz has quit IRC (Quit: boop)
14:35  *** Kaz has joined #archiveteam-bs
14:44  *** Kaz has quit IRC (Read error: Connection reset by peer)
14:46  *** Kaz has joined #archiveteam-bs
15:22  *** DiscantX has quit IRC (Ping timeout: 244 seconds)
15:35  *** Start has joined #archiveteam-bs
15:38  *** Start has quit IRC (Client Quit)
15:40  *** Start has joined #archiveteam-bs
15:40  *** Start has quit IRC (Client Quit)
16:08  *** jspiros has quit IRC (Read error: Operation timed out)
16:12  *** jspiros has joined #archiveteam-bs
16:31  *** GE has joined #archiveteam-bs
16:57  *** Somebody1 has joined #archiveteam-bs
17:23  *** Somebody1 has quit IRC (Ping timeout: 370 seconds)
17:38  *** vitzli has joined #archiveteam-bs
17:49  *** Jonimus has joined #archiveteam-bs
17:49  *** swebb sets mode: +o Jonimus
17:50  <Jonimus> So my workplace's hosting provider apparently screwed something up a few months back such that our website has been down for quite a while, so we said screw them and moved to a simple Squarespace site
17:50  <Jonimus> but it looks like the wayback machine does not have a complete crawl of our old website from any recent time.
17:52  <Jonimus> If I fudge my /etc/hosts so that I can get the old site to work and do a WARC crawl, is there any way to backdate it and get it into the wayback machine as if it was a few days ago?
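(For reference, the /etc/hosts fudge is a single line pinning the old server's IP to the domain; the address and names below are placeholders:

    192.0.2.10    actualdomain.com www.actualdomain.com

The local resolver consults /etc/hosts before DNS, so the crawler is steered at the old box without touching the live DNS records.)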
17:52  <Jonimus> or is there a way to have archivebot do this?
17:54  <Frogging> I don't think that's allowed
17:55  <Smiley> no
17:55  <Smiley> you don't get to edit history
17:56  *** VADemon has quit IRC (Read error: Connection reset by peer)
17:57  <xmc> but please do that and upload a warc anyway. it just won't go into wayback, but it's still good to have!
17:58  <Frogging> yeah, by all means do a crawl. It's just Wayback that requires authenticity
18:02  <Jonimus> I figured as much, just thought I'd ask.
18:02  <xmc> oh, hi Jonimus. didn't recognize your name for some reason.
18:05  *** vitzli has quit IRC (Quit: Leaving)
18:06  <Jonimus> oh, hi, yeah I lurk in #archiveteam but I figured this was more of a -bs question
18:06  <Jonimus> side note, what are the wget parameters I'd want for good warc creation?
18:08  <xmc> 1 sec
18:09  <xmc> we have some suggestions here http://archiveteam.org/index.php?title=Wget
18:09  <xmc> specifically the "creating warc with wget" section
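Roughly the shape of invocation that wiki section describes (a sketch, not a canonical command; the job name and URL are placeholders, and all flags are standard wget options, with WARC output needing wget 1.14 or newer):

    wget --mirror --page-requisites \
         --warc-file=oldsite --warc-cdx \
         --warc-header "operator: your-name-here" \
         -e robots=off --wait 1 \
         "http://actualdomain.com/"

--warc-file writes oldsite.warc.gz alongside the normal mirror directory, and --warc-cdx emits the matching index.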
18:10  <xmc> if you're not opposed to installing new software, wpull is somewhat better than wget
18:10  <xmc> https://github.com/chfoo/wpull
18:10  <xmc> it allows you to stop and resume gracefully
18:10  <xmc> (among other things)
18:13  <Frogging> the wpull args can be a lot to swallow, I use ArchiveBot as a reference
18:13  <Frogging> https://github.com/ArchiveTeam/ArchiveBot/blob/master/pipeline/archivebot/seesaw/wpull.py#L22
18:19  <Jonimus> yeah well it looks like they still have broken things because when doing wget http://ip --header "Host: actualdomain.com" I get a 500 error
18:21  <Jonimus> so I'm not sure I'll be able to do an actual crawl correctly because it be broke.
18:21  <xmc> maybe it has two Host headers in the requests
18:29  <Jonimus> I checked and that does not appear to be the case as of wget 1.10, and this system has 1.17.1 with warc support.
18:29  <xmc> hm ok
18:33  <Jonimus> yeah it seems our old go-daddy reseller really broke things well, there is a reason we were gonna switch..
18:43  <Jonimus> Well it looks like it will be basically impossible to do this without fighting the host to fix their shit, so there goes that idea.
18:44  <Jonimus> I was able to make a backup of the site via ftp but that includes private info so I can't really upload it to archive.org without spending a bunch of time removing the private stuff.
19:36  *** Start has joined #archiveteam-bs
20:07  *** Aranje has joined #archiveteam-bs
20:58  *** Start has quit IRC (Quit: Disconnected.)
21:02  *** Start has joined #archiveteam-bs
21:03  *** Start has quit IRC (Client Quit)
21:16  *** Start has joined #archiveteam-bs
21:33  *** Start has quit IRC (Quit: Disconnected.)
21:47  *** schbirid has quit IRC (Quit: Leaving)
22:07  *** BlueMaxim has joined #archiveteam-bs
22:10  *** FalconK has joined #archiveteam-bs
22:21  *** Aranje has quit IRC (Ping timeout: 260 seconds)
22:21  *** Start has joined #archiveteam-bs
22:22  *** Start has quit IRC (Client Quit)
22:22  *** Madthias has joined #archiveteam-bs
22:39  <godane> so gawker sitemap grabs are up to 2016-11-30 now
23:24  *** Ravenloft has joined #archiveteam-bs
23:29  *** Ravenloft has quit IRC (Ping timeout: 272 seconds)