#archiveteam 2016-08-23,Tue


Time	Nickname	Message
00:21 ^🔗		kristian_ has quit IRC (Leaving)
00:39 ^🔗	aschmitz	Has anyone done any work on NPR's comments?
00:54 ^🔗	r3c0d3x	Asked about this a few days back, didn't get any response, so I'd assume no.
01:47 ^🔗		HCross has quit IRC (Ping timeout: 246 seconds)
01:47 ^🔗		HCross has joined #archiveteam
01:54 ^🔗		khaoohs has joined #archiveteam
02:03 ^🔗		khaoohs has quit IRC (Quit: Leaving)
02:10 ^🔗		tomwsmf has quit IRC (Read error: Operation timed out)
02:30 ^🔗		mr-b has left
02:45 ^🔗		db48x has joined #archiveteam
03:10 ^🔗		db48x` has joined #archiveteam
03:11 ^🔗		db48x has quit IRC (Read error: Operation timed out)
03:14 ^🔗		BartoCH has quit IRC (Ping timeout: 260 seconds)
03:15 ^🔗		BartoCH has joined #archiveteam
03:22 ^🔗		nicolas17 has quit IRC (Quit: U+1F634)
04:09 ^🔗		BartoCH has quit IRC (Ping timeout: 260 seconds)
04:12 ^🔗		BartoCH has joined #archiveteam
04:17 ^🔗		JesseW has joined #archiveteam
04:17 ^🔗		Sk1d has quit IRC (Ping timeout: 250 seconds)
04:24 ^🔗		Sk1d has joined #archiveteam
04:26 ^🔗	JesseW	we should probably get all the sites we can from http://www.users.totalise.co.uk as it appears to be a small ISP, in the process of being merged with another one (although they don't explicitly talk about shutting down the web sites)
04:29 ^🔗	JesseW	!ig 28j6lpt5lmtyrdi4dhfugpmto squarespace\.com
04:35 ^🔗		BartoCH has quit IRC (Ping timeout: 260 seconds)
04:35 ^🔗		HCross has quit IRC (Ping timeout: 246 seconds)
04:35 ^🔗		HCross has joined #archiveteam
04:43 ^🔗		DFJustin has quit IRC (Ping timeout: 260 seconds)
04:43 ^🔗		Meroje has quit IRC (Quit: bye!)
04:44 ^🔗		Meroje has joined #archiveteam
04:53 ^🔗		DFJustin has joined #archiveteam
04:53 ^🔗		swebb sets mode: +o DFJustin
05:05 ^🔗		DFJustin has quit IRC (Remote host closed the connection)
05:10 ^🔗		DFJustin has joined #archiveteam
05:15 ^🔗		HCross has quit IRC (Read error: Operation timed out)
05:15 ^🔗		HCross has joined #archiveteam
05:45 ^🔗	JesseW	I'm in the process of grabbing the ones I can find with archivebot
05:51 ^🔗		quails has quit IRC (Ping timeout: 250 seconds)
05:56 ^🔗		quails has joined #archiveteam
05:57 ^🔗		phuzion has quit IRC (Read error: Operation timed out)
05:58 ^🔗		phuzion has joined #archiveteam
06:04 ^🔗		patrickod has quit IRC (Read error: Operation timed out)
06:04 ^🔗		patrickod has joined #archiveteam
06:05 ^🔗		phuzion has quit IRC (Read error: Operation timed out)
06:05 ^🔗		sep332 has quit IRC (Read error: Operation timed out)
06:05 ^🔗		midas1 has quit IRC (Read error: Operation timed out)
06:07 ^🔗		midas1 has joined #archiveteam
06:07 ^🔗		swebb sets mode: +o midas1
06:07 ^🔗		sep332 has joined #archiveteam
06:10 ^🔗		Fake-Name has quit IRC (Ping timeout: 501 seconds)
06:13 ^🔗		BlueMaxim has joined #archiveteam
06:13 ^🔗		phuzion has joined #archiveteam
06:13 ^🔗		Fake-Name has joined #archiveteam
06:49 ^🔗		zerbrnky has joined #archiveteam
06:49 ^🔗	zerbrnky	hi all, anyone around?
06:49 ^🔗	tuankiet	Any problem?
06:50 ^🔗	JesseW	Zebranky: no. But ask whatever you were going to ask anyway...
06:50 ^🔗	zerbrnky	hm i should use a different nick D:
06:51 ^🔗	zerbrnky	i'm not Zebranky (i use a variant of this nick on places where longer names are allowed)
06:51 ^🔗		zerbrnky is now known as rbraun
06:51 ^🔗	JesseW	oops, sorry
06:51 ^🔗	rbraun	i was looking through the gawker dumps on archive.org and yeah, there might be a problem
06:52 ^🔗	JesseW	well, a lot of our most recent stuff may not have made it up there yet
06:52 ^🔗	JesseW	and we know about the robots.txt issues
06:52 ^🔗	JesseW	is there a different problem?
06:52 ^🔗	rbraun	it looks like they were grabbed by grabbing the sitemap for each month and then grabbing from there
06:52 ^🔗	rbraun	the problem is that the sitemap for especially busy months can't be grabbed a whole month at a time
06:53 ^🔗	JesseW	hm, yeah that could be an issue. godane?
06:53 ^🔗	rbraun	so e.g. everything in January 2010 before 1/19 is missing from both this: https://archive.org/details/gawker.com-sitemap-2010-20160322
06:53 ^🔗	rbraun	and from web.archive.org too
06:53 ^🔗	rbraun	rather it's not all missing from web.archive.org but some pages are
06:54 ^🔗	rbraun	and many of the pages that /are/ there weren't crawled this year, indicating the bulk grab in march didn't hit them
06:54 ^🔗	rbraun	this seems to be a bigger problem for older pages (probably back when they still paid their writers by the article)
06:54 ^🔗	JesseW	do you know of a way to get a list of the missing pages?
06:55 ^🔗	rbraun	yeah, you just see what the start date for the sitemap was and edit the end date to be that, iterate until it grabs thru the first of the month
06:55 ^🔗	rbraun	i'm working on it now but i was wondering if anyone had already done it
06:56 ^🔗	rbraun	e.g. january 2010 takes 3 pulls
06:56 ^🔗	rbraun	and then of course all the pages...
06:56 ^🔗	rbraun	(january 2012 is complete, though)
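The pull-and-narrow procedure rbraun describes above can be sketched roughly as follows. This is an illustrative sketch, not his script: `collect_window` and the `fetch(start, end)` callable are hypothetical names, and `fetch` is assumed to return `(url, timestamp)` pairs for whatever part of the window the sitemap actually returned. The end of the window is moved back to the end of the earliest day returned, keeping requests in whole-day increments.

```python
from datetime import datetime, time


def collect_window(fetch, start, end, max_pulls=100):
    """Re-request a truncated sitemap window until it reaches `start`.

    `fetch(start, end)` is a hypothetical callable returning a list of
    (url, timestamp) pairs for the part of the window the server returned.
    Returns (dict of url -> timestamp, number of pulls made).
    """
    found = {}
    pulls = 0
    while pulls < max_pulls:
        entries = fetch(start, end)
        pulls += 1
        if not entries:
            break
        for url, stamp in entries:
            found[url] = stamp
        earliest = min(stamp for _, stamp in entries)
        # Done once the response reaches back to the first day of the window.
        if earliest.date() <= start.date():
            break
        # Next pull ends at the last second of the earliest day seen so far.
        new_end = datetime.combine(earliest.date(), time(23, 59, 59))
        if new_end >= end:
            # No progress: a single day overflows the response, give up here.
            break
        end = new_end
    return found, pulls
```

Overlapping days are simply fetched twice and de-duplicated by URL, which trades a few redundant entries for not having to guess where the truncation fell.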
06:56 ^🔗	JesseW	godane is the person who has been working on it; hopefully he'll speak up
07:03 ^🔗		Honno has joined #archiveteam
07:11 ^🔗		JesseW has quit IRC (Ping timeout: 370 seconds)
07:11 ^🔗	rbraun	is there a faster way to force wayback to crawl a list of URLs than just loading http://web.archive.org/save/[URL] for each?
07:13 ^🔗	PurpleSym	Try #archivebot
07:18 ^🔗	rbraun	oh, nice, there is a non-recursive option
07:20 ^🔗	rbraun	archiveonly < FILE is probably what i need, thanks
07:21 ^🔗	rbraun	when archivebot uploads a WARC to archive.org, does it end up in web.archive.org too?
07:21 ^🔗	rbraun	in wayback, that is
07:23 ^🔗	PurpleSym	Yes, that’s the point.
07:23 ^🔗	rbraun	ok, thanks, this looks easier than i thought
07:24 ^🔗	rbraun	(fwiw i first discovered this issue when i noticed something from jan 2010 wasn't in wayback at all; then, found it wasn't in the collection i linked either)
07:28 ^🔗	rbraun	gut feeling is that 2007-2011 are affected in part
07:28 ^🔗	rbraun	(looking at http://gawkerdata.kinja.com/closing-the-book-on-gawker-com-1785555716)
07:31 ^🔗		REiN^ has quit IRC (Read error: Connection reset by peer)
07:33 ^🔗		phuzion has quit IRC (Read error: Operation timed out)
07:36 ^🔗		phuzion has joined #archiveteam
07:52 ^🔗		schbirid has joined #archiveteam
08:14 ^🔗	godane	based on site map its 2010-01-19 on: gawker.com/sitemap_bydate.xml?startTime=2010-01-01T00:00:00&endTime=2010-01-31T23:59:59
08:14 ^🔗	godane	ok i see the problem: http://gawker.com/sitemap_bydate.xml?startTime=2010-01-01T00:00:00&endTime=2010-01-01T23:59:59
08:15 ^🔗	godane	sometimes those sitemaps do some weird shit
08:16 ^🔗	rbraun	godane: do you have the missing ones or should i keep compiling them and feed them to archivebot?
08:16 ^🔗	rbraun	i have 2010 almost ready
08:17 ^🔗	godane	you can feed them into archivebot if you want to
08:17 ^🔗	godane	i will also see about doing it
08:17 ^🔗	rbraun	i checked several URLs from my file; some of them are in wayback and some not
08:18 ^🔗	rbraun	ok; i'm working on 2010 but i think all of 2007-11 might be affected based on volume
08:18 ^🔗	rbraun	(and the URLs not in wayback weren't saved in the big March dump, they were crawled earlier)
08:19 ^🔗	godane	i may be doing daily grabs now
08:19 ^🔗	godane	regrabs of what i got
08:19 ^🔗	rbraun	also really uncertain how long any of the site will stay up so
08:20 ^🔗	godane	i will work on gawker.com sitemap
08:20 ^🔗	rbraun	note that in every case i saw, if the monthly grab by default returned through X date, the original grab had all of those articles
08:21 ^🔗	rbraun	but not all the ones before that
08:26 ^🔗	godane	i'm redumping gawker.com as daily sitemap grabs
08:30 ^🔗	godane	kataku.com has the same problem
08:30 ^🔗	godane	*kotaku.com
08:37 ^🔗		REiN^ has joined #archiveteam
08:43 ^🔗		WinterFox has joined #archiveteam
08:58 ^🔗	rbraun	godane: do you want what i have for 2010? might save some time
08:59 ^🔗	godane	its not going to save me time sadly
09:00 ^🔗	godane	my script makes a run at the sitemap by the day now
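A minimal sketch of the by-day approach, assuming the `sitemap_bydate.xml` URL shape godane pasted earlier in this log. The function name `daily_sitemap_urls` is hypothetical and this is not godane's script; it only builds one whole-day request URL per calendar day of a month.

```python
from calendar import monthrange
from datetime import date


def daily_sitemap_urls(host, year, month):
    """Build one whole-day sitemap_bydate.xml URL per day of the month."""
    days_in_month = monthrange(year, month)[1]
    urls = []
    for day_number in range(1, days_in_month + 1):
        day = date(year, month, day_number).isoformat()
        urls.append(
            f"http://{host}/sitemap_bydate.xml"
            f"?startTime={day}T00:00:00&endTime={day}T23:59:59"
        )
    return urls
```

For example, `daily_sitemap_urls("gawker.com", 2010, 1)` yields 31 URLs, each of which could then be handed to curl or a similar fetcher.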
09:00 ^🔗	rbraun	ok
09:00 ^🔗	godane	also i will have to do that with all of gawker sites
09:00 ^🔗	rbraun	some of them i think don't have enough articles for this to have been an issue
09:01 ^🔗	rbraun	not sure which ones though
09:01 ^🔗	godane	i have uploaded some of those
09:01 ^🔗	godane	they were in the 10 to 100mb
09:01 ^🔗	godane	range
09:02 ^🔗	rbraun	might save the crawler time at least to not have to recrawl what's known already in the archive?
09:02 ^🔗		BartoCH has joined #archiveteam
09:02 ^🔗	rbraun	(several different ways to do that; i was just using the date cutoff)
09:05 ^🔗	rbraun	also i'm not sure how much time is left for gawker.com specifically
09:08 ^🔗	godane	btw the sitemap cut off is weird
09:09 ^🔗	godane	like for 2008-11 i can get 3034 urls with gawker but only 1971 urls with kotaku.com
09:09 ^🔗	Medowar	google code is empty. Can someone requeue
09:12 ^🔗	rbraun	godane: there are fewer articles total for that month on kotaku though
09:13 ^🔗	rbraun	godane: for 2008-11 if i request the whole month it cuts off at the 14th for gawker but the 6th for kotaku
09:13 ^🔗	rbraun	oh, i see
09:13 ^🔗	rbraun	yeah, why didn't it grab the whole month for kotaku...
09:13 ^🔗	rbraun	fwiw their own sitemaps link in 1-week increments
09:14 ^🔗	rbraun	http://gawker.com/sitemap.xml
09:14 ^🔗	rbraun	not sure i trust that given how uneven it is but i haven't found a case where it failed yet
09:17 ^🔗		HCross has quit IRC (Ping timeout: 246 seconds)
09:17 ^🔗		HCross has joined #archiveteam
09:22 ^🔗	godane	sitemaps for 2006-01 are starting to be uploaded: https://archive.org/details/gawker.com-sitemap-2006-01-09-20160823
09:24 ^🔗	godane	i'm doing 11 months of daily sitemaps at once :-D
09:24 ^🔗	rbraun	that's going to produce a lot of collections... any reason not to combine those by month?
09:24 ^🔗	rbraun	also FYI while investigating this, the sitemap_bydate.xml was giving me 500 errors sometimes
09:25 ^🔗	rbraun	that was reliable if i didn't request whole-day increments
09:25 ^🔗	rbraun	but it happened some other times too; just reloading fixed it
09:25 ^🔗	godane	my script uses curl to grab the sitemap by day then starts the download
09:26 ^🔗	rbraun	why not cat those together like a month at a time?
09:27 ^🔗	godane	cause i was not planning on doing that
09:27 ^🔗	rbraun	well, the reason i ask is
09:27 ^🔗	rbraun	the sitemaps provide an index of article titles
09:28 ^🔗	rbraun	so if i know gawker published an article in 1/2010 but i don't know which day...
09:28 ^🔗	rbraun	and i only know one word of the title or something
09:28 ^🔗	godane	https://archive.org/details/archiveteam-fire?and[]=subject%3A%22www.dailymail.co.uk%22
09:28 ^🔗	godane	i do it by date of sitemap
09:29 ^🔗	rbraun	it's also easier to verify everything is in there if it's in larger chunks
09:29 ^🔗		vOYtEC has quit IRC (Ping timeout: 244 seconds)
09:29 ^🔗	godane	if i make a monthly sitemap it may confuse me
09:30 ^🔗	godane	into thinking it was done with the old method, where the gawker sitemap doesn't get everything
09:30 ^🔗	godane	so the daily dumps are meant to be different since the month and yearly failed
09:31 ^🔗	godane	i can turn the daily dumps into monthly or yearly for that reason
09:32 ^🔗	rbraun	hmm ok
09:34 ^🔗	godane	i'm mostly trying to keep the raw sitemap urls the same set as date of urls
09:34 ^🔗		HCross2 has quit IRC (Quit: Connection closed for inactivity)
09:37 ^🔗		schbird has joined #archiveteam
09:37 ^🔗	schbird	is there a way to record mouse/keyboard interaction with webrecorder.io or a similar tool?
09:37 ^🔗	rbraun	godane: can your script handle the case where it returns a 500 error and retry?
09:38 ^🔗	schbird	to actually replay all "user" interaction
09:38 ^🔗	rbraun	godane: i guess curl --retry 10 or something
09:41 ^🔗	rbraun	i was getting those intermittently even on sitemap_bydate requests that would later complete
09:49 ^🔗	godane	i'm not really getting those errors
09:49 ^🔗	godane	i get them on days that don't exist i think
09:49 ^🔗	rbraun	no, i get empty files (or with the front page only) on days that don't exist
09:50 ^🔗	rbraun	i get 500 errors when it's cranky or if i try to pull a partial day (which doesn't work)
09:50 ^🔗	rbraun	but in the former case i had to retry a few times
09:50 ^🔗	rbraun	if you pass --retry <#> to curl with some number of retries allowed, you should have no problem though
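For a script that does not go through curl, the same retry idea could look roughly like this. It is a sketch under assumptions: `fetch_with_retry` is a hypothetical helper, the retry count and delay are arbitrary, and it retries only on HTTP 5xx responses, which is a narrower rule than curl's own notion of a transient error.

```python
import time
import urllib.error
import urllib.request


def fetch_with_retry(url, retries=10, delay=1.0, opener=urllib.request.urlopen):
    """Fetch `url`, retrying on HTTP 5xx responses up to `retries` times.

    `opener` is injectable so the retry logic can be exercised offline.
    """
    last_error = None
    for attempt in range(retries + 1):
        try:
            with opener(url) as response:
                return response.read()
        except urllib.error.HTTPError as err:
            if err.code < 500:
                # Client errors (e.g. 404) will not improve on retry.
                raise
            last_error = err
            if attempt < retries:
                time.sleep(delay)
    raise last_error
```

Since an empty or front-page-only response is what a nonexistent day returns, a caller would still need to check the body rather than treat every successful fetch as a real sitemap.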
09:52 ^🔗	rbraun	only getting that on the sitemaps occasionally, not the actual pages
09:52 ^🔗		Selavi has quit IRC (Ping timeout: 260 seconds)
09:53 ^🔗		Kksmkrn has joined #archiveteam
09:53 ^🔗		Kksmkrn has quit IRC (Connection closed)
09:53 ^🔗		Kksmkrn has joined #archiveteam
09:53 ^🔗	godane	i'm going to bed now
09:53 ^🔗	godane	i will continue tomorrow
09:54 ^🔗	rbraun	ok good night
10:00 ^🔗		Selavi has joined #archiveteam
10:09 ^🔗		BartoCH has quit IRC (Ping timeout: 260 seconds)
10:16 ^🔗		BartoCH has joined #archiveteam
10:28 ^🔗		enr1c0 has joined #archiveteam
10:31 ^🔗		Kksmkrn has quit IRC (Quit: leaving)
11:00 ^🔗		enr1c0 has quit IRC (Quit: ZNC 1.6.3+deb1 - http://znc.in)
11:00 ^🔗		enr1c0 has joined #archiveteam
11:01 ^🔗		enr1c0 has left
11:26 ^🔗		enr1c0 has joined #archiveteam
11:30 ^🔗		enr1c0 has quit IRC (Client Quit)
11:31 ^🔗		enr1c0 has joined #archiveteam
11:31 ^🔗		enr1c0 has left
11:35 ^🔗		HCross has quit IRC (Ping timeout: 246 seconds)
11:35 ^🔗		HCross has joined #archiveteam
12:28 ^🔗		irl has joined #archiveteam
12:29 ^🔗	irl	ok, so i was here a while ago and i'm trying to archive a whole bunch of paper manuals and documents from the 70s-90s from obscure networking hardware and computer programs relating to networking and such
12:30 ^🔗	irl	following a complete mess trying to use the university's MFD devices (they scan to email only, and couldn't do large attachments, so i was limited to ~5 pages)
12:30 ^🔗	irl	i've now decided i want to buy a scanner with an ADF to sit in the lab
12:30 ^🔗	irl	can anyone recommend such a scanner that can handle various paper types, and paper with binding holes etc. that isn't going to break constantly?
12:31 ^🔗	irl	ideally it would have linux support and not be networked, but direct into the pc
12:31 ^🔗	irl	ideally it would also be fast-ish, but i'll take reliability over speed
12:32 ^🔗	PurpleSym	I recently built a 25€ DIY book scanner, but it’s quite slow.
12:32 ^🔗	irl	i'm talking ~10,000 ish pages of manuals
12:32 ^🔗	irl	they're mostly A4 paper that's been punched and hand-bound
12:32 ^🔗	PurpleSym	So, destructive scanning then?
12:33 ^🔗	irl	with those plastic binding things
12:33 ^🔗	PurpleSym	I see.
12:33 ^🔗	irl	my hope is to be able to just put the plastic things back on them afterwards
12:34 ^🔗	irl	i've looked through ebay for scanners with adf, but i have no idea how reliable these things are
12:35 ^🔗	irl	the HP 9200C 9200 Digital Sender seems to come up a lot and looks quite heavy duty
12:35 ^🔗		BartoCH has quit IRC (Ping timeout: 260 seconds)
12:37 ^🔗		BartoCH has joined #archiveteam
12:46 ^🔗	irl	purchased a 9200c, seems to have good reviews
12:47 ^🔗	irl	i'm guessing a lot of these things will have valid copyright
12:47 ^🔗	irl	any advice on how to work out what i can publish and what i shouldn't publish?
12:48 ^🔗	irl	is there a place i can stash things until the copyright expires?
12:50 ^🔗	joepie91	irl: IA :)
12:50 ^🔗	joepie91	irl: IA will dark things if they get complaints
12:51 ^🔗	joepie91	where 'dark' === "it's still in the archives but not publicly accessible"
12:51 ^🔗	joepie91	(also you might want to talk to SketchCow regarding manuals)
12:51 ^🔗		atomotic has joined #archiveteam
13:03 ^🔗		BartoCH has quit IRC (Ping timeout: 260 seconds)
13:04 ^🔗		BlueMaxim has quit IRC (Quit: Leaving)
13:04 ^🔗		BartoCH has joined #archiveteam
13:21 ^🔗	irl	joepie91: ah cool (:
13:21 ^🔗	irl	so i can basically automate most of this then using scanner->ftp->git-annex-assistant->ia
13:21 ^🔗	irl	just need to get the right metadata in the right places
13:21 ^🔗	irl	SketchCow: i might want to talk to you
13:24 ^🔗		WinterFox has quit IRC (Read error: Operation timed out)
13:27 ^🔗		beardicus has quit IRC (bye)
13:27 ^🔗		atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
13:28 ^🔗		dashcloud has quit IRC (Read error: Operation timed out)
13:31 ^🔗		beardicus has joined #archiveteam
13:31 ^🔗		swebb sets mode: +o beardicus
13:35 ^🔗		beardicus has quit IRC (Client Quit)
13:37 ^🔗		beardicus has joined #archiveteam
13:37 ^🔗		swebb sets mode: +o beardicus
13:45 ^🔗		BartoCH has quit IRC (Ping timeout: 260 seconds)
13:45 ^🔗		BartoCH has joined #archiveteam
13:46 ^🔗		wp494 has quit IRC (Read error: Operation timed out)
13:47 ^🔗		dashcloud has joined #archiveteam
14:42 ^🔗		tomwsmf has joined #archiveteam
14:47 ^🔗		BartoCH has quit IRC (Ping timeout: 260 seconds)
14:52 ^🔗		irl_ has joined #archiveteam
14:53 ^🔗		irl has quit IRC (Quit: WeeChat 1.5)
14:53 ^🔗		irl_ is now known as irl
14:56 ^🔗		irl has quit IRC (Client Quit)
14:57 ^🔗		irl has joined #archiveteam
15:00 ^🔗	irl	SketchCow: if you're interested in old manuals, i can get you a list of the things we maybe have
15:01 ^🔗		nicolas17 has joined #archiveteam
15:01 ^🔗	irl	SketchCow: i'm now idling here via znc, so i'll see when you respond as i guess you're not around right now
15:02 ^🔗	irl	i'll be at debian uk bbq eating burgers this weekend, but will probably start a go at this the following weekend
15:02 ^🔗	irl	(slow start - not diving in)
15:15 ^🔗		wp494 has joined #archiveteam
15:18 ^🔗		JesseW has joined #archiveteam
15:23 ^🔗		schbird has quit IRC (Read error: Operation timed out)
15:25 ^🔗		JesseW has quit IRC (Read error: Operation timed out)
15:34 ^🔗		BartoCH has joined #archiveteam
15:56 ^🔗		VADemon has joined #archiveteam
16:13 ^🔗	SketchCow	Hugs to irl
16:35 ^🔗	irl	hello
16:35 ^🔗	irl	SketchCow:
16:35 ^🔗	irl	still here?
16:42 ^🔗		HCross2 has joined #archiveteam
16:42 ^🔗	SketchCow	Yep.
16:43 ^🔗	SketchCow	So much talking. Come to #archiveteam-bs
17:00 ^🔗		tomaspark has quit IRC (Ping timeout: 255 seconds)
17:02 ^🔗		db48x` is now known as db48x
17:06 ^🔗	arkiver	bayimg is online again
17:06 ^🔗	arkiver	I restarted the script
17:06 ^🔗	arkiver	it's not yet in the warrior
17:06 ^🔗	arkiver	http://tracker.archiveteam.org/bayimg/
17:06 ^🔗	arkiver	* restarted the projects
17:06 ^🔗	arkiver	project*
17:18 ^🔗	SketchCow	OK SO FINALLY
17:19 ^🔗	SketchCow	http://fos.textfiles.com/pipeline.html is in version 1.0. It'll run once a day (with an indication of when it was run). It's NOT real-time, it's just a way for you nerds to notice what's going on on the site, and be able to communicate with me or each other on a status.
17:20 ^🔗	SketchCow	It's Inbox --> Outbox --> IA, and if there's interruptions at IA, the Outbox might fill and "work" but will leave some items untouched.
17:20 ^🔗	arkiver	some projects seem to be missing
17:21 ^🔗	SketchCow	It's generating right now, but Orkut is such a nightmare, it will sit there for a while. I added another black-label "line" at the bottom of the table so you can see the difference between "running" and done. Looks like 10-15 minutes of disk thrashing to get through the mess.
17:21 ^🔗	arkiver	I see
17:21 ^🔗	SketchCow	In the future, when it has the second black line at the bottom, if it's not there, it's not in the pipeline.
17:22 ^🔗	SketchCow	The script in the future will probably run in 5 minutes, as long as insanities like orkut aren't going on.
17:23 ^🔗	SketchCow	So for example, the WHOLE pipeline is backed up (google code is at 187g) because of Orkut
17:23 ^🔗	arkiver	Yep
17:23 ^🔗	SketchCow	But at least now, in the future, one of you can go "Hey, looks like boombox project is at 300gb for some reason" and we can jump on that.
17:24 ^🔗	SketchCow	Or "it's time to add an upload script to this or that project"
17:24 ^🔗	arkiver	orkut is going down in 8 days, so just a little more time
17:26 ^🔗	DigDug	i thought orkut was long gone
17:26 ^🔗	arkiver	still here as an archive https://orkut.google.com/en.html
17:27 ^🔗	Kaz	are we on track to finish orkut? I have more available if FOS can handle it, if needed.
17:27 ^🔗	arkiver	I think we're going to make it
17:27 ^🔗	arkiver	Tomorrow or the day after we're going to retry the larger communities, so you might have to do a little less concurrent
17:28 ^🔗	Kaz	nod
17:28 ^🔗	arkiver	But I'll warn you before we do that
17:28 ^🔗	arkiver	the larger communities can be millions of posts
17:28 ^🔗	arkiver	(and URLs)
17:29 ^🔗	SketchCow	So, the script is going to finish running, and I'm going to make two improvements.
17:29 ^🔗	SketchCow	First, it will not copy over the finished .html file until it's 100% done, so in the future, it's just "there" and not "in progress"
17:30 ^🔗	SketchCow	Second, I'm going to make a "cheat sheet" which will occasionally be forgotten by me to update but will change the "Project" name into something better.
17:30 ^🔗	nicolas17	I tried archiving orkut and it seemed like you didn't need more nodes
17:30 ^🔗	nicolas17	since most of the time I got rate-limiting by the tracker anyway
17:31 ^🔗	nicolas17	so the download rate was limited by that setting, not by how many people were running the warrior
17:42 ^🔗		AlexLehm has joined #archiveteam
17:54 ^🔗	SketchCow	http://fos.textfiles.com/pipeline.html just finished.
17:55 ^🔗	SketchCow	NOW you can rain down questions
18:00 ^🔗		schbird has joined #archiveteam
18:25 ^🔗		pfallenop has quit IRC (Ping timeout: 260 seconds)
18:25 ^🔗		pfallenop has joined #archiveteam
18:30 ^🔗		schbird has quit IRC (Read error: Operation timed out)
18:34 ^🔗		Zialus has quit IRC (Read error: Operation timed out)
18:34 ^🔗	HCross2	arkiver: let me know when, and I'll reduce my quarter of a trillion concurrent
18:38 ^🔗		Zialus has joined #archiveteam
18:40 ^🔗	arkiver	SketchCow: it would be nice if it also shows megaWARC size
19:19 ^🔗		VerifiedJ has quit IRC (http://www.kiwiirc.com/ - A hand crafted IRC client)
19:24 ^🔗	SketchCow	Not really easy to do that, since stuff will be either out or stuck.
19:24 ^🔗	SketchCow	Oh wait.
19:24 ^🔗	SketchCow	Mmm, let me see
19:28 ^🔗	SketchCow	I got it working. It's re-running and it'll update with it after it's done.
19:29 ^🔗	SketchCow	(almost all are 40gb but it's trivial to print it)
19:29 ^🔗	SketchCow	if someone wants to be a hero and wiki all this, go ahead
19:52 ^🔗		BartoCH has quit IRC (Ping timeout: 260 seconds)
19:58 ^🔗		BartoCH has joined #archiveteam
20:39 ^🔗	ats	one that needs crawling for the magazines collection: http://www.muzines.co.uk
20:39 ^🔗	ats	sadly it has a stupid obnoxious Javascript-based interface...
20:47 ^🔗	schbirid	seems to work mostly fine without js here
21:04 ^🔗		HCross2 has quit IRC (Quit: Connection closed for inactivity)
21:13 ^🔗		Morbus has joined #archiveteam
21:15 ^🔗		VerifiedJ has joined #archiveteam
21:19 ^🔗		schbird has joined #archiveteam
21:27 ^🔗	VerifiedJ	GTAGaming.com's database was compromised and they may be thinking about shutting the website down along with www.gta4-mods.com. http://www.gtagaming.com/news/comments.php?i=2369
21:40 ^🔗		Honno has quit IRC (Read error: Operation timed out)
21:50 ^🔗		VerifiedJ has left
21:58 ^🔗		vOYtEC has joined #archiveteam
21:59 ^🔗		schbird has quit IRC (Leaving)
22:07 ^🔗		schbirid2 has joined #archiveteam
22:10 ^🔗		schbirid has quit IRC (Read error: Operation timed out)
22:33 ^🔗		RichardG has joined #archiveteam
22:42 ^🔗		schbirid2 has quit IRC (Read error: Operation timed out)
22:45 ^🔗		schbirid2 has joined #archiveteam
22:47 ^🔗		AlexLehm has quit IRC (Ping timeout: 260 seconds)
23:16 ^🔗		JW_work1 has joined #archiveteam
23:18 ^🔗		JW_work has quit IRC (Read error: Operation timed out)
23:23 ^🔗		RichardG has quit IRC (Read error: Operation timed out)
23:28 ^🔗	SketchCow	Who here can read an ext3 disk and is comfortable with possibly having to do a dd and then extracting of data
23:28 ^🔗	SketchCow	US preferred
23:29 ^🔗	nicolas17	you mean a physical disk, or?
23:36 ^🔗	SketchCow	Physical, here in front of me.
23:37 ^🔗	*	nicolas17 is physically too far
23:37 ^🔗	Frogging	what's involved in it? i.e. why can't you do it?
23:37 ^🔗		rchrch has joined #archiveteam
23:37 ^🔗	SketchCow	Don't want to
23:37 ^🔗	Frogging	ah
23:37 ^🔗	SketchCow	If you're asking what's involved, you're not for the job
23:38 ^🔗	nicolas17	well, he's asking eg. is it a corrupted ext3 you have to recover things out of, or just a clean filesystem but you have no Linux? :P
23:38 ^🔗	Frogging	yeah, basically^
23:39 ^🔗	Frogging	I can do magic with block devices but I'm not so good at fixing physically broken disks
23:39 ^🔗	SketchCow	Not broken
23:40 ^🔗	Frogging	ah. can you ship?
23:43 ^🔗	Frogging	I assume so because you said US preferred. I'm in Canada though. But if nobody closer wants to then I volunteer
23:44 ^🔗	Frogging	I enjoy this sort of thing
23:45 ^🔗		kristian_ has joined #archiveteam
23:48 ^🔗		RichardG has joined #archiveteam
23:56 ^🔗		Stiletto has quit IRC (Ping timeout: 246 seconds)
23:57 ^🔗	SketchCow	You're in line
23:57 ^🔗	SketchCow	We'll see if anyone else in the US wants it.
23:58 ^🔗	SketchCow	I can sustain a canadian mailing
