#archiveteam 2013-08-09,Fri

Time	Nickname	Message
01:52 ^🔗	wp494	asking again since it wasn't done: would anyone with admin access to the wiki add the puush project to the main page?
01:57 ^🔗	ivan`	winr4r can do that but he ain't here
07:52 ^🔗	SmileyG	http://support.tu.com/t5/English-Community/ct-p/English_Community - Hi, we're sorry to say that TU Me is being discontinued. During the next month we'll be phasing out our support for TU Me, which means that after Sunday September 8th it will be inoperable. For your safety and security, all of your messages, photos and data will be permanently deleted from our servers and the app will not be available for ...
07:52 ^🔗	SmileyG	... download via the App Store, the Google Play Store, and the BlackBerry App World. If you have any questions or queries, please email us at support@tu.com.
07:53 ^🔗	SmileyG	Has a forum, can someone grab?
07:53 ^🔗	SmileyG	i may no longer have access to this machine later, so yeah.
07:56 ^🔗	ivan`	I am warc'ing support. and www.
07:57 ^🔗	SmileyG	ty
07:57 ^🔗	SmileyG	hell, warc the whole site :) (i'm guessing that pretty much IS the whole thing).
08:12 ^🔗	SmileyG	-rw-r--r-- 1 tim.bowers games 982M Aug 7 00:11 ./tribes_forum_06082013.warc
08:12 ^🔗	SmileyG	and then it 404'ed
08:14 ^🔗	SmileyG	no cdx tho o_O
19:47 ^🔗	ATZ0	So... I've gone off on a rant about this in the past: http://www.zdnet.com/aol-patch-upheaval-hundreds-of-layoffs-but-also-new-ceo-7000019218/ - we can has start mirroring patch.com sites?
19:48 ^🔗	ATZ0	I looked into it at like 1am one evening, and didn't take notes, but when I tried wget they were blocking my IP after X requests
19:49 ^🔗	ATZ0	As for why this is important - "As for the present, 400 Patch sites are on the chopping block to either be merged with another site -- or shuttered altogether."
19:49 ^🔗	antomatic	erk
19:49 ^🔗	ATZ0	Given the closure of small newspapers, these sites represent the last sliver of local media in a lot of these markets
19:58 ^🔗	ATZ0	i'll fill in what detail i know. basically - http://www.patch.com/ , click a state, you get the local patch sites that pop up.
19:58 ^🔗	ATZ0	the sites aggregate/share content across multiple local sites, so if you can somehow dedupe the storage it should take up a lot less room.
20:00 ^🔗	ATZ0	wget pointed at the root of the site, i.e. lakeridge.patch.com, with appropriate options was working for me, but after a certain number of grabs it was then throwing me back a "You've been doing that too much" or similar message, which I figured out was IP-tied (non-cookie based), and if I recall correctly after 20-30 minutes it allowed resumption of downloading
20:01 ^🔗	ATZ0	not sure if that throttling is per local site, or patch.com overall.
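The dedupe idea ATZ0 raises can be sketched as content-addressed storage: hash each page body and keep one copy per digest. This is only an illustration; `dedupe_by_digest` and the sample URLs in the usage note are hypothetical names, and nothing in the log says how (or whether) the project actually deduplicated.

```python
import hashlib
from typing import Dict, Iterable, Tuple


def dedupe_by_digest(
    pages: Iterable[Tuple[str, bytes]],
) -> Tuple[Dict[str, bytes], Dict[str, str]]:
    """Store each distinct body once, keyed by its SHA-1 digest.

    Returns (store, index): store maps digest -> body,
    index maps url -> digest so every URL can still be resolved.
    """
    store: Dict[str, bytes] = {}
    index: Dict[str, str] = {}
    for url, body in pages:
        digest = hashlib.sha1(body).hexdigest()
        # Only the first copy of a given body is kept.
        store.setdefault(digest, body)
        index[url] = digest
    return store, index
```

If the same article body is served from two local sites, `store` holds it once while `index` keeps both URLs pointing at it.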
20:01 ^🔗	antomatic	any idea what kind of size a patch.com local site is, on disc, roughly?
20:02 ^🔗	ATZ0	it's going to vary based on the age of the site
20:02 ^🔗	ATZ0	they didn't all just spring up at once
20:02 ^🔗	antomatic	ah
20:02 ^🔗	ATZ0	plus the frequency/usage of it. say some oklahoma site who has a very active editor and a large user population might be 4x the size of a lesser used site where it never caught on
20:07 ^🔗	ATZ0	http://techcrunch.com/2013/08/09/armstrong-confirms-hundreds-of-layoffs-at-patch-400-sites-shuttered-or-partnered-off-and-a-new-ceo/ - this is more urgent than i thought
20:07 ^🔗	ATZ0	not 400 possibly closing, 400 will be closing/"partnered"
20:07 ^🔗	ATZ0	let's not let AOL destroy the last 3 years of local news in some places
20:08 ^🔗	antomatic	seems like a good choice for archiving.
20:12 ^🔗	ATZ0	and now that i've waved my arms in the air and screamed fire, i'll let the heroes who know how to actually script this stuff hopefully run with it and let my happy archiveteam warrior contribute to the mongol horde.
20:15 ^🔗	antomatic	Crikey, from the page source of patch.com I make it 909 individual sites
20:15 ^🔗	ATZ0	that sounds about right.
20:16 ^🔗	godane	so it turns out ntfs-3g freezes slackware when copying files
20:17 ^🔗	godane	i thought it was cause i was copying + downloading + uploading from the same drive
20:17 ^🔗	godane	to another ntfs drive
20:30 ^🔗	antomatic	http://archiveteam.org/index.php?title=List_of_Patch.com_sites
20:32 ^🔗	ATZ0	you know, when you say 909 it doesn't sound imposing but then that list ...
20:32 ^🔗	ATZ0	O_o
20:34 ^🔗	antomatic	Aah... how long could it possibly take? :)
20:35 ^🔗	ATZ0	other than the throttling, not that long i think
20:35 ^🔗	DFJustin	we downloaded a million splinder sites :P
20:53 ^🔗	yipdw	well, I'm grabbing altadena.patch.com as a test
20:55 ^🔗	yipdw	seems like the usual wget-warc commands are doing fine
20:56 ^🔗	ATZ0	i'm curious if it starts to die on you like it did for me
20:56 ^🔗	yipdw	define die
20:56 ^🔗	yipdw	I'm running with --random-wait, --wait 1
20:56 ^🔗	ATZ0	started getting html pages returned with throttling, but i probably wasnt using the wait like that
20:56 ^🔗	yipdw	so wget waits between 0.5 and 1.5 seconds between requests
20:56 ^🔗	ATZ0	it was late, i may have been drunk :)
20:57 ^🔗	yipdw	or more specifically, I'm running https://gist.github.com/yipdw/04e3883a9cdb87735fc4
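The pacing yipdw describes (a base wait of 1 second, randomized to between 0.5 and 1.5 seconds) can be sketched in a few lines. This is not the linked gist, whose contents are not shown in the log; `random_wait_seconds` and `paced` are hypothetical helper names used only to illustrate the 0.5x to 1.5x rule.

```python
import random
import time
from typing import Callable, Iterable, Iterator


def random_wait_seconds(base_wait: float) -> float:
    """Pick a delay between 0.5x and 1.5x of base_wait."""
    return random.uniform(0.5 * base_wait, 1.5 * base_wait)


def paced(
    urls: Iterable[str],
    base_wait: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[str]:
    """Yield URLs one at a time, sleeping a randomized delay between them."""
    for position, url in enumerate(urls):
        if position > 0:
            # No delay before the first request, one before each later one.
            sleep(random_wait_seconds(base_wait))
        yield url
```

Passing a different `sleep` callable makes it easy to record the delays without actually waiting.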
21:04 ^🔗	antomatic	I suppose the other question is how many of these sites are already in the Wayback machine with reasonably-recent crawls.
21:04 ^🔗	antomatic	hm, might have a way to do that..
21:05 ^🔗	yipdw	antomatic: one site per person might be doable
21:05 ^🔗	yipdw	e.g. with a tracker
21:06 ^🔗	yipdw	I'll set one up
21:07 ^🔗	yipdw	SmileyG: can you get a project on tracker.archiveteam.org set up for this patch grab?
21:08 ^🔗	yipdw	in the meantime, I'll deploy universal-tracker elsewhere
21:19 ^🔗	antomatic	Right, got a script running at the moment checking which patches are in and not in wayback
21:19 ^🔗	antomatic	(suspect they all will be, but obviously of differing ages)
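antomatic's script is not shown in the log. One way such a check could look, assuming the Wayback Machine availability endpoint (`https://archive.org/wayback/available`, not mentioned in the log) and its JSON response shape, is sketched below. The helper names are hypothetical, and the network request itself is left out so that only URL building and response parsing are shown.

```python
import json
from typing import Optional
from urllib.parse import urlencode

# Assumption: the availability endpoint, not something named in the log.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_url(site: str) -> str:
    """Build the query URL for one site."""
    return AVAILABILITY_ENDPOINT + "?" + urlencode({"url": site})


def closest_snapshot_timestamp(response_text: str) -> Optional[str]:
    """Return the closest snapshot's timestamp (YYYYMMDDhhmmss), or None."""
    payload = json.loads(response_text)
    closest = payload.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest.get("timestamp")
```

A timestamp starting with `201305` would correspond to a May 2013 crawl; `None` would mean no snapshot was reported for that site.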
21:19 ^🔗	ATZ0	i feel like Hannibal Smith, I love it when a plan comes together
21:20 ^🔗	ATZ0	(insert parody A-Team logo here)
21:22 ^🔗	ATZ0	in 2013, a crack-addled-brained website of an idea was sent to death by a CEO for a fate they didn't deserve. This website promptly escaped near death from a maximum-security disk wipe to the ArchiveTeam underground. Today, still wanted by the corporation that spawned it, it survives as an archive of history. If you have a problem...if no one else can help... and if you can find them...
21:22 ^🔗	ATZ0	maybe you can hire...The Archive Team.
21:26 ^🔗	Nemo_bis	nice https://www.eff.org/press/releases/judge-grants-preliminary-injunction-protect-free-speech-after-eff-challenge
21:32 ^🔗	antomatic	The results, like the doctor, is IN. Or are in. Hm. Anyway...
21:33 ^🔗	antomatic	Yes, every single Patch has some presence in Wayback.
21:33 ^🔗	antomatic	That might have been a stupidly obvious question, though.
21:34 ^🔗	antomatic	Most of the few crawls I've checked manually seem to date from May 2013
21:35 ^🔗	antomatic	Some later, too.
21:39 ^🔗	godane	uploaded: https://archive.org/details/Y2K_Family_Survival_Guide_With_Leonard_Nimoy_Palsojom1.X264.CG
21:42 ^🔗	DFJustin	nice find
21:48 ^🔗	yipdw	antomatic: well, good to know there's at least a backup
21:48 ^🔗	antomatic	(nod)
21:48 ^🔗	antomatic	May still be worth doing updated grabs, of course.
21:58 ^🔗	omf_	We always try to grab closing sites because even if something is on IA there is no guarantee of the depth of that crawl.
21:58 ^🔗	antomatic	(nods)
21:59 ^🔗	omf_	and crawl frequency only indicates the newer content was grabbed.
21:59 ^🔗	omf_	having stuff already on IA helps us for URL discovery which is annoying at best
22:00 ^🔗	omf_	I will do a common crawl search as well, it takes a while though since it is a 22gb bz2 file :D
22:01 ^🔗	omf_	I think it took 240 minutes last time
22:01 ^🔗	omf_	I know, I know I need to upgrade to SSD
22:02 ^🔗	antomatic	I cheated by scraping every site out of the source of www.patch.com - but I guess if the common crawl knows of any sites that might not be indexed from Patch.com itself..?
22:02 ^🔗	antomatic	Don't know if any older sites have come and gone in patch's lifetime
22:02 ^🔗	antomatic	etc.
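Scraping every site out of the page source, as antomatic describes, can be sketched with a regular expression over the HTML. The actual method used is not shown in the log; `patch_sites` is a hypothetical helper, and the pattern assumes local sites appear as `<name>.patch.com` hostnames somewhere in the markup.

```python
import re
from typing import List

# Matches hostnames such as altadena.patch.com anywhere in the markup.
PATCH_HOST = re.compile(r"\b([a-z0-9-]+)\.patch\.com\b", re.IGNORECASE)


def patch_sites(html: str) -> List[str]:
    """Return the sorted, de-duplicated local Patch hostnames found in html."""
    hosts = {
        match.group(1).lower() + ".patch.com"
        for match in PATCH_HOST.finditer(html)
    }
    # The portal itself is not a local site.
    hosts.discard("www.patch.com")
    return sorted(hosts)
```

Hostnames gathered from another source, such as a Common Crawl search, could be merged into the same set to catch sites no longer linked from the front page.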
22:08 ^🔗	godane	so is anyone going to go after theisozone.com?
22:08 ^🔗	godane	i ask cause i find lots of stuff on there
22:08 ^🔗	godane	i'm getting a psm dec 2004 iso
22:12 ^🔗	omf_	I will look into it now godane
22:13 ^🔗	omf_	ugh cloudstore file hosting
22:19 ^🔗	godane	i know
22:21 ^🔗	godane	good news is that cloudstore supports resume
22:30 ^🔗	omf_	That is not the problem with sites like these. You have to drive a program that supports javascript in order to trigger the download on the cloudstore's web page; a link to the file is not exposed. Granted, some of these shit sites already have workaround libraries and tools
22:31 ^🔗	omf_	The upside is theisozone has a very sane url scheme
22:35 ^🔗	godane	i got a head start on mirroring glenn beck highlights
22:35 ^🔗	godane	trying to mirror that cause i don't really have video of everything on his network
22:36 ^🔗	godane	this sort of solves this and this stuff should hopefully not need to go dark cause it's on the guy's website
22:36 ^🔗	godane	for free
23:06 ^🔗	godane	now cloudstore has something special
23:07 ^🔗	godane	i think the url only lasts 1 hour
23:07 ^🔗	godane	but you grab a new url and keep going
23:42 ^🔗	godane	that was a waste of time
23:43 ^🔗	godane	unless you're downloading small files on theisozone you'll most likely be out of luck
