#archiveteam 2013-08-09,Fri


Time Nickname Message
01:52 πŸ”— wp494 asking again since it wasn't done: would anyone with admin access to the wiki add the puush project to the main page?
01:57 πŸ”— ivan` winr4r can do that but he ain't here
07:52 πŸ”— SmileyG http://support.tu.com/t5/English-Community/ct-p/English_Community - Hi, we're sorry to say that TU Me is being discontinued. During the next month we'll be phasing out our support for TU Me, which means that after Sunday September 8th it will be inoperable. For your safety and security, all of your messages, photos and data will be permanently deleted from our servers and the app will not be available for ...
07:52 πŸ”— SmileyG ... download via the App Store, the Google Play Store, and the BlackBerry App World. If you have any questions or queries, please email us at support@tu.com.
07:53 πŸ”— SmileyG Has a forum, can someone grab?
07:53 πŸ”— SmileyG i may no longer have access to this machine later, so yeah.
07:56 πŸ”— ivan` I am warc'ing support. and www.
07:57 πŸ”— SmileyG ty
07:57 πŸ”— SmileyG hell, warc the whole site :) (i'm guessing that pretty much IS the whole thing).
08:12 πŸ”— SmileyG -rw-r--r-- 1 tim.bowers games 982M Aug 7 00:11 ./tribes_forum_06082013.warc
08:12 πŸ”— SmileyG and then it 404'ed
08:14 πŸ”— SmileyG no cdx tho o_O
19:47 πŸ”— ATZ0 So... I've gone off on a rant about this in the past: http://www.zdnet.com/aol-patch-upheaval-hundreds-of-layoffs-but-also-new-ceo-7000019218/ - we can has start mirroring patch.com sites?
19:48 πŸ”— ATZ0 I looked into it at like 1am one evening, and didn't take notes, but when I tried wget they were blocking my IP after X requests
19:49 πŸ”— ATZ0 As for why this is important - "As for the present, 400 Patch sites are on the chopping block to either be merged with another site -- or shuttered altogether."
19:49 πŸ”— antomatic erk
19:49 πŸ”— ATZ0 Given the closure of small newspapers, these sites represent the last sliver of local media in a lot of these markets
19:58 πŸ”— ATZ0 i'll fill in what detail i know. basically - http://www.patch.com/ , click a state, you get the local patch sites that pop up.
19:58 πŸ”— ATZ0 the sites aggregate/share content across multiple local sites, so if you can somehow dedupe the storage it should take up a lot less room.
20:00 πŸ”— ATZ0 wget pointed at the root of the site ie lakeridge.patch.com with appropriate options was working for me, but after a certain number of grabs it was then throwing me back a "You've been doing that too much" or similar message, which I figured out was IP-tied (non-cookie based), and if I recall correctly after 20-30 minutes it allowed resumption of downloading
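A crude way to handle that kind of soft ban is to sniff the response body for the throttle page and sleep before retrying. A sketch only; the marker string and backoff length are guesses reconstructed from the description above:

```python
import time

THROTTLE_MARKER = "You've been doing that too much"   # guessed wording
BACKOFF_SECONDS = 30 * 60                             # ~20-30 min per the report above

def is_throttled(body_text):
    """True if the server returned its rate-limit page instead of content."""
    return THROTTLE_MARKER in body_text

def fetch_with_backoff(fetch, url, max_tries=3, sleep=time.sleep):
    """Call fetch(url); on a throttle page, wait out the ban and retry."""
    for _ in range(max_tries):
        body = fetch(url)
        if not is_throttled(body):
            return body
        sleep(BACKOFF_SECONDS)
    raise RuntimeError("still throttled after %d tries: %s" % (max_tries, url))
```

The `sleep` parameter is injectable so the backoff can be tested without actually waiting.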
20:01 πŸ”— ATZ0 not sure if that throttling is per local site, or patch.com overall.
20:01 πŸ”— antomatic any idea what kind of size a patch.com local site is, on disc, roughly?
20:02 πŸ”— ATZ0 it's going to vary based on the age of the site
20:02 πŸ”— ATZ0 they didn't all just spring up at once
20:02 πŸ”— antomatic ah
20:02 πŸ”— ATZ0 plus the frequency/usage of it. say some oklahoma site who has a very active editor and a large user population might be 4x the size of a lesser used site where it never caught on
20:07 πŸ”— ATZ0 http://techcrunch.com/2013/08/09/armstrong-confirms-hundreds-of-layoffs-at-patch-400-sites-shuttered-or-partnered-off-and-a-new-ceo/ - this is more urgent than i thought
20:07 πŸ”— ATZ0 not 400 possibly closing, 400 will be closing/"partnered"
20:07 πŸ”— ATZ0 let's not let AOL destroy the last 3 years of local news in some places
20:08 πŸ”— antomatic seems like a good choice for archiving.
20:12 πŸ”— ATZ0 and now that i've waved my arms in the air and screamed fire, i'll let the heroes who know how to actually script this stuff hopefully run with it and let my happy archiveteam warrior contribute to the mongol horde.
20:15 πŸ”— antomatic Crikey, from the page source of patch.com I make it 909 individual sites
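Pulling that subdomain list out of the homepage source can be done with a regex over the hrefs. A sketch under the assumption that local sites are linked as `something.patch.com` (the actual markup was not recorded here):

```python
import re

SITE_RE = re.compile(r'https?://([a-z0-9-]+)\.patch\.com')

def extract_sites(html):
    """Return the sorted set of *.patch.com subdomains linked in `html`."""
    skip = {"www"}   # the portal itself, not a local site
    return sorted(set(SITE_RE.findall(html)) - skip)
```

Feeding this the saved source of www.patch.com should reproduce the ~909-entry list.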
20:15 πŸ”— ATZ0 that sounds about right.
20:16 πŸ”— godane so it turns out ntfs-3g freezes slackware when copying files
20:17 πŸ”— godane i thought it was cause i was copying + downloading + uploading from the same drive
20:17 πŸ”— godane to another ntfs drive
20:30 πŸ”— antomatic http://archiveteam.org/index.php?title=List_of_Patch.com_sites
20:32 πŸ”— ATZ0 you know, when you say 909 it doesn't sound imposing but then that list ...
20:32 πŸ”— ATZ0 O_o
20:34 πŸ”— antomatic Aah... how long could it possibly take? :)
20:35 πŸ”— ATZ0 other than the throttling, not that long i think
20:35 πŸ”— DFJustin we downloaded a million splinder sites :P
20:53 πŸ”— yipdw well, I'm grabbing altadena.patch.com as a test
20:55 πŸ”— yipdw seems like the usual wget-warc commands are doing fine
20:56 πŸ”— ATZ0 i'm curious if it starts to die on you like it did for me
20:56 πŸ”— yipdw define die
20:56 πŸ”— yipdw I'm running with --random-wait, --wait 1
20:56 πŸ”— ATZ0 started getting html pages returned with throttling, but i probably wasn't using the wait like that
20:56 πŸ”— yipdw so wget waits between 0.5 and 1.5 seconds between requests
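For anyone scripting the same politeness delay outside wget: `--random-wait` multiplies the base `--wait` by a uniform factor in [0.5, 1.5], which is easy to mimic. A sketch; `polite_wait` is a hypothetical helper name:

```python
import random

def polite_wait(base=1.0, rng=random.random):
    """Delay in seconds mimicking wget --wait=base --random-wait:
    uniform in [0.5 * base, 1.5 * base]."""
    return base * (0.5 + rng())
```

Call `time.sleep(polite_wait(1.0))` between requests to match the 0.5-1.5 s behaviour described above.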
20:56 πŸ”— ATZ0 it was late, i may have been drunk :)
20:57 πŸ”— yipdw or more specifically, I'm running https://gist.github.com/yipdw/04e3883a9cdb87735fc4
21:04 πŸ”— antomatic I suppose the other question is how many of these sites are already in the Wayback machine with reasonably-recent crawls.
21:04 πŸ”— antomatic hm, might have a way to do that..
21:05 πŸ”— yipdw antomatic: one site per person might be doable
21:05 πŸ”— yipdw e.g. with a tracker
21:06 πŸ”— yipdw I'll set one up
21:07 πŸ”— yipdw SmileyG: can you get a project on tracker.archiveteam.org set up for this patch grab?
21:08 πŸ”— yipdw in the meantime, I'll deploy universal-tracker elsewhere
21:19 πŸ”— antomatic Right, got a script running at the moment checking which patches are in and not in wayback
21:19 πŸ”— antomatic (suspect they all will be, but obviously of differing ages)
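One way to script that presence check is the Wayback availability endpoint, which returns JSON describing the closest snapshot for a URL. The endpoint shape here follows archive.org's published description, but treat it as an assumption:

```python
import json
import urllib.request   # only needed for the live query

API = "http://archive.org/wayback/available?url=%s"

def snapshot_info(payload):
    """Extract (archived?, timestamp) from an availability-API response dict."""
    snap = payload.get("archived_snapshots", {}).get("closest")
    if not snap or not snap.get("available"):
        return False, None
    return True, snap.get("timestamp")

def check_site(host):
    """Live query: is `host` in the Wayback Machine, and how fresh is it?"""
    with urllib.request.urlopen(API % host) as resp:
        return snapshot_info(json.load(resp))
```

Running `check_site` over the full site list and sorting by timestamp would show which Patches have only stale crawls.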
21:19 πŸ”— ATZ0 i feel like Hannibal Smith, I love it when a plan comes together
21:20 πŸ”— ATZ0 (insert parody A-Team logo here)
21:22 πŸ”— ATZ0 in 2013, a crack-addled-brained website of an idea was sent to death by a CEO for a fate they didn't deserve. This website promptly escaped near death from a maximum-security disk wipe to the ArchiveTeam underground. Today, still wanted by the corporation that spawned it, it survives as an archive of history. If you have a problem...if no one else can help... and if you can find them...
21:22 πŸ”— ATZ0 maybe you can hire...The Archive Team.
21:26 πŸ”— Nemo_bis nice https://www.eff.org/press/releases/judge-grants-preliminary-injunction-protect-free-speech-after-eff-challenge
21:32 πŸ”— antomatic The results, like the doctor, is IN. Or are in. Hm. Anyway...
21:33 πŸ”— antomatic Yes, every single Patch has some presence in Wayback.
21:33 πŸ”— antomatic That might have been a stupidly obvious question, though.
21:34 πŸ”— antomatic Most of the few crawls I've checked manually seem to date from May 2013
21:35 πŸ”— antomatic Some later, too.
21:39 πŸ”— godane uploaded: https://archive.org/details/Y2K_Family_Survival_Guide_With_Leonard_Nimoy_Palsojom1.X264.CG
21:42 πŸ”— DFJustin nice find
21:48 πŸ”— yipdw antomatic: well, good to know there's at least a backup
21:48 πŸ”— antomatic (nod)
21:48 πŸ”— antomatic May still be worth doing updated grabs, of course.
21:58 πŸ”— omf_ We always try to grab closing sites because even if something is on IA there is no guarantee of the depth of that crawl.
21:58 πŸ”— antomatic (nods)
21:59 πŸ”— omf_ and crawl frequency only indicates the newer content was grabbed.
21:59 πŸ”— omf_ having stuff already on IA helps us for URL discovery which is annoying at best
22:00 πŸ”— omf_ I will do a common crawl search as well, it takes a while though since it is a 22gb bz2 file :D
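Streaming the bz2 dump instead of decompressing it to disk keeps that search I/O-bound rather than space-bound. A sketch with Python's `bz2` module, decompressing line by line:

```python
import bz2

def grep_bz2(path, needle):
    """Yield lines containing `needle` from a .bz2 file, decompressing
    on the fly so the archive never has to hit disk uncompressed."""
    with bz2.open(path, "rt", errors="replace") as fh:
        for line in fh:
            if needle in line:
                yield line.rstrip("\n")
```

Usage would be `for url in grep_bz2("commoncrawl-index.bz2", ".patch.com"): ...`; still slow on a 22 GB file, but it never needs the decompressed copy.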
22:01 πŸ”— omf_ I think it took 240 minutes last time
22:01 πŸ”— omf_ I know, I know I need to upgrade to SSD
22:02 πŸ”— antomatic I cheated by scraping every site out of the source of www.patch.com - but I guess if the common crawl knows of any sites that might not be indexed from Patch.com itself..?
22:02 πŸ”— antomatic Don't know if any older sites have come and gone in patch's lifetime
22:02 πŸ”— antomatic etc.
22:08 πŸ”— godane so is anyone going to go after theisozone.com?
22:08 πŸ”— godane i ask cause i find lots of stuff on there
22:08 πŸ”— godane i'm getting a psm dec 2004 iso
22:12 πŸ”— omf_ I will look into it now godane
22:13 πŸ”— omf_ ugh cloudstore file hosting
22:19 πŸ”— godane i know
22:21 πŸ”— godane good news is that cloudstore supports resume
22:30 πŸ”— omf_ That is not the problem with sites like these. You have to drive a javascript-capable client in order to trigger the download on cloudstore's web page; a direct link to the file is not exposed. Granted some of these shit sites already have workaround libraries and tools
22:31 πŸ”— omf_ The upside is theisozone has a very sane url scheme
22:35 πŸ”— godane i got a head start on mirroring glenn beck highlights
22:35 πŸ”— godane trying to mirror that cause i don't really have video of everything on his network
22:36 πŸ”— godane this sort of solves that, and this stuff should hopefully not need to go dark cause it's on the guy's website
22:36 πŸ”— godane for free
23:06 πŸ”— godane now cloudstore has something special
23:07 πŸ”— godane i think the url is only good for 1 hour
23:07 πŸ”— godane but you grab a new url and keep going
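That keep-going trick - refresh the expiring download URL and resume with an HTTP Range request from however many bytes are already on disk - looks roughly like this. `get_fresh_url` and `fetch_chunk` are hypothetical stand-ins for re-scraping the link and doing the transfer:

```python
import os

URL_TTL = 60 * 60   # observed above: links die after about an hour

def resume_range(path):
    """Range header picking up where a partial download left off."""
    have = os.path.getsize(path) if os.path.exists(path) else 0
    return {"Range": "bytes=%d-" % have}, have

def download(get_fresh_url, fetch_chunk, path, total_size):
    """Loop: re-scrape a fresh URL whenever the old one expires,
    then resume from the current file size until complete."""
    while True:
        headers, have = resume_range(path)
        if have >= total_size:
            return path
        url = get_fresh_url()            # hypothetical: re-scrape download link
        fetch_chunk(url, headers, path)  # append whatever arrives before expiry
```

Only works because the host honours Range requests, i.e. "supports resume" as noted above.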
23:42 πŸ”— godane that was a waste of time
23:43 πŸ”— godane unless you download small files on theisozone you're most likely out of luck
