Time |
Nickname |
Message |
01:52
|
wp494 |
asking again since it wasn't done: would anyone with admin access to the wiki add the puush project to the main page? |
01:57
|
ivan` |
winr4r can do that but he ain't here |
07:52
|
SmileyG |
http://support.tu.com/t5/English-Community/ct-p/English_Community - Hi, we're sorry to say that TU Me is being discontinued. During the next month we'll be phasing out our support for TU Me, which means that after Sunday September 8th it will be inoperable. For your safety and security, all of your messages, photos and data will be permanently deleted from our servers and the app will not be available for ... |
07:52
|
SmileyG |
... download via the App Store, the Google Play Store, and the BlackBerry App World. If you have any questions or queries, please email us at support@tu.com. |
07:53
|
SmileyG |
Has a forum, can someone grab? |
07:53
|
SmileyG |
i may no longer have access to this machine later, so yeah. |
07:56
|
ivan` |
I am warc'ing support. and www. |
07:57
|
SmileyG |
ty |
07:57
|
SmileyG |
hell, warc the whole site :) (i'm guessing that pretty much IS the whole thing). |
08:12
|
SmileyG |
-rw-r--r-- 1 tim.bowers games 982M Aug 7 00:11 ./tribes_forum_06082013.warc |
08:12
|
SmileyG |
and then it 404'ed |
08:14
|
SmileyG |
no cdx tho o_O |
19:47
|
ATZ0 |
So... I've gone off on a rant about this in the past: http://www.zdnet.com/aol-patch-upheaval-hundreds-of-layoffs-but-also-new-ceo-7000019218/ - we can has start mirroring patch.com sites? |
19:48
|
ATZ0 |
I looked into it at like 1am one evening, and didn't take notes, but when I tried wget they were blocking my IP after X requests |
19:49
|
ATZ0 |
As for why this is important - "As for the present, 400 Patch sites are on the chopping block to either be merged with another site -- or shuttered altogether." |
19:49
|
antomatic |
erk |
19:49
|
ATZ0 |
Given the closure of small newspapers, these sites represent the last sliver of local media in a lot of these markets |
19:58
|
ATZ0 |
i'll fill in what detail i know. basically - http://www.patch.com/ , click a state, you get the local patch sites that pop up. |
19:58
|
ATZ0 |
the sites aggregate/share content across multiple local sites, so if you can somehow dedupe the storage it should take up a lot less room. |
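Editor's sketch of the dedupe idea ATZ0 raises here: store each distinct response body once, keyed by its SHA-256 digest, and keep a per-URL index pointing at the digest. The function name `dedupe_by_content` and the in-memory dictionaries are illustrative assumptions, not anything used in the channel. Exact-hash matching only collapses byte-identical bodies, so a shared article wrapped in site-specific page chrome would still be stored twice.

```python
import hashlib


def dedupe_by_content(pages):
    """Store each distinct body once, keyed by its SHA-256 digest.

    `pages` maps URL -> response body (bytes). Returns (store, index):
    `store` maps digest -> body, `index` maps URL -> digest.
    Hypothetical helper for illustration only.
    """
    store = {}
    index = {}
    for url, body in pages.items():
        digest = hashlib.sha256(body).hexdigest()
        # Keep the first copy of each body; later duplicates only add an index entry.
        store.setdefault(digest, body)
        index[url] = digest
    return store, index
```

Images and other static assets shared across local sites are the likeliest byte-identical matches, so that is probably where this kind of scheme would save the most space.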
20:00
|
ATZ0 |
wget pointed at the root of the site, i.e. lakeridge.patch.com, with appropriate options was working for me, but after a certain number of grabs it was then throwing me back a "You've been doing that too much" or similar message, which I figured out was IP-tied (non-cookie based), and if I recall correctly after 20-30 minutes it allowed resumption of downloading |
20:01
|
ATZ0 |
not sure if that throttling is per local site, or patch.com overall. |
20:01
|
antomatic |
any idea what kind of size a patch.com local site is, on disc, roughly? |
20:02
|
ATZ0 |
it's going to vary based on the age of the site |
20:02
|
ATZ0 |
they didn't all just spring up at once |
20:02
|
antomatic |
ah |
20:02
|
ATZ0 |
plus the frequency/usage of it. say some oklahoma site who has a very active editor and a large user population might be 4x the size of a lesser used site where it never caught on |
20:07
|
ATZ0 |
http://techcrunch.com/2013/08/09/armstrong-confirms-hundreds-of-layoffs-at-patch-400-sites-shuttered-or-partnered-off-and-a-new-ceo/ - this is more urgent than i thought |
20:07
|
ATZ0 |
not 400 possibly closing, 400 will be closing/"partnered" |
20:07
|
ATZ0 |
let's not let AOL destroy the last 3 years of local news in some places |
20:08
|
antomatic |
seems like a good choice for archiving. |
20:12
|
ATZ0 |
and now that i've waved my arms in the air and screamed fire, i'll let the heroes who know how to actually script this stuff hopefully run with it and let my happy archiveteam warrior contribute to the mongol horde. |
20:15
|
antomatic |
Crikey, from the page source of patch.com I make it 909 individual sites |
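Editor's sketch of one way to get a site count like this from the page source: pull every `*.patch.com` hostname out of the HTML and de-duplicate. The regular expression, the function name `extract_patch_sites`, and the assumption that local sites appear as absolute `http(s)://<name>.patch.com` links are all guesses for illustration; antomatic's actual method is not shown in the log.

```python
import re

# Assumed link shape: absolute URLs whose host is <name>.patch.com.
SITE_RE = re.compile(r"https?://([a-z0-9-]+)\.patch\.com", re.IGNORECASE)


def extract_patch_sites(html):
    """Return the sorted, de-duplicated local Patch hostnames found in `html`."""
    names = {match.group(1).lower() for match in SITE_RE.finditer(html)}
    names.discard("www")  # the national front page is not a local site
    return sorted(f"{name}.patch.com" for name in names)
```

Calling `len(extract_patch_sites(page_source))` on the saved front page would give a count comparable to the one quoted, provided the assumed link shape holds.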
20:15
|
ATZ0 |
that sounds about right. |
20:16
|
godane |
so it turns out ntfs-3g freezes slackware when copying files |
20:17
|
godane |
i thought it was cause i was copying + downloading + uploading from the same drive |
20:17
|
godane |
to another ntfs drive |
20:30
|
antomatic |
http://archiveteam.org/index.php?title=List_of_Patch.com_sites |
20:32
|
ATZ0 |
you know, when you say 909 it doesn't sound imposing but then that list ... |
20:32
|
ATZ0 |
O_o |
20:34
|
antomatic |
Aah... how long could it possibly take? :) |
20:35
|
ATZ0 |
other than the throttling, not that long i think |
20:35
|
DFJustin |
we downloaded a million splinder sites :P |
20:53
|
yipdw |
well, I'm grabbing altadena.patch.com as a test |
20:55
|
yipdw |
seems like the usual wget-warc commands are doing fine |
20:56
|
ATZ0 |
i'm curious if it starts to die on you like it did for me |
20:56
|
yipdw |
define die |
20:56
|
yipdw |
I'm running with --random-wait, --wait 1 |
20:56
|
ATZ0 |
started getting html pages returned with throttling, but i probably wasnt using the wait like that |
20:56
|
yipdw |
so wget waits between 0.5 and 1.5 seconds between requests |
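Editor's sketch of the pacing described here, written in Python rather than as wget flags: a delay drawn uniformly between 0.5x and 1.5x of the base wait, plus a cooldown when the throttle page shows up. The throttle string comes from ATZ0's recollection ("or similar"), and the 30-minute cooldown from the same recollection, so both are assumptions. `polite_fetch`, `looks_throttled`, and the injected `fetch`/`sleep` callables are hypothetical names for illustration.

```python
import random
import time


def random_wait(base_seconds):
    """Return a delay between 0.5x and 1.5x of `base_seconds`."""
    return random.uniform(0.5 * base_seconds, 1.5 * base_seconds)


def looks_throttled(body):
    """Guess whether `body` is the throttle page (exact wording is an assumption)."""
    return "doing that too much" in body.lower()


def polite_fetch(urls, fetch, sleep=time.sleep, base_wait=1.0, cooldown=30 * 60):
    """Fetch each URL with a randomized pause, retrying once after a cooldown.

    `fetch` is any callable taking a URL and returning the body as text.
    """
    results = {}
    for url in urls:
        body = fetch(url)
        if looks_throttled(body):
            sleep(cooldown)  # back off for the assumed block duration
            body = fetch(url)
        results[url] = body
        sleep(random_wait(base_wait))
    return results
```

Passing `fetch` and `sleep` in as arguments keeps the pacing logic separate from the network code, so it can be exercised without touching the real site.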
20:56
|
ATZ0 |
it was late, i may have been drunk :) |
20:57
|
yipdw |
or more specifically, I'm running https://gist.github.com/yipdw/04e3883a9cdb87735fc4 |
21:04
|
antomatic |
I suppose the other question is how many of these sites are already in the Wayback machine with reasonably-recent crawls. |
21:04
|
antomatic |
hm, might have a way to do that.. |
21:05
|
yipdw |
antomatic: one site per person might be doable |
21:05
|
yipdw |
e.g. with a tracker |
21:06
|
yipdw |
I'll set one up |
21:07
|
yipdw |
SmileyG: can you get a project on tracker.archiveteam.org set up for this patch grab? |
21:08
|
yipdw |
in the meantime, I'll deploy universal-tracker elsewhere |
21:19
|
antomatic |
Right, got a script running at the moment checking which patches are in and not in wayback |
21:19
|
antomatic |
(suspect they all will be, but obviously of differing ages) |
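Editor's sketch of a Wayback presence check along these lines. antomatic's script is not shown in the log, so this is not a reconstruction of it; it assumes the Internet Archive's public availability endpoint (`/wayback/available`) and its JSON shape with an `archived_snapshots.closest` object. The names `closest_snapshot` and `check_site` are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint: the Internet Archive's Wayback availability API.
API = "https://archive.org/wayback/available"


def closest_snapshot(response):
    """Return (timestamp, url) of the closest snapshot, or None if there is none."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest["timestamp"], closest["url"]


def check_site(host):
    """Query the availability API for `host` (performs a network request)."""
    query = urlencode({"url": host})
    with urlopen(f"{API}?{query}", timeout=30) as resp:
        return closest_snapshot(json.load(resp))
```

The timestamp is a `YYYYMMDDhhmmss` string, so its first six characters are enough to bucket sites by the month of their most recent crawl.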
21:19
|
ATZ0 |
i feel like Hannibal Smith, I love it when a plan comes together |
21:20
|
ATZ0 |
(insert parody A-Team logo here) |
21:22
|
ATZ0 |
in 2013, a crack-addled-brained website of an idea was sent to death by a CEO for a fate they didn't deserve. This website promptly escaped near death from a maximum-security disk wipe to the ArchiveTeam underground. Today, still wanted by the corporation that spawned it, it survives as an archive of history. If you have a problem...if no one else can help... and if you can find them... |
21:22
|
ATZ0 |
maybe you can hire...The Archive Team. |
21:26
|
Nemo_bis |
nice https://www.eff.org/press/releases/judge-grants-preliminary-injunction-protect-free-speech-after-eff-challenge |
21:32
|
antomatic |
The results, like the doctor, is IN. Or are in. Hm. Anyway... |
21:33
|
antomatic |
Yes, every single Patch has some presence in Wayback. |
21:33
|
antomatic |
That might have been a stupidly obvious question, though. |
21:34
|
antomatic |
Most of the few crawls I've checked manually seem to date from May 2013 |
21:35
|
antomatic |
Some later, too. |
21:39
|
godane |
uploaded: https://archive.org/details/Y2K_Family_Survival_Guide_With_Leonard_Nimoy_Palsojom1.X264.CG |
21:42
|
DFJustin |
nice find |
21:48
|
yipdw |
antomatic: well, good to know there's at least a backup |
21:48
|
antomatic |
(nod) |
21:48
|
antomatic |
May still be worth doing updated grabs, of course. |
21:58
|
omf_ |
We always try to grab closing sites because even if something is on IA there is no guarantee of the depth of that crawl. |
21:58
|
antomatic |
(nods) |
21:59
|
omf_ |
and crawl frequency only indicates the newer content was grabbed. |
21:59
|
omf_ |
having stuff already on IA helps us for URL discovery which is annoying at best |
22:00
|
omf_ |
I will do a common crawl search as well, it takes a while though since it is a 22gb bz2 file :D |
22:01
|
omf_ |
I think it took 240 minutes last time |
22:01
|
omf_ |
I know, I know I need to upgrade to SSD |
22:02
|
antomatic |
I cheated by scraping every site out of the source of www.patch.com - but I guess if the common crawl knows of any sites that might not be indexed from Patch.com itself..? |
22:02
|
antomatic |
Don't know if any older sites have come and gone in patch's lifetime |
22:02
|
antomatic |
etc. |
22:08
|
godane |
so is anyone going to go after theisozone.com? |
22:08
|
godane |
i ask cause i find lots of stuff on there |
22:08
|
godane |
i'm getting a psm dec 2004 iso |
22:12
|
omf_ |
I will look into it now godane |
22:13
|
omf_ |
ugh cloudstore file hosting |
22:19
|
godane |
i know |
22:21
|
godane |
good news is that cloudstore supports resume |
22:30
|
omf_ |
That is not the problem with sites like these. You have to drive a JavaScript-capable program in order to trigger the download on cloudstore's web page; a link to the file is not exposed. Granted, some of these shit sites already have workaround libraries and tools |
22:31
|
omf_ |
The upside is theisozone has a very sane url scheme |
22:35
|
godane |
i got a head start on mirroring glenn beck highlights |
22:35
|
godane |
trying to mirror that cause i don't really have video of everything on his network |
22:36
|
godane |
this sort of solves this and this stuff should hopefully not need to go dark cause it's on the guy's website |
22:36
|
godane |
for free |
23:06
|
godane |
now cloudstore has something special |
23:07
|
godane |
i think the url only lasts 1 hour |
23:07
|
godane |
but you grab a new url and keep going |
23:42
|
godane |
that was a waste of time |
23:43
|
godane |
unless you're downloading small files on theisozone you'll most likely be out of luck