Time |
Nickname |
Message |
00:49
🔗
|
Asparagir |
So, I know a very active forum that does not keep any content older than five days -- it's a very simple, threaded system and old stuff just ages off. |
00:49
🔗
|
Asparagir |
And their contents have apparently never been crawled by the Wayback Machine because their robots.txt explicitly disallows it. |
00:49
🔗
|
Asparagir |
But the content is really interesting and probably historically important and a lot of semi-famous people post there. |
00:50
🔗
|
Asparagir |
Sooooo...if I start crawling it on a regular basis with wget with robots=off and submit the WARC's to the IA...would that be bad? |
00:50
🔗
|
Asparagir |
More importantly, could I potentially get in any kind of trouble? Robots.txt is not as binding as a user agreement, right? |
00:53
🔗
|
Cameron_D |
The guys running the site might yell at/block you for not obeying it, but that is about all they can really do |
00:55
🔗
|
Asparagir |
If I submit the WARC's to IA as part of an ArchiveTeam grab, will the content eventually find its way into the Wayback Machine, even if the robots.txt stays so restrictive? |
00:57
🔗
|
Cameron_D |
I don't think the wayback will let you browse it until the file is removed |
00:57
🔗
|
Cameron_D |
Like: http://web.archive.org/web/http://twitter.com/ |
01:00
🔗
|
Asparagir |
Hrmmm. But at least I would know the content exists somewhere... It's ironic that this message board is so ephemeral in nature, and yet it's talking about major works of art and culture, and often gets posts from people that historians of the future will be studying. |
01:01
🔗
|
Cameron_D |
Yeah, and the individual WARCs will always be available for download |
01:03
🔗
|
Asparagir |
Okay, I'm comvinced. Will need to brush up on my bash scripting and cron kung-fu so I can get this thing scraped and uploaded to IA on some kind of regular schedule. |
01:03
🔗
|
Asparagir |
convinced, even. |
01:32
🔗
|
omf_ |
Asparagir, do it. The warcs could always be displayed elsewhere. I mean there are like multiple geocities mirrors online in the wild now |
01:33
🔗
|
Asparagir |
Project: JazzHands is a go. Setting up the cloud server now. |
01:35
🔗
|
DFJustin |
it's easier to take something down later than to magic it back when it doesn't exist anymore |
01:45
🔗
|
* |
turnip is away: idk probs arma |
01:45
🔗
|
* |
turnip is back (gone 00:00:09) |
02:15
🔗
|
Asparagir |
Project JazzHands up and running. Cron will call a bash script every few days to index the forum with wget and then submit it to IA with curl. Fosse and Sondheim lovers of the future, you're welcome. |
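A minimal sketch of the crawl step described in the message above. This is not Asparagir's actual script, which is not shown in the log. It only assembles a wget command line that writes a dated WARC and ignores robots.txt; the function name, the example.com URL, and the one-second wait are illustrative assumptions. The wget options used (`--mirror`, `--page-requisites`, `-e robots=off`, `--wait`, `--warc-file`) are available in wget 1.14 and later.

```python
from datetime import date
from typing import List
from urllib.parse import urlparse


def build_wget_command(start_url: str, crawl_date: date) -> List[str]:
    """Return a wget argv that mirrors start_url into a dated WARC file.

    Hypothetical helper for illustration; adjust the options to the site.
    """
    host = urlparse(start_url).hostname or "site"
    # wget appends ".warc.gz" to this prefix on its own.
    warc_prefix = f"{host}-{crawl_date:%Y%m%d}"
    return [
        "wget",
        "--mirror",             # recursive crawl with timestamping
        "--page-requisites",    # also fetch images/CSS needed to render pages
        "-e", "robots=off",     # do not honour robots.txt, as discussed in the log
        "--wait=1",             # assumed politeness delay between requests
        f"--warc-file={warc_prefix}",
        start_url,
    ]


if __name__ == "__main__":
    # Placeholder URL; print the command instead of running it.
    print(" ".join(build_wget_command("http://www.example.com/forum/", date.today())))
```

A cron entry could then run a script like this every few days and hand the resulting argv to `subprocess.run`; the schedule itself is left out because the log does not give it.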
02:41
🔗
|
SketchCow |
Excellent. |
02:41
🔗
|
SketchCow |
Nail that crap |
03:22
🔗
|
xmc |
\o/ |
03:23
🔗
|
SketchCow |
I think we really need to take advantage of this "lull" to clean things up |
03:26
🔗
|
Asparagir |
I don't know if anyone wants to add them to the Wiki, but I put together two little Gists with simple code for fellow newbies to use when crawling sites and then submitting to the IA: |
03:26
🔗
|
Asparagir |
https://gist.github.com/Asparagirl |
04:41
🔗
|
SketchCow |
http://jsmess.textfiles.com/messbeta.html?module=a800 now has a pile of Atari 800 games |
04:46
🔗
|
yipdw |
Asparagir: there's similar stuff at http://archiveteam.org/index.php?title=Wget#Creating_WARC_with_wget |
04:46
🔗
|
yipdw |
Asparagir: but it could definitely stand to be cleaned up, especially e.g. given a more obvious title |
04:46
🔗
|
yipdw |
and front-page billing |
05:16
🔗
|
Cameron_D |
Got an email the other day, GameArena (http://www.gamearena.com.au/ ) are closing their forums (2.8 million posts) and downloads on September 9 |
07:44
🔗
|
sammo |
hi |
07:44
🔗
|
sammo |
need help with my facebook archive |
07:45
🔗
|
sammo |
how long the archive email will be send? |
07:47
🔗
|
ersi |
How do you mean? |
07:49
🔗
|
SmileyG |
sammo: we don't know, we don't run it or have anything to do with facebook, we simply advise you on how to get your data out. |
08:02
🔗
|
sammo |
emm,this is not facebook support ? |
08:03
🔗
|
sammo |
>_< |
08:06
🔗
|
SmileyG |
sammo: nope. |
08:06
🔗
|
sammo |
owh, ok, |
08:06
🔗
|
SmileyG |
This is ArchiveTeam, we archive websites which are shutting down, and help users grab their data. |
08:06
🔗
|
sammo |
thanks for reply, |
08:06
🔗
|
SmileyG |
Facebook has some "archive tools" built into it, you've likely seen a page about them |
08:07
🔗
|
sammo |
yup, |
08:07
🔗
|
sammo |
they said will email me the download link once archive, but already 2 day i havent receive any email, |
08:09
🔗
|
SmileyG |
I guess if your account has lots of content, it might take awhile (or their servers are overloaded as they prob don't want people getting their data out). |
08:09
🔗
|
SmileyG |
sammo: Can I ask how you ended up here btw? Did you find our wiki or something? |
08:09
🔗
|
sammo |
owh,ok, |
08:09
🔗
|
sammo |
ya , wiki, |
08:10
🔗
|
SmileyG |
Cool :) |
08:10
🔗
|
SmileyG |
Well, as I said, we can't help you any more than that, but if you'd like to discuss other things, feel free to join #archiveteam-bs where we chat about anything and everything. We try and keep this channel clear for ArchiveTeam issues. |
08:10
🔗
|
sammo |
you guys didnt work for facebook? |
08:10
🔗
|
SmileyG |
Nope. No association at all. |
08:11
🔗
|
sammo |
owh, i thought you guys were the rich programmers LOL |
08:11
🔗
|
sammo |
thanks again, |
11:51
🔗
|
ponas |
^-- heh, I've tried to download my facebook content several times the past few years. never get the damn email. |
11:53
🔗
|
ersi |
that sucks :/ |
11:56
🔗
|
BlueMax |
talk to the book cause the face ain't listenin' |
11:56
🔗
|
BlueMax |
...I'm sorry |
11:56
🔗
|
ersi |
No you aren't |
11:57
🔗
|
ersi |
Does Facebook have any kind of "Support"? |
11:59
🔗
|
SmileyG |
not that is real, no |
11:59
🔗
|
SmileyG |
unless you are law enforcement |
12:00
🔗
|
BlueMax |
kind of odd that the biggest social network in the world doesn't have real support |
12:00
🔗
|
SmileyG |
try contacting them |
12:00
🔗
|
SmileyG |
it's not fun. |
12:00
🔗
|
SmileyG |
Like when there was some XSS glitch which allowed a site to send messages as yourself. |
12:01
🔗
|
SmileyG |
I could see it happening, had documented it, tried to report it, was ignored. |
15:36
🔗
|
Tephra |
been away a couple days, anyone got this: http://www.zeroshare.info/? |
15:42
🔗
|
godane |
i'm grabbing it |
15:46
🔗
|
SketchCow |
Cameron_D: That IS important |
15:48
🔗
|
godane |
uploaded: http://archive.org/details/www.zeroshare.info-20130816 |
15:52
🔗
|
godane |
pc marketplace is closing: http://support.xbox.com/en-US/games/pc-games/pc-marketplace-closing |
16:33
🔗
|
antomatic |
sportsinreview.com was also written by the same author as the zeroshare site - it too has a final post on its front page |
16:35
🔗
|
ATZ0 |
patch.com update - 60% of sites will continue, 20% to partner with other outlets, 20% consolidated or completely closed. 480 patch.com employees losing jobs today: http://jimromenesko.com/2013/08/16/aol-boss-tim-armstrong-says-40-of-patch-workforce-will-be-laid-off/ |
16:45
🔗
|
Tephra |
godane: thanks, fast work! |
16:52
🔗
|
RedType |
godane: im surprised they didnt do this when win 8 store came out |
16:53
🔗
|
omf_ |
They can only handle so much bad PR at a time ;) |
17:01
🔗
|
SketchCow |
http://ascii.textfiles.com/archives/4029 |
18:23
🔗
|
SketchCow |
Could someone please WARC-WGET http://martinmanleylifeanddeath.com/ |
18:23
🔗
|
SketchCow |
Won't be big. |
18:34
🔗
|
godane |
i'm mirroring it right now |
18:35
🔗
|
godane |
i got the zeroshare.info mirrored |
18:36
🔗
|
SketchCow |
Thanks. |
18:37
🔗
|
godane |
i'm uploading kevin rose's foundation series |
18:37
🔗
|
godane |
lots of interviews on there |
18:38
🔗
|
godane |
revision 3 stopped making releases of it on their site after episode 29 |
18:39
🔗
|
godane |
so google ventures as a key word |
18:40
🔗
|
godane |
will most likely add Google Ventures as the creator from episode 30 on |
18:43
🔗
|
godane |
may even do it from episode 21 |
18:48
🔗
|
godane |
uploaded: http://archive.org/details/martinmanleylifeanddeath.com-20130816 |
19:05
🔗
|
godane |
uploaded: http://archive.org/details/Foundation_1 |
19:51
🔗
|
ersi |
Uh, creepy/cool - Google has already indexed the items I've uploaded to IA |
19:57
🔗
|
antomatic |
Anyone grabbing sportsinreview.com ? (Martin Manley's other site) |
20:06
🔗
|
Tephra |
I can get it |
20:06
🔗
|
antomatic |
thanks tephra! |
20:10
🔗
|
omf_ |
My first pass at a yahoo groups list is almost done |
20:10
🔗
|
winr4r |
hey omf_! |
20:11
🔗
|
omf_ |
winr4r where you been at? |
20:11
🔗
|
winr4r |
omf_: working away! |
20:11
🔗
|
winr4r |
i got hired as a contractor at a place for four weeks |
20:11
🔗
|
winr4r |
i just finished the first two |
20:11
🔗
|
omf_ |
excellent |
20:11
🔗
|
winr4r |
(as a web developer) |
20:12
🔗
|
winr4r |
it is actually multiple times as much as i have ever earned in the same time period in my life |
20:12
🔗
|
winr4r |
so, things are good |
20:18
🔗
|
xmc |
winr4r: yay, money! |
20:28
🔗
|
winr4r |
for someone who is used to living very cheaply, it is weird having money |
20:30
🔗
|
antomatic |
I'm (attempting) to grab http://uponfurtherreview.blog.com/ which is an older version of SportsInReview but with open comments on the articles |
20:30
🔗
|
antomatic |
This whole thing is so sad. |
20:34
🔗
|
Tephra |
it is |
20:34
🔗
|
winr4r |
wait, what did i miss? |
20:36
🔗
|
antomatic |
Martin Manley (former sports writer, described as a 'math genius') - had dementia, killed himself on his 60th birthday yesterday. Put up a whole website about his life and why he decided to end it. |
20:36
🔗
|
Tephra |
http://www.zeroshare.info/ |
20:37
🔗
|
antomatic |
Also martinmanleylifeanddeath.com (zeroshare is a mirror) |
20:37
🔗
|
SketchCow |
http://www.atarimania.com/documents-atari-400-800-xl-xe-books_1_8.html out of nowhere |
20:38
🔗
|
antomatic |
Paid Yahoo for 5 years hosting. |
20:40
🔗
|
winr4r |
SketchCow: huuuuuug |
20:41
🔗
|
Tephra |
what's the law for upholding a contract with a deceased person? (i.e can yahoo just take it down?) |
20:42
🔗
|
antomatic |
I guess that has to be a risk |
20:42
🔗
|
ersi |
AFAIK, though IANAL: Yes, they could probably terminate it. However frecking hilarious this feels to say, Yahoo has previously honoured paid users (dead or alive) |
20:43
🔗
|
SketchCow |
The important factor X is his family |
20:43
🔗
|
SketchCow |
they might choose to yank all that shit down |
20:43
🔗
|
SketchCow |
They could do it. |
20:43
🔗
|
SketchCow |
Hence, we grab |
20:43
🔗
|
ersi |
Indeed. |
20:43
🔗
|
antomatic |
(nods) |
20:43
🔗
|
Tephra |
yes, have a wget on his blog going |
21:27
🔗
|
Tephra |
antomatic: right, should have a mirror of sportsinreview (if wget didn't screw up) |
21:28
🔗
|
antomatic |
Cool. Still got uponfurtherreview going here (again, subject to wget) |
21:29
🔗
|
Tephra |
needs better archive software |
21:29
🔗
|
yipdw |
you can double-check your WARCs with https://github.com/alard/warc-proxy |
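As a much cruder complement to browsing a grab through warc-proxy, a quick integrity pass can be sketched with the Python standard library alone. This is not what warc-proxy does; it only confirms that the gzip stream decompresses end to end and counts lines that look like WARC record headers. The function name is hypothetical, and the count is approximate because a captured payload could itself contain such a line.

```python
import gzip


def count_warc_records(path: str) -> int:
    """Count lines that look like WARC record headers in a .warc.gz file.

    Reading the whole stream raises an exception if the gzip data is
    corrupt or truncated, which is the main thing this check catches.
    """
    count = 0
    with gzip.open(path, "rb") as stream:
        for line in stream:
            # Each WARC record starts with a version line such as "WARC/1.0".
            if line.rstrip(b"\r\n") in (b"WARC/1.0", b"WARC/1.1"):
                count += 1
    return count
```

A result of zero, or an exception while reading, suggests the crawl should be rerun before uploading.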
21:30
🔗
|
Tephra |
yipdw: thanks! |
21:38
🔗
|
Tephra |
sweet looks complete, now to upload my first item to IA then |
21:43
🔗
|
Tephra |
antomatic: is there a protocol to upload these grabs? |
21:46
🔗
|
DFJustin |
choose an item name with some combination of the website name and date of crawl, upload .warc.gz, select community text as the destination, pester jason to move it to the archive team collection |
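The naming advice above can be sketched as follows, together with a scripted upload for anyone who prefers curl to the web uploader DFJustin describes. The identifier pattern mirrors items named elsewhere in this log (for example www.zeroshare.info-20130816). The upload part assumes the Internet Archive's S3-compatible endpoint and its `authorization: LOW key:secret` and `x-amz-auto-make-bucket` headers; the function names, file path, and keys are placeholders, and collection or mediatype metadata is deliberately left out.

```python
import os
from datetime import date
from typing import List


def item_identifier(site: str, crawl_date: date) -> str:
    """Combine the website name and crawl date into an item name."""
    return f"{site}-{crawl_date:%Y%m%d}"


def build_upload_command(identifier: str, warc_path: str,
                         access_key: str, secret_key: str) -> List[str]:
    """Return a curl argv that would PUT one .warc.gz into a new item.

    Assumes IA's S3-like API; check the current IA documentation
    before relying on the endpoint or headers.
    """
    filename = os.path.basename(warc_path)
    return [
        "curl", "--location",
        "--header", "x-amz-auto-make-bucket:1",  # create the item if missing
        "--header", f"authorization: LOW {access_key}:{secret_key}",
        "--upload-file", warc_path,
        f"https://s3.us.archive.org/{identifier}/{filename}",
    ]


if __name__ == "__main__":
    ident = item_identifier("www.example.com", date.today())
    # Placeholder path and keys; the command is printed, not executed.
    print(build_upload_command(ident, "/tmp/grab.warc.gz", "ACCESS", "SECRET"))
```

Moving the finished item into the Archive Team collection still needs the manual request mentioned above.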
21:47
🔗
|
Tephra |
DFJustin: thanks |
21:53
🔗
|
Tephra |
SketchCow: uploaded https://archive.org/details/Sportsinreview20130816 blog of Martin Manley |
22:18
🔗
|
dashcloud |
I haven't seen it mentioned here, so I'm passing it along: Google released an opensource HTML5 parser here: https://github.com/google/gumbo-parser |
22:23
🔗
|
omf_ |
yeah I started playing with it dashcloud |
22:23
🔗
|
omf_ |
look at how many pages they tested it on |
22:26
🔗
|
dashcloud |
that's a lot of pages |
22:27
🔗
|
omf_ |
;D |
22:27
🔗
|
omf_ |
that is real testing |
22:28
🔗
|
* |
ersi gets really excited, prior to even following the link |
22:29
🔗
|
ersi |
Oh man. |
22:31
🔗
|
omf_ |
Big data is fucking awesome for testing purposes |
23:04
🔗
|
Asparagir |
First JazzHands WARC from that forum I mentioned is up. More to come daily! |
23:04
🔗
|
Asparagir |
http://archive.org/details/project-jazzhands_-_talkin-broadway-all-that-chat_-_2013-08-16 |
23:05
🔗
|
Coderjoe |
hmm |
23:05
🔗
|
Coderjoe |
worth1000 closing |
23:07
🔗
|
Coderjoe |
at least it is going to be a static museum for the time being |
23:07
🔗
|
Coderjoe |
http://logo.worth1000.com/discussions/70353/the-future-of-worth1000-everyone-please-read |
23:27
🔗
|
ersi |
Asparagir: Nice |
23:27
🔗
|
ersi |
Coderjoe: Yeah, it's been talked about |
23:28
🔗
|
antomatic |
Sketchcow: https://archive.org/details/Uponfurtherreview.blog.comPanicgrab20130815.warc |
23:29
🔗
|
antomatic |
(first IA upload.. sorry if I did it wrong.) :) |
23:29
🔗
|
ersi |
As long as you've uploaded it, nothing's wrong. Metadata can be edited/improved at any time |
23:29
🔗
|
antomatic |
phew ;) |
23:30
🔗
|
ersi |
Yay, I'm up to 10 items uploaded :3 |