#archiveteam 2013-08-16,Fri


Time Nickname Message
00:49 🔗 Asparagir So, I know a very active forum that does not keep any content older than five days -- it's a very simple, threaded system and old stuff just ages off.
00:49 🔗 Asparagir And their contents have apparently never been crawled by the Wayback Machine because their robots.txt explicitly disallows it.
00:49 🔗 Asparagir But the content is really interesting and probably historically important and a lot of semi-famous people post there.
00:50 🔗 Asparagir Sooooo...if I start crawling it on a regular basis with wget with robots=off and submit the WARC's to the IA...would that be bad?
00:50 🔗 Asparagir More importantly, could I potentially get in any kind of trouble? Robots.txt is not as binding as a user agreement, right?
00:53 🔗 Cameron_D The guys running the site might yell at/block you for not obeying it, but that is about all they can really do
00:55 🔗 Asparagir If I submit the WARC's to IA as part of an ArchiveTeam grab, will the content eventually find its way into the Wayback Machine, even if the robots.txt stays so restrictive?
00:57 🔗 Cameron_D I don't think the wayback will let you browse it until the file is removed
00:57 🔗 Cameron_D Like: http://web.archive.org/web/http://twitter.com/
01:00 🔗 Asparagir Hrmmm. But at least I would know the content exists somewhere... It's ironic that this message board is so ephemeral in nature, and yet it's talking about major works of art and culture, and often gets posts from people that historians of the future will be studying.
01:01 🔗 Cameron_D Yeah, and the individual WARCs will always be available for download
01:03 🔗 Asparagir Okay, I'm comvinced. Will need to brush up on my bash scripting and cron kung-fu so I can get this thing scraped and uploaded to IA on some kind of regular schedule.
01:03 🔗 Asparagir convinced, even.
01:32 🔗 omf_ Asparagir, do it. The warcs could always be displayed elsewhere. I mean there are like multiple geocities mirrors online in the wild now
01:33 🔗 Asparagir Project: JazzHands is a go. Setting up the cloud server now.
01:35 🔗 DFJustin it's easier to take something down later than to magic it back when it doesn't exist anymore
01:45 🔗 * turnip is away: idk probs arma
01:45 🔗 * turnip is back (gone 00:00:09)
02:15 🔗 Asparagir Project JazzHands up and running. Cron will call a bash script every few days to index the forum with wget and then submit it to IA with curl. Fosse and Sondheim lovers of the future, you're welcome.
02:41 🔗 SketchCow Excellent.
02:41 🔗 SketchCow Nail that crap
03:22 🔗 xmc \o/
03:23 🔗 SketchCow I think we really need to take advantage of this "lull" to clean things up
03:26 🔗 Asparagir I don't know if anyone wants to add them to the Wiki, but I put together two little Gists with simple code for fellow newbies to use when crawling sites and then submitting to the IA:
03:26 🔗 Asparagir https://gist.github.com/Asparagirl
04:41 🔗 SketchCow http://jsmess.textfiles.com/messbeta.html?module=a800 now has a pile of Atari 800 games
04:46 🔗 yipdw Asparagir: there's similar stuff at http://archiveteam.org/index.php?title=Wget#Creating_WARC_with_wget
04:46 🔗 yipdw Asparagir: but it could definitely stand to be cleaned up, especially e.g. given a more obvious title
04:46 🔗 yipdw and front-page billing
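The crawl Asparagir describes (cron calling a script that runs wget with robots=off and writes a WARC) might look roughly like the sketch below. It only builds the wget command line and does not run it. The function names, the `forum.example.com` host, and the one-second wait are assumptions for illustration, not the contents of the linked Gists; the wget options themselves (`--mirror`, `--page-requisites`, `-e robots=off`, `--wait`, `--warc-file`) are standard.

```python
import datetime
import shlex


def warc_prefix_for(site_name, crawl_date):
    """Name the WARC after the site and crawl date, e.g. site-20130816."""
    return f"{site_name}-{crawl_date:%Y%m%d}"


def build_wget_command(start_url, warc_prefix, wait_seconds=1):
    """Return the wget argument list for a recursive crawl saved as a WARC.

    wget appends .warc.gz to the prefix given in --warc-file.
    """
    return [
        "wget",
        "--mirror",                    # recursive, timestamped crawl
        "--page-requisites",           # also fetch images, CSS, scripts
        "-e", "robots=off",            # ignore robots.txt, as discussed above
        f"--wait={wait_seconds}",      # assumed politeness delay between requests
        f"--warc-file={warc_prefix}",  # write <prefix>.warc.gz alongside the mirror
        start_url,
    ]


if __name__ == "__main__":
    # Hypothetical site and date, used only to show the shape of the command.
    prefix = warc_prefix_for("forum.example.com", datetime.date(2013, 8, 16))
    command = build_wget_command("http://forum.example.com/", prefix)
    print(shlex.join(command))
```

A cron entry would then call a script like this on whatever schedule suits the forum's five-day retention window.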
05:16 🔗 Cameron_D Got an email the other day, GameArena (http://www.gamearena.com.au/ ) are closing their forums (2.8 million posts) and downloads on September 9
07:44 🔗 sammo hi
07:44 🔗 sammo need help with my facebook archive
07:45 🔗 sammo how long the archive email will be send?
07:47 🔗 ersi How do you mean?
07:49 🔗 SmileyG sammo: we don't know, we don't run it or have anything to do with facebook, we simply advise you on how to get your data out.
08:02 🔗 sammo emm,this is not facebook support ?
08:03 🔗 sammo >_<
08:06 🔗 SmileyG sammo: nope.
08:06 🔗 sammo owh, ok,
08:06 🔗 SmileyG This is ArchiveTeam, we archive websites which are shutting down, and help users grab their data.
08:06 🔗 sammo thanks for reply,
08:06 🔗 SmileyG Facebook has some "archive tools" built into it, you've likely seen a page about them
08:07 🔗 sammo yup,
08:07 🔗 sammo they said will email me the download link once archive, but already 2 day i havent receive any email,
08:09 🔗 SmileyG I guess if your account has lots of content, it might take awhile (or their servers are overloaded as they prob don't want people getting their data out).
08:09 🔗 SmileyG sammo: Can I ask how you ended up here btw? Did you find our wiki or something?
08:09 🔗 sammo owh,ok,
08:09 🔗 sammo ya , wiki,
08:10 🔗 SmileyG Cool :)
08:10 🔗 SmileyG Well as I said, we can't help you anymore than that, but if you'd like to discuss other things, feel free to join #archiveteam-bs where we chat about anything and everything. We try and keep this channel clear for ArchiveTeam issues.
08:10 🔗 sammo you guys didnt work for facebook?
08:10 🔗 SmileyG Nope. No association at all.
08:11 🔗 sammo owh, i though you guy were the rich programer LOL
08:11 🔗 sammo thanks again,
11:51 🔗 ponas ^-- heh, I've tried to download my facebook content several times the past few years. never get the damn email.
11:53 🔗 ersi that sucks :/
11:56 🔗 BlueMax talk to the book cause the face ain't listenin'
11:56 🔗 BlueMax ...I'm sorry
11:56 🔗 ersi No you aren't
11:57 🔗 ersi Does Facebook have any kind of "Support"?
11:59 🔗 SmileyG not that is real, no
11:59 🔗 SmileyG unless you are law enforcement
12:00 🔗 BlueMax kind of odd that the biggest social network in the world doesn't have real support
12:00 🔗 SmileyG try contacting them
12:00 🔗 SmileyG it's not fun.
12:00 🔗 SmileyG Like when there was some xss glitch which allowed site to send messages as yourself.
12:01 🔗 SmileyG I could see it happening, had documented it, tried to report it, was ignored.
15:36 🔗 Tephra been away a couple days, anyone got this: http://www.zeroshare.info/?
15:42 🔗 godane i'm grabing it
15:46 🔗 SketchCow Cameron_D: That IS important
15:48 🔗 godane uploaded: http://archive.org/details/www.zeroshare.info-20130816
15:52 🔗 godane pc marketplace is closing: http://support.xbox.com/en-US/games/pc-games/pc-marketplace-closing
16:33 🔗 antomatic sportsinreview.com was also written by the same author as the zeroshare site - it too has a final post on its front page
16:35 🔗 ATZ0 patch.com update - 60% of sites will continue, 20% to partner with other outlets, 20% consolidated or completely closed. 480 patch.com employees losing jobs today: http://jimromenesko.com/2013/08/16/aol-boss-tim-armstrong-says-40-of-patch-workforce-will-be-laid-off/
16:45 🔗 Tephra godane: thanks, fast work!
16:52 🔗 RedType godane: im surprised they didnt do this when win 8 store came out
16:53 🔗 omf_ They can only handle so much bad PR at a time ;)
17:01 🔗 SketchCow http://ascii.textfiles.com/archives/4029
18:23 🔗 SketchCow Could someone please WARC-WGET http://martinmanleylifeanddeath.com/
18:23 🔗 SketchCow Won't be big.
18:34 🔗 godane i'm mirroring it right now
18:35 🔗 godane i got the zeroshare.info mirrored
18:36 🔗 SketchCow Thanks.
18:37 🔗 godane i'm uploading kevin rose's foundation series
18:37 🔗 godane lots of interviews on there
18:38 🔗 godane revision 3 stopped making releases of it on their site after episode 29
18:39 🔗 godane so google ventures as a key word
18:40 🔗 godane will most likely add Google Ventures as the creator from episode 30 on
18:43 🔗 godane may even do it from episode 21
18:48 🔗 godane uploaded: http://archive.org/details/martinmanleylifeanddeath.com-20130816
19:05 🔗 godane uploaded: http://archive.org/details/Foundation_1
19:51 🔗 ersi Uh, creepy/cool - Google has already indexed the items I've uploaded to IA
19:57 🔗 antomatic Anyone grabbing sportsinreview.com ? (Martin Manley's other site)
20:06 🔗 Tephra I can get it
20:06 🔗 antomatic thanks tephra!
20:10 🔗 omf_ My first pass at a yahoo groups list is almost done
20:10 🔗 winr4r hey omf_!
20:11 🔗 omf_ winr4r where you been at?
20:11 🔗 winr4r omf_: working away!
20:11 🔗 winr4r i got hired as a contractor at a place for four weeks
20:11 🔗 winr4r i just finished the first two
20:11 🔗 omf_ excellent
20:11 🔗 winr4r (as a web developer)
20:12 🔗 winr4r it is actually multiple times as much as i have ever earned in the same time period in my life
20:12 🔗 winr4r so, things are good
20:18 🔗 xmc winr4r: yay, money!
20:28 🔗 winr4r for someone who is used to living very cheaply, it is weird having money
20:30 🔗 antomatic I'm (attempting) to grab http://uponfurtherreview.blog.com/ which is an older version of SportsInReview but with open comments on the articles
20:30 🔗 antomatic This whole thing is so sad.
20:34 🔗 Tephra it is
20:34 🔗 winr4r wait, what did i miss?
20:36 🔗 antomatic Martin Manley (former sports writer, described as a 'math genius') - had dementia, killed himself on his 60th birthday yesterday. Put up a whole website about his life and why he decided to end it.
20:36 🔗 Tephra http://www.zeroshare.info/
20:37 🔗 antomatic Also martinmanleylifeanddeath.com (zeroshare is a mirror)
20:37 🔗 SketchCow http://www.atarimania.com/documents-atari-400-800-xl-xe-books_1_8.html out of nowhere
20:38 🔗 antomatic Paid Yahoo for 5 years hosting.
20:40 🔗 winr4r SketchCow: huuuuuug
20:41 🔗 Tephra what's the law for upholding a contract with a deceased person? (i.e can yahoo just take it down?)
20:42 🔗 antomatic I guess that has to be a risk
20:42 🔗 ersi AFAIK, though IANAL: Yes, they could probably terminate it. However frecking hilarious this feels to say; Yahoo has previously honoured paid users (dead or alive)
20:43 🔗 SketchCow The important factor X is his family
20:43 🔗 SketchCow they might choose to yank all that shit down
20:43 🔗 SketchCow They could do it.
20:43 🔗 SketchCow Hence, we grab
20:43 🔗 ersi Indeed.
20:43 🔗 antomatic (nods)
20:43 🔗 Tephra yes, have a wget on his blog going
21:27 🔗 Tephra antomatic: right, should have a mirror of sportsinreview (if wget didn't screw up)
21:28 🔗 antomatic Cool. Still got uponfurtherreview going here (again, subject to wget)
21:29 🔗 Tephra needs better archive software
21:29 🔗 yipdw you can double-check your WARCs with https://github.com/alard/warc-proxy
21:30 🔗 Tephra yipdw: thanks!
21:38 🔗 Tephra sweet looks complete, now to upload my first item to IA then
21:43 🔗 Tephra antomatic: is there a protocol to upload these grabs?
21:46 🔗 DFJustin choose an item name with some combination of the website name and date of crawl, upload .warc.gz, select community text as the destination, pester jason to move it to the archive team collection
21:47 🔗 Tephra DFJustin: thanks
21:53 🔗 Tephra SketchCow: uploaded https://archive.org/details/Sportsinreview20130816 blog of Martin Manley
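DFJustin's upload steps (item name from website name plus crawl date, upload the .warc.gz, community texts as the destination) can be sketched as below. This only builds a curl command for the Internet Archive's S3-style upload endpoint and does not run it. The function names, the `opensource` collection identifier for community texts, the `web` mediatype, and the placeholder keys are assumptions to check against the IA documentation before use.

```python
import datetime


def item_identifier(site_name, crawl_date):
    """Combine website name and crawl date, e.g. www.zeroshare.info-20130816."""
    return f"{site_name}-{crawl_date:%Y%m%d}"


def build_upload_command(identifier, warc_path, access_key, secret_key):
    """Return a curl argument list that uploads one WARC into a new IA item."""
    filename = warc_path.rsplit("/", 1)[-1]
    return [
        "curl", "--location",
        "--header", "x-archive-auto-make-bucket:1",           # create the item if missing
        "--header", "x-archive-meta-mediatype:web",           # assumed mediatype for WARCs
        "--header", "x-archive-meta-collection:opensource",   # assumed id for community texts
        "--header", f"authorization: LOW {access_key}:{secret_key}",
        "--upload-file", warc_path,
        f"https://s3.us.archive.org/{identifier}/{filename}",
    ]


if __name__ == "__main__":
    # Placeholder path and keys; real keys come from your archive.org account.
    identifier = item_identifier("www.zeroshare.info", datetime.date(2013, 8, 16))
    command = build_upload_command(
        identifier, "grabs/www.zeroshare.info-20130816.warc.gz",
        "ACCESS_KEY", "SECRET_KEY",
    )
    print(" ".join(command))
```

Moving the item into the Archive Team collection still needs an admin, as DFJustin notes.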
22:18 🔗 dashcloud I haven't seen it mentioned here, so I'm passing it along: Google released an opensource HTML5 parser here: https://github.com/google/gumbo-parser
22:23 🔗 omf_ yeah I started playing with it dashcloud
22:23 🔗 omf_ look at how many pages they tested it on
22:26 🔗 dashcloud that's a lot of pages
22:27 🔗 omf_ ;D
22:27 🔗 omf_ that is real testing
22:28 🔗 * ersi gets really excited, prior to even following the link
22:29 🔗 ersi Oh man.
22:31 🔗 omf_ Big data is fucking awesome for testing purposes
23:04 🔗 Asparagir First JazzHands WARC from that forum I mentioned is up. More to come daily!
23:04 🔗 Asparagir http://archive.org/details/project-jazzhands_-_talkin-broadway-all-that-chat_-_2013-08-16
23:05 🔗 Coderjoe hmm
23:05 🔗 Coderjoe worth1000 closing
23:07 🔗 Coderjoe at least it is going to be a static museum for the time being
23:07 🔗 Coderjoe http://logo.worth1000.com/discussions/70353/the-future-of-worth1000-everyone-please-read
23:27 🔗 ersi Asparagir: Nice
23:27 🔗 ersi Coderjoe: Yeah, it's been talked about
23:28 🔗 antomatic Sketchcow: https://archive.org/details/Uponfurtherreview.blog.comPanicgrab20130815.warc
23:29 🔗 antomatic (first IA upload.. sorry if I did it wrong.) :)
23:29 🔗 ersi As long as you've uploaded it, nothing's wrong. Metadata can be edited/improved at any time
23:29 🔗 antomatic phew ;)
23:30 🔗 ersi Yay, I'm up to 10 items uploaded :3
