#archiveteam 2013-03-12,Tue


Time Nickname Message
00:45 🔗 SketchCow The spam attack on the wiki has begun.
01:01 🔗 omf_ SketchCow, the oqotalk.com backup has completed and I got a 1.4 GB warc.gz
01:01 🔗 SketchCow Thanks.
01:01 🔗 SketchCow Sounds good.
01:01 🔗 SketchCow Can you upload it?
01:02 🔗 omf_ Just wanted to double check. The classic web interface can handle a file that size?
01:02 🔗 SketchCow Yes
01:02 🔗 SketchCow Well, wait, classic?
01:02 🔗 SketchCow The new one.
01:04 🔗 omf_ The new one is the drag and drop thing?
01:04 🔗 SketchCow Yes
01:04 🔗 omf_ cause the flash based one does not function correctly on linux and I was just using the old web form for all the uploads I already did
01:04 🔗 omf_ the drag and drop thing has yet to work for me either
01:06 🔗 omf_ oh shit. http://www.oqotalk.com is down now
01:07 🔗 SketchCow Sweet
01:07 🔗 SketchCow Good timing, huh
01:07 🔗 omf_ yep, I checked the wget log and when it finished earlier there were no errors
01:08 🔗 omf_ I bet I ate up their bandwidth or something
01:08 🔗 SketchCow Probably.
01:18 🔗 omf_ It is uploading now.
01:54 🔗 ryan_ is the yahoo video thing still ongoing at all?
02:13 🔗 SketchCow No.
02:13 🔗 SketchCow I'm uploading the video.
02:20 🔗 Lord_Nigh SketchCow: use asirra or that colored text thing that bisqwit came up with for tasvideos forums to prevent spammer signups on wiki?
02:20 🔗 Lord_Nigh also you need a blanket 'edit rejector' which will reject edits with spamlinks or spamtext
02:21 🔗 Lord_Nigh since spammers sometimes hire human captcha-breakers
02:22 🔗 Lord_Nigh ooh here's a really nasty one: ask people to type, in english, what color a sequence of chinese characters each are. each chinese character translates in chinese to a DIFFERENT color than the one it is.
03:22 🔗 dashcloud with it being or getting close to flea market/garage sale season on the East coast, remember that you can help preserve history by buying other people's crap
03:29 🔗 DFJustin unfortunately it un-preserves the space in your house :(
03:40 🔗 dashcloud so box it up, keep what you want, then send the rest to a new home (you could be the physical equivalent of godane !)
05:14 🔗 ivan` is there a youtube-dl patch or other software that will let me resume downloading a channel without hitting 1000 pages for videos I've already downloaded?
05:14 🔗 ivan` (that kind of behavior tends to get you CAPTCHAed for life)
05:43 🔗 BlueMax Well I know of a software on Windows that's pretty good at it...
05:46 🔗 ivan` cool, what is it?
05:52 🔗 BlueMax http://www.dvdvideosoft.com/products/dvd/Free-YouTube-Download.htm
05:52 🔗 BlueMax It's ad-supported and only works on Windows but it's still good software.
05:54 🔗 BlueMax (well it might work via WINE but I can't promise that)
05:57 🔗 ivan` thanks
05:58 🔗 ivan` I'll probably hack youtube-dl now until I have something terrible working
06:56 🔗 omf_ SketchCow, you have all of oqotalk now
06:59 🔗 SketchCow Saw.
06:59 🔗 SketchCow http://archive.org/details/archiveteam_oqotalkcom_2012_03_panic
07:49 🔗 godane good news
07:49 🔗 godane i can now scan through google search results
07:53 🔗 omf_ I think that last statement needs a little more detail
07:54 🔗 godane i'm making an index based on search results of files from cscope.us
07:54 🔗 godane search="site:cscope.us+filetype:pdf"
07:55 🔗 godane i think do a for i in $(seq 1 30); do
07:55 🔗 godane echo "http://google.com/search?q=$search&start=${i}0" >> index.txt
07:55 🔗 godane then something like this:
07:55 🔗 godane wget -x -i index.txt --user-agent="Firefox/3.0.15" --warc-file=google-cscope.us --warc-cdx -w 5
07:55 🔗 godane then this:
07:55 🔗 godane zcat *.warc.gz | grep -ohP 'href='[^'>]+' | grep 'q=' | grep 'www.cscope.us' | sed 's|.*url?q=||g' | sed 's|&.*||g' | grep -v 'webcache.google' | grep http
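The steps godane describes above can be sketched as a small shell script. This is only an illustration under stated assumptions: the function names `build_index` and `extract_links` and the output file `pdf-urls.txt` are hypothetical, the result pages are assumed to use double-quoted `href="/url?q=..."` links, and GNU `grep` (for `-P`) and `seq` are assumed to be available. The `wget` flags are the ones quoted in the chat.

```shell
#!/bin/sh
# Sketch of the index-building and link-extraction steps from the chat.
# Function names are hypothetical; nothing here touches the network.

# Print one Google result-page URL per line (start=10 .. start=300).
# This mirrors the loop in the chat, which skips the first page (start=0).
build_index() {
    search=$1
    for i in $(seq 1 30); do
        echo "http://google.com/search?q=${search}&start=${i}0"
    done
}

# Read result-page HTML on stdin and print the cscope.us target URLs.
# Assumes links look like href="/url?q=<target>&..." (an assumption).
extract_links() {
    grep -oP 'href="[^">]+"' \
        | grep 'url?q=' \
        | grep 'www.cscope.us' \
        | sed -e 's|.*url?q=||' -e 's|&.*||' -e 's|"$||' \
        | grep -v 'webcache.google' \
        | grep '^http'
}

# Usage, following the order given in the chat:
#   build_index 'site:cscope.us+filetype:pdf' > index.txt
#   wget -x -i index.txt --user-agent="Firefox/3.0.15" \
#        --warc-file=google-cscope.us --warc-cdx -w 5
#   zcat google-cscope.us.warc.gz | extract_links | sort -u > pdf-urls.txt
```

Splitting the pipeline into two functions keeps the fetch step (the only part that hits Google) separate from the parsing, so the extraction can be rerun against the saved WARC without crawling again.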
07:56 🔗 godane you guys only have 55 pdfs in wayback so this should be something
08:20 🔗 godane i got the index
08:42 🔗 godane i'm uploading my grab of cscope pdfs right now
09:06 🔗 godane uploaded: https://archive.org/details/cscope.us-google-pdfs-grab-20130312
09:31 🔗 ivan` youtube-dl has a bug that is breaking /user/ URLs, you have to move YoutubeUserIE above YoutubePlaylistIE
10:05 🔗 godane so looks like there are 71 inactive videos missing
10:05 🔗 godane in the 48000ids
11:20 🔗 ersi http://blog.archive.org/2013/03/12/riding-with-the-bit-savers/
15:09 🔗 godane good news
15:09 🔗 godane based on warc-proxy my warcs of the forums last month work just fine
16:10 🔗 Smiley alard: you've stopped showing the time left on the Available Projects page?
16:10 🔗 Smiley Or was it not ever there?
17:37 🔗 SketchCow Ha ha, I jumped the chain of command posting that.
17:37 🔗 SketchCow Small error, turns out setting a checkbox on the blog software automatically promotes to front page
17:38 🔗 DFJustin yeah man you're drowning out vital bitcoin news
17:39 🔗 ersi SketchCow: ;D
17:39 🔗 ersi SketchCow: I like the "Movie showing template" that was posted a few days ago as well (not yours though)
17:56 🔗 godane this video doesn't exist, it looks like: http://archive.org/details/g4tv.com-video36800
18:08 🔗 godane ok guys
18:09 🔗 godane the forum uploads from feb 2013 will have s= links
18:09 🔗 godane it only worked in warc viewer cause it was using cached data of forums.g4tv.com
19:43 🔗 SketchCow We're out of the "woods" with space on FOS.
19:43 🔗 SketchCow 5.7tb free, enough for the time being.
19:47 🔗 ivan` https://github.com/ludios/youtube-dl/commits/prime my youtube-dl experience is much better now
19:57 🔗 Smiley more ops plz
20:08 🔗 balrog_ submit pull request, ivan` :)
20:09 🔗 DFJustin yeah I could definitely use some of those changes
20:12 🔗 ivan` balrog_: these are some pretty low-quality diffs
20:12 🔗 ivan` the end result is good for me though
20:14 🔗 ivan` I am assuming that the filenames on the filesystem are a certain format
20:14 🔗 ivan` the blip.tv change doesn't fall back to non-Source when Source isn't available
20:14 🔗 ivan` the sleeping is sleeping thrice in a row for reasons unknown to me
21:18 🔗 brianhick hey, I've just downloaded the archiveteam-warrior after reading http://jacquesmattheij.com/come-help-save-posterous-from-oblivion# and it says to ask here before starting the posterous project - what do I need to know?
21:19 🔗 ersi Come join us in #preposterus (Project specific channel for Posterous)
21:21 🔗 ersi The warning/notice was put into the Warrior's project page because; Posterous might ban your IP. That will make you unable to browse any Posterous blogs/spaces. If you've read that, select it and go on :-)
21:22 🔗 brianhick I'm alright with that - all the ones I read have moved. Thanks for the heads up.
21:23 🔗 ersi There's more activity in #preposterus btw, since that's project specific :) That's where all the project updates happen as well
21:24 🔗 brianhick many thanks, I'll try there.
21:34 🔗 ersi If you're here regarding Jacques Mattheij's HackerNews post about Posterous, the project specific channel (Where everything interesting happens) is at #preposterus (on this very same IRC Network)
21:34 🔗 balrog_ someone put that in the topic
21:35 🔗 ersi Hmm, I guess I'll save the current one and put it back later
21:36 🔗 balrog_ no I meant add it to the topic
21:37 🔗 ersi I know. I'm doing it right now. Chill pill.
21:38 🔗 ersi stupid topic charlimit
22:09 🔗 CoJaBo Does anyone know- Are there any other long-lived archives other than Archive.org/Wayback that archive webpages in general?
22:12 🔗 Andres_ uhm
22:13 🔗 Andres_ well there's always google web cache IMO
22:13 🔗 Andres_ although
22:13 🔗 Andres_ it's not a permanent archive iirc
22:13 🔗 Andres_ nor is it an archive at all
22:14 🔗 Andres_ you have webcitation.org
22:14 🔗 Andres_ too
22:14 🔗 Andres_ it's for references more than archiving
22:14 🔗 Andres_ but it's decent
22:15 🔗 Andres_ ersi,
22:15 🔗 ersi Heya
22:15 🔗 Andres_ have you spoken to webcitation.org staff
22:15 🔗 Andres_ they seem to be a little troubled
22:15 🔗 ersi About? Posterous?
22:15 🔗 Andres_ re: https://fundrazr.com/campaigns/aQMp7
22:15 🔗 ersi Oh
22:15 🔗 Andres_ no
22:15 🔗 Andres_ about themselves
22:16 🔗 ersi No, I haven't.
22:16 🔗 CoJaBo I'm mostly interested in ones that crawl automatically; there was an incident where an... unscrupulous company decided to upload large numbers of sensitive documents to a site; these need to be removed from as many places as possible.
22:16 🔗 ersi Looks like they're dying :-/
22:16 🔗 ersi CoJaBo: There's Common Crawl.
22:16 🔗 ersi It's not an archive per se, but it effectively is.
22:16 🔗 Andres_ someone should contact 'em
22:16 🔗 Andres_ and ask for a backup
22:16 🔗 Andres_ .torrent or something
22:17 🔗 ersi Calm down with that enter button ;)
22:17 🔗 balrog_ btw yahoo message boards is shutting down in about half a month
22:17 🔗 balrog_ anyone doing anything about that???
22:17 🔗 ersi There's #BurnTheMessenger and there's a project page on ArchiveTeam wiki
22:17 🔗 Andres_ :p sorry, this bad habit is really old
22:17 🔗 Andres_ started since I started IRCing at DALnet
22:17 🔗 Andres_ bad habits never die
22:18 🔗 ersi Please keep traffic in this channel to a low. #archiveteam-bs is for freefloat chat and then there's the project channels.
22:19 🔗 CoJaBo ersi: huh.. is there any way to remove data from there?
22:19 🔗 Andres_ CoJaBo, what files did they upload, just wondering
22:19 🔗 CoJaBo Andres_: Everything they had access to
22:20 🔗 CoJaBo It was one of those outsourcing web development companies
22:20 🔗 ersi CoJaBo: From Common Crawl?
22:20 🔗 CoJaBo ersi: Yeh, for starters..
22:25 🔗 ersi CoJaBo: I'd try contacting them and asking nicely.
22:25 🔗 ersi I'd consider leaving it there though. In my opinion, if something's been public, let it be public (ish)
22:29 🔗 CoJaBo ersi: Yeh, the customers prolly wouldn't appreciate showing up there tho lol..
22:31 🔗 ersi CoJaBo: Ah, aight. Well, a friendly nod should work.
22:32 🔗 CoJaBo Is there a way to search their data to see if its even there? Or do you need amazon or whatever to do that..
22:32 🔗 CoJaBo It's def. on Archive.org tho >_>
22:32 🔗 CoJaBo Hell, wtf..... Someone got c99shell on there too, niiiiiiiiccccccceeeeeeeeeeeeeee <_<
22:33 🔗 ersi If it's in the Internet Archive Wayback machine, contact them on info@archive.org and they'll help you out. Please keep in mind that they're not many people and it can take a little while.
22:34 🔗 CoJaBo I think I can just do the Robots.txt thing; tho I guess that way is permanent isnt it..
22:35 🔗 ersi Yes, you can exclude it by adding a robots.txt to the domains that are affected.
22:35 🔗 CoJaBo Ah hell, thats right, they uploaded it to their own site too :/
22:35 🔗 ersi The wayback machine will poll the current domain/robots.txt before showing something from the wayback archives. If it excludes, it won't show it.
22:35 🔗 ersi Yeah, just contact the archive and they'll probably help you out. No biggie.
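The robots.txt exclusion ersi describes can be sketched as follows. This is a hedged illustration, not something stated in the chat: the `ia_archiver` user-agent is assumed here as the name the Wayback Machine has historically honored, and the function name `write_wayback_exclusion` and its document-root argument are hypothetical. Note that the sketch overwrites any existing robots.txt in that directory.

```shell
#!/bin/sh
# Write a robots.txt that asks the Internet Archive crawler to exclude
# the whole site. "ia_archiver" is an assumption, not from the chat.
# Warning: this replaces any robots.txt already in the directory.
write_wayback_exclusion() {
    docroot=$1
    cat > "${docroot}/robots.txt" <<'EOF'
User-agent: ia_archiver
Disallow: /
EOF
}

# Usage (the path is a placeholder for the site's document root):
#   write_wayback_exclusion /path/to/docroot
```

As CoJaBo notes, this only helps on domains he controls; for copies hosted on someone else's site, contacting the archive directly is the remaining option.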
22:36 🔗 CoJaBo Yeh..... Advice- NEVER hire outsourcing companies LOL...........
22:37 🔗 ersi Yeah, heh.
22:37 🔗 ersi == If you're here regarding Jacques Mattheij's HackerNews post about Posterous, the project specific channel (Where everything interesting happens) is at #preposterus (on this very same IRC Network) ==
22:38 🔗 ersi (Saw a bunch of new clients join up)
