[00:45] The spam attack on the wiki has begun.
[01:01] SketchCow, the oqotalk.com backup has completed, and I got a 1.4gb warc.gz
[01:01] Thanks.
[01:01] Sounds good.
[01:01] Can you upload it?
[01:02] Just wanted to double check. The classic web interface can handle a file that size?
[01:02] Yes
[01:02] Well, wait, classic?
[01:02] The new one.
[01:04] The new one is the drag and drop thing?
[01:04] Yes
[01:04] cause the flash based one does not function correctly on linux and I was just using the old web form for all the uploads I already did
[01:04] the drag and drop thing has yet to work for me either
[01:06] oh shit. http://www.oqotalk.com is down now
[01:07] Sweet
[01:07] Good timing, huh
[01:07] yep, I checked the wget log and when it finished earlier there were no errors
[01:08] I bet I ate up their bandwidth or something
[01:08] Probably.
[01:18] It is uploading now.
[01:54] is the yahoo video thing still ongoing at all?
[02:13] No.
[02:13] I'm uploading the video.
[02:20] SketchCow: use asirra or that colored text thing that bisqwit came up with for the tasvideos forums to prevent spammer signups on wiki?
[02:20] also you need a blanket 'edit rejector' which will reject edits with spamlinks or spamtext
[02:21] since spammers sometimes hire human captcha-breakers
[02:22] ooh here's a really nasty one: ask people to type, in english, what color each of a sequence of chinese characters is. each chinese character translates in chinese to a DIFFERENT color than the one it is displayed in.
[03:22] with it being or getting close to flea market/garage sale season on the East coast, remember that you can help preserve history by buying other people's crap
[03:29] unfortunately it un-preserves the space in your house :(
[03:40] so box it up, keep what you want, then send the rest to a new home (you could be the physical equivalent of godane!)
[05:14] is there a youtube-dl patch or other software that will let me resume downloading a channel without hitting 1000 pages for videos I've already downloaded?
[05:14] (that kind of behavior tends to get you CAPTCHAed for life)
[05:43] Well I know of some software on Windows that's pretty good at it...
[05:46] cool, what is it?
[05:52] http://www.dvdvideosoft.com/products/dvd/Free-YouTube-Download.htm
[05:52] It's ad-supported and only works on Windows but it's still good software.
[05:54] (well it might work via WINE but I can't promise that)
[05:57] thanks
[05:58] I'll probably hack youtube-dl now until I have something terrible working
[06:56] SketchCow, you have all of oqotalk now
[06:59] Saw.
[06:59] http://archive.org/details/archiveteam_oqotalkcom_2012_03_panic
[07:49] good news
[07:49] i can now scan through google search results
[07:53] I think that last statement needs a little more detail
[07:54] i'm making an index based on search results of files from cscope.us
[07:54] search="site:cscope.us+filetype:pdf"
[07:55] i think do a for i in $(seq 1 30); do
[07:55] echo "http://google.com/search?q=$search&start=${i}0" >> index.txt
[07:55] then something like this:
[07:55] wget -x -i index.txt --user-agent="Firefox/3.0.15" --warc-file=google-cscope.us --warc-cdx -w 5
[07:55] then this:
[07:55] zcat *.warc.gz | grep -ohP 'href='[^'>]+' | grep 'q=' | grep 'www.cscope.us' | sed 's|.*url?q=||g' | sed 's|&.*||g' | grep -v 'webcache.google' | grep http
[07:56] you guys only have 55 pdfs in wayback so this should be something
[08:20] i got the index
[08:42] i'm uploading my grab of cscope pdfs right now
[09:06] uploaded: https://archive.org/details/cscope.us-google-pdfs-grab-20130312
[09:31] youtube-dl has a bug that is breaking /user/ URLs, you have to move YoutubeUserIE above YoutubePlaylistIE
[10:05] so looks like there are 71 inactive videos missing
[10:05] in the 48000 ids
[11:20] http://blog.archive.org/2013/03/12/riding-with-the-bit-savers/
[15:09] good news
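The index-building and link-extraction steps pasted at 07:54 to 07:55 can be sketched as two small shell functions. This is a rough sketch, not the script that was actually run: the names build_index and extract_urls are made up for illustration, the ten-results-per-page offset and the /url?q= link format with double-quoted href attributes are assumptions about how the result pages looked, and the offset here starts at 0 where the pasted loop started at 10.

```shell
#!/bin/sh
# Sketch: print one Google result-page URL per line for a single query.
# Assumption: each page holds 10 results, so page i starts at offset i*10.
build_index() {
    search=$1   # query string, e.g. "site:cscope.us+filetype:pdf"
    pages=$2    # how many result pages to list
    i=0
    while [ "$i" -lt "$pages" ]; do
        printf '%s\n' "http://google.com/search?q=${search}&start=$((i * 10))"
        i=$((i + 1))
    done
}

# Sketch: read saved result-page HTML on stdin and print the target URLs
# that mention the given domain.
# Assumption: links look like href="/url?q=<target>&..." with double quotes.
extract_urls() {
    domain=$1   # e.g. www.cscope.us
    grep -oE 'href="[^"]+' \
        | grep 'url?q=' \
        | grep -F "$domain" \
        | sed -e 's|.*url?q=||' -e 's|&.*||' \
        | grep -v 'webcache.google' \
        | grep '^http'
}
```

Used along the lines of the log: build_index "site:cscope.us+filetype:pdf" 30 > index.txt, then the wget command quoted at 07:55, then zcat *.warc.gz | extract_urls www.cscope.us to get the list of PDF URLs.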
[15:09] based on warc-proxy my warcs of the forums last month work just fine
[16:10] alard: you've stopped showing the time left on the Available Projects page?
[16:10] Or was it not ever there?
[17:37] Ha ha, I jumped the chain of command posting that.
[17:37] Small error, turns out setting a checkbox on the blog software automatically promotes to front page
[17:38] yeah man you're drowning out vital bitcoin news
[17:39] SketchCow: ;D
[17:39] SketchCow: I like the "Movie showing template" that was posted a few days ago as well (not yours though)
[17:56] this video doesn't exist, it looks like: http://archive.org/details/g4tv.com-video36800
[18:08] ok guys
[18:09] the forum uploads from feb 2013 will have s= links
[18:09] it only worked in warc viewer cause it was using cached data of forums.g4tv.com
[19:43] We're out of the "woods" with space on FOS.
[19:43] 5.7tb free, enough for the time being.
[19:47] https://github.com/ludios/youtube-dl/commits/prime my youtube-dl experience is much better now
[19:57] more ops plz
[20:08] submit pull request, ivan` :)
[20:09] yeah I could definitely use some of those changes
[20:12] balrog_: these are some pretty low-quality diffs
[20:12] the end result is good for me though
[20:14] I am assuming that the filenames on the filesystem are a certain format
[20:14] the blip.tv change doesn't fall back to non-Source when Source isn't available
[20:14] the sleeping is sleeping thrice in a row for reasons unknown to me
[21:18] hey, I've just downloaded the archiveteam-warrior after reading http://jacquesmattheij.com/come-help-save-posterous-from-oblivion# and it says to ask here before starting the posterous project - what do I need to know?
[21:19] Come join us in #preposterus (Project specific channel for Posterous)
[21:21] The warning/notice was put into the Warrior's project page because Posterous might ban your IP. That will make you unable to browse any Posterous blogs/spaces.
If you've read that, select it and go on :-)
[21:22] I'm alright with that - all the ones I read have moved. Thanks for the heads up.
[21:23] There's more activity in #preposterus btw, since that's project specific :) That's where all the project updates happen as well
[21:24] many thanks, I'll try there.
[21:34] If you're here regarding Jacques Mattheij's HackerNews post about Posterous, the project specific channel (Where everything interesting happens) is at #preposterus (on this very same IRC Network)
[21:34] someone put that in the topic
[21:35] Hmm, I guess I'll save the current one and put it back later
[21:36] no I meant add it to the topic
[21:37] I know. I'm doing it right now. Chill pill.
[21:38] stupid topic charlimit
[22:09] Does anyone know- Are there any other long-lived archives other than Archive.org/Wayback that archive webpages in general?
[22:12] uhm
[22:13] well there's always google web cache IMO
[22:13] although
[22:13] it's not a permanent archive iirc
[22:13] nor is it an archive at all
[22:14] you have webcitation.org
[22:14] too
[22:14] it's for references more than archiving
[22:14] but it's decent
[22:15] ersi,
[22:15] Heya
[22:15] have you spoken to webcitation.org staff
[22:15] they seem to be a little troubled
[22:15] About? Posterous?
[22:15] re: https://fundrazr.com/campaigns/aQMp7
[22:15] Oh
[22:15] no
[22:15] about themselves
[22:16] No, I haven't.
[22:16] I'm mostly interested in ones that crawl automatically; there was an incident where an... unscrupulous company decided to upload large numbers of sensitive documents to a site; these need to be removed from as many places as possible.
[22:16] Looks like they're dying :-/
[22:16] CoJaBo: There's Common Crawl.
[22:16] It's not an archive per se, but it effectively is.
[22:16] someone should contact 'em
[22:16] and ask for a backup
[22:16] .torrent or something
[22:17] Calm down with that enter button ;)
[22:17] btw yahoo message boards is shutting down in about half a month
[22:17] anyone doing anything about that???
[22:17] There's #BurnTheMessenger and there's a project page on ArchiveTeam wiki
[22:17] :p sorry, this bad habit is really old
[22:17] started since I started IRCing at DALnet
[22:17] bad habits never die
[22:18] Please keep traffic in this channel low. #archiveteam-bs is for freefloat chat and then there's the project channels.
[22:19] ersi: huh.. is there any way to remove data from there?
[22:19] CoJaBo, what files did they upload, just wondering
[22:19] Andres_: Everything they had access to
[22:20] It was one of those outsourcing web development companies
[22:20] CoJaBo: From Common Crawl?
[22:20] ersi: Yeh, for starters..
[22:25] CoJaBo: I'd try contacting them and asking nicely.
[22:25] I'd consider leaving it there though. In my opinion, if something's been public, let it be public (ish)
[22:29] ersi: Yeh, the customers prolly wouldn't appreciate showing up there tho lol..
[22:31] CoJaBo: Ah, aight. Well, a friendly nod should work.
[22:32] Is there a way to search their data to see if it's even there? Or do you need amazon or whatever to do that..
[22:32] It's def. on Archive.org tho >_>
[22:32] Hell, wtf..... Someone got c99shell on there too, niiiiiiiiccccccceeeeeeeeeeeeeee <_<
[22:33] If it's in the Internet Archive Wayback machine, contact them on info@archive.org and they'll help you out. Please keep in mind that they're not many people and it can take a little while.
[22:34] I think I can just do the Robots.txt thing; tho I guess that way is permanent isn't it..
[22:35] Yes, you can exclude it by adding a robots.txt to the domains that are affected.
[22:35] Ah hell, that's right, they uploaded it to their own site too :/
[22:35] The wayback machine will poll the current domain/robots.txt before showing something from the wayback archives. If it excludes, it won't show it.
[22:35] Yeah, just contact the archive and they'll probably help you out. No biggie.
[22:36] Yeh..... Advice- NEVER hire outsourcing companies LOL...........
[22:37] Yeah, heh.
[22:37] == If you're here regarding Jacques Mattheij's HackerNews post about Posterous, the project specific channel (Where everything interesting happens) is at #preposterus (on this very same IRC Network) ==
[22:38] (Saw a bunch of new clients join up)
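The robots.txt exclusion discussed around 22:34 to 22:35 can be sketched as a short shell function. This is a minimal sketch under assumptions: ia_archiver is the user-agent the Internet Archive documented for Wayback exclusion around the time of this log, the function name write_wayback_exclusion and its docroot argument are hypothetical, and whether the rule is still applied the same way is something to confirm with info@archive.org.

```shell
#!/bin/sh
# Sketch: write a robots.txt asking the Internet Archive's crawler to
# stay out of the whole site. This overwrites any existing robots.txt.
write_wayback_exclusion() {
    docroot=$1   # hypothetical: directory served as the root of the domain
    cat > "${docroot}/robots.txt" <<'EOF'
User-agent: ia_archiver
Disallow: /
EOF
}
```

The file has to be reachable as /robots.txt on each affected domain, since that is the path the 22:35 message says the Wayback Machine polls before showing archived pages.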