[02:15] I think old ringtones are an excellent situation
[02:20] The idea with archive.org usage is to put piles of items together, and then a curated version can be had.
[06:19] I now get the "we don't want to overload the site" message on five out of six threads for xanga. I really miss a distributed mode in the warrior, so that the "warrior decides" mode would just distribute efforts evenly across all the active projects, so it didn't need to just sit idle waiting for slots to appear.
[11:31] Please be advised that on June 30th 2013, we will be updating our Content Policy to strictly prohibit the monetization of Adult content on Blogger. After June 30th 2013, we will be enforcing this policy and will remove blogs which are adult in nature and are displaying advertisements to adult websites.
[11:50] The fuck
[11:54] Sent sometime in the past 14 hours (as far as I can tell) to Blogger-hosted blogs containing "Adult content"
[11:54] So, which Adult content-blog do you have? *smiles*
[11:55] I don't, I just happened to read the front page of one today and saw that
[11:56] And Googled it and found it to be all over the place, so it seems legit
[11:58] Aw, disappointed
[12:00] And I figured I should post it here right away, due to the sweet time limit
[12:04] that's kind of short notice
[12:10] a month?
[12:10] it's more than some things
[12:11] June, not July
[12:12] http://www.itworld.com/security/362522/buy-matthew-broderick-s-old-movie-computer-possibly-impress-ally-sheedy
[12:12] I feel we should have this :D
[12:16] SmileyG: i make it three days
[12:19] Wait, June
[12:19] herp.
[12:41] ----------------------------------------------------
[12:41] The Google Porn Blog Deletion is real, and we need to do something about it.
[12:41] ----------------------------------------------------
[12:50] short related note: We have at least 4,179,274 blogspot.com blog names from the Google Reader grab project (might be more, if ivan` has found more somewhere).
[12:50] Are blogspot and blogger blogs the same kind of deal? I assume the policy will be the same
[13:01] ersi: blogspot.com = blogger
[13:02] Good.
[13:02] So finding more Blogger/Blogspot URLs is a top priority
[13:13] Yes
[13:38] I can give you all my blogspot domains
[13:38] I have way more than 4M
[13:39] ivan`: That'd be great, feel free to throw it up somewhere when you've got time. I guess extraction may take a little while
[13:39] omf_: got any?
[13:42] June 30th? nice advance warning there
[13:42] Please, we need to get moving on it
[13:45] Cogs have started to grind into rollin'
[13:46] It's not a big change of policy though, is it? They still allow adult content, and as far back as 2007 their content policy prohibited having a 'significant number of ads/links to commercial sites'. The change is that they now completely ban such ads.
[13:46] no, the change is that they say they're going to delete blogs with such ads
[13:47] although it doesn't say whether they're doing it proactively or reactively, hmm
[13:47] the change is that they say they will delete blogs that don't follow the terms of use?
[13:47] blogspot is much bigger than anyone imagines
[13:47] * ivan` starts exporting
[13:51] lots and lots and lots of spam on there...
[13:53] So.... #pornspot ?
[13:56] 2,580,425 blogspot subdomains so far, up to ciadosgansos.blogspot.com
[14:06] how are we planning on identifying which ones need grabbing? (it does not seem particularly feasible to grab everything in three days...)
[14:08] I'm sure the text can be grabbed in three days if one really tries
[14:09] I have a tracker and upload target ready if someone wants to write the pipeline
[14:11] I can provide GLaDOS's box as an upload target too
[14:11] and we can poke UnrealInc
[14:11] errr underscor
[14:28] uploading 13M blogspot subdomains
[14:29] ivan`: Thanks!
[14:42] https://ludios.org/tmp/blogspot.com-subdomains.txt.bz2
[14:42] there's a lot of spam in there and most of them are extracted from hrefs and hence not verified in any way
[14:49] Way better than nothing
[15:35] SmileyG, mine are unique domain names, no subdomains to speak of
[15:46] Ah k
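A first cleanup pass over a dump like ivan`'s might look like the sketch below. This is not project code: the filename matches the upload above, but the assumption that lines hold either bare labels or full hostnames, and the DNS-label regex, are mine.

    import bz2
    import re

    # Plausible first pass over the 13M-name dump: the entries were scraped
    # from hrefs and never verified, so lowercase, strip the domain,
    # validate as a DNS label, and dedupe before anything hits the network.
    VALID_LABEL = re.compile(r"^[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?$")

    def clean_subdomains(path="blogspot.com-subdomains.txt.bz2"):
        seen = set()
        with bz2.open(path, "rt", encoding="utf-8", errors="replace") as f:
            for line in f:
                name = line.strip().lower()
                # Tolerate full hostnames as well as bare labels.
                if name.endswith(".blogspot.com"):
                    name = name[:-len(".blogspot.com")]
                if VALID_LABEL.match(name) and name not in seen:
                    seen.add(name)
                    yield name

    if __name__ == "__main__":
        for name in clean_subdomains():
            print(name + ".blogspot.com")

A generator keeps memory flat for the label strings themselves; the dedup set is the only part that grows with the 13M input.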
[21:24] hmm, how do we easily know whether a blog is "monetizing adult content"?
[21:25] start with the longest subdomains first ;)
[21:28] haha
[21:28] also ones that contain "porn", "sex", "adult", "fuck", etc.
[21:40] the best I could come up with is: load the page without any cookies; if you don't get the 'adult content' warning page, it's not adult, so ignore it; if you do, make the appropriate request to accept the warning and look for any ad brokers in the resulting blog page
[21:40] but the last part of that may be tricky. not sure.
[21:41] and that presumes I remember Blogger correctly; I seem to recall a "here be dragons" page, possibly with an orange "yeah, let me at the porns" button
[21:41] hahaha
[21:42] I now wish that was the actual UI pathway
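The heuristic above translates into a short sketch, with heavy caveats: the interstitial marker text, the ad-broker list, and the keyword triage are placeholders to verify against real blogs, and the step that accepts the warning is a stub because the form/cookie mechanics behind that orange button would need inspecting first.

    import requests

    KEYWORDS = ("porn", "sex", "adult", "fuck")            # from the triage above
    AD_BROKERS = ("doubleclick", "exoclick", "plugrush")   # placeholder list
    WARNING_MARKER = "Content Warning"   # assumed interstitial text -- verify!

    def priority(subdomain):
        # Keyword hits first, then longest names first, per the chat above.
        return (not any(k in subdomain for k in KEYWORDS), -len(subdomain))

    def looks_adult(subdomain):
        """First half of the heuristic: fetch the blog with a cookie-free
        request and see whether Blogger serves its adult-content warning."""
        url = "http://%s.blogspot.com/" % subdomain
        r = requests.get(url, timeout=30)
        return r.ok and WARNING_MARKER in r.text

    def accept_warning(subdomain):
        """Placeholder for the 'yeah, let me at the porns' step: the real
        form action or cookie the button sets needs inspecting first."""
        raise NotImplementedError

    def monetizes(html):
        """Second half: crude scan of the post-interstitial blog HTML for
        known ad-broker hostnames."""
        html = html.lower()
        return any(b in html for b in AD_BROKERS)

    # Hypothetical usage; the names are stand-ins for the cleaned dump.
    for name in sorted(["example1", "example2"], key=priority):
        if looks_adult(name):
            print(name, "-> adult; check for ads after accepting the warning")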
[21:57] Hello! Anyone willing to help with an error with un-megawarc-ing a megawarc into its warcs? (AKA How many warcs can my megawarc warc if my megawarc could warc into warcs)
[22:00] 1) why do you want to do that 2) what's the error?
[22:02] 1 - I need to in order to access the archive, right? 2 - File "megawarc", line 128, in copy_to_stream raise Exception("End of file: %d bytes expected, but %d bytes read." % (buf_size, l)) Exception: End of file: 4096 bytes expected, but 236 bytes read.
[22:02] 1) no, you can use the .cdx to seek to any part of the megawarc
[22:03] alard might know what's up with the error
[22:03] you sure you got the whole megawarc?
[22:04] Pretty sure. I got all the files from http://archive.org/details/archiveteam-fanfiction-warc-11
[22:04] How would I use the .cdx to seek to parts of it? Is there any type of guide anywhere?
[22:06] I was looking at the WARC ISO spec but it doesn't actually specify .cdx
[22:10] https://archive.org/web/researcher/cdx_file_format.php https://archive.org/web/researcher/cdx_legend.php
[22:10] http://www.archiveteam.org/index.php?title=The_WARC_Ecosystem there might be a tool somewhere to extract a particular file
[22:13] I've been banging my head against this for a couple of weeks now. I've got Apache Tomcat and I've tried both warcmanager and the wayback archive; neither recognizes the megawarc as a warc. I thought I needed to un-megawarc it back into its smaller warcs in order to get either to work.
[22:14] a megawarc is just a bunch of concatenated warcs
[22:15] something that reads a warc can read a megawarc (assuming it does not get confused by repeated metadata? not sure)
[22:19] Well, how do you access megawarcs?
[22:20] I don't, I just make terabytes of them ;)
[22:20] I use zless to inspect them
[22:20] someone else here may have better ideas
[22:21] Hopefully. Thank you though! :)
[22:40] * graysparr sits and waits and prays for someone that can help.
[22:44] indeed, stick around and maybe grab a real IRC client
[22:45] it's possible that the megawarc needs to be repaired if it was created before megawarc started checking for gzip validity
[22:46] Sorry, used to use mIRC years and years ago. Don't have a need for a client except for this one problem.
[22:50] Any pointers on how, if possible, that repair could be done?
[23:05] There. Just for you ivan` I got 'a real IRC client'. :)
[23:25] * graysparr sighs
[23:26] supposedly megawarc/megawarc-fix can fix a megawarc
[23:38] I've looked at a few tools but they don't appear to use the .cdx file to jump to what you need
[23:38] well, that found one invalid warc and removed it from the megawarc, but I still get the "Exception: End of file: 4096 bytes expected, but 236 bytes read." message when trying to unmegawarc it.
[23:39] * ivan` looks
[23:40] * graysparr praises your name and awaits with bated breath
[23:41] I was just checking whether the code is complete nonsense, and it does not appear to be
[23:45] well poo.
[23:46] * graysparr goes back to sitting and waiting.
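For the record, here is roughly what "use the .cdx to seek" could look like. A sketch under stated assumptions: that the .cdx follows the common " CDX N b a m s k r M S V g" header described at the archive.org links above (S = compressed record size, V = compressed offset, g = filename; field order varies, so always read the header line), and that each record in the megawarc is its own gzip member, which is what lets a decompressor stop cleanly at the member boundary. The .cdx filename and URL substring in the usage example are stand-ins.

    import zlib

    def parse_cdx_line(header, line):
        # The header looks like " CDX N b a m s k r M S V g"; everything
        # after the literal "CDX" names one whitespace-separated field.
        fields = header.split()[1:]
        return dict(zip(fields, line.split()))

    def extract_record(megawarc_path, offset, comp_size):
        """Pull a single WARC record out of a megawarc without unpacking
        the whole thing: seek to the CDX offset and inflate one member."""
        with open(megawarc_path, "rb") as f:
            f.seek(offset)
            member = f.read(comp_size)
        d = zlib.decompressobj(16 + zlib.MAX_WBITS)  # 16+: expect gzip header
        return d.decompress(member)

    # Hypothetical usage: find one URL and print the start of its record.
    with open("archiveteam-fanfiction.cdx") as cdx:
        header = next(cdx)
        for line in cdx:
            rec = parse_cdx_line(header, line)
            if "some-story" in rec["N"]:    # N = massaged URL
                # g names the megawarc; map it to your local path if needed.
                data = extract_record(rec["g"], int(rec["V"]), int(rec["S"]))
                print(data[:200].decode("utf-8", "replace"))
                break

A side benefit: if seeks near the end of the file fail here too, that corroborates the truncated-download theory from 22:03 rather than a bug in the unmegawarc tooling.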