#archiveteam 2013-06-27,Thu

Time Nickname Message
02:15 🔗 SketchCow I think old ringtones are an excellent situation
02:20 🔗 SketchCow The idea with archive.org usage is to put piles of items together, and then a curated version can be had.
06:19 🔗 menacespb I now get the "we don't want to overload the site" message on five out of 6 threads for xanga. I really miss a distributed mode in the warrior, so that the "warrior decides" mode would just distribute efforts evenly across all the active projects, so it didn't need to just sit idle waiting for slots to appear.
11:31 🔗 Deewiant Please be advised that on June 30th 2013, we will be updating our Content Policy to strictly prohibit the monetization of Adult content on Blogger. After June 30th 2013, we will be enforcing this policy and will remove blogs which are adult in nature and are displaying advertisements to adult websites.
11:50 🔗 ersi The fuck
11:54 🔗 Deewiant Sent sometime in the past 14 hours (as far as I can tell) to Blogger-hosted blogs containing "Adult content"
11:54 🔗 ersi So, which Adult content-blog do you have? *smiles*
11:55 🔗 Deewiant I don't, I just happened to read the front page of one today and saw that
11:56 🔗 Deewiant And Googled it and found it to be all over the place so it seems legit
11:58 🔗 ersi Aw, disappointed
12:00 🔗 Deewiant And I figured I should post it here right away, due to the sweet time limit
12:04 🔗 winr4r that's kind of short notice
12:10 🔗 SmileyG a month?
12:10 🔗 SmileyG it's more than some things
12:11 🔗 Deewiant June, not July
12:12 🔗 SmileyG http://www.itworld.com/security/362522/buy-matthew-broderick-s-old-movie-computer-possibly-impress-ally-sheedy
12:12 🔗 SmileyG I feel we should have this :D
12:16 🔗 winr4r SmileyG: i make it three days
12:19 🔗 SmileyG Wait, June
12:19 🔗 SmileyG herp.
12:41 🔗 SketchCow ----------------------------------------------------
12:41 🔗 SketchCow The Google Porn Blog Deletion is real, and we need to do something about it.
12:41 🔗 SketchCow ----------------------------------------------------
12:50 🔗 ersi short related note: We have at least 4,179,274 blogspot.com blog-names from the Google Reader grab project (Might be more, if ivan` has found more somewhere).
12:50 🔗 ersi Are blogspot and blogger blogs the same kind of deal? I assume the policy will be the same
13:01 🔗 winr4r ersi: blogspot.com = blogger
13:02 🔗 ersi Good.
13:02 🔗 ersi So finding more Blogger/Blogspot URLs is a top priority
13:13 🔗 SketchCow Yes
13:38 🔗 ivan` I can give you all my blogspot domains
13:38 🔗 ivan` I have way more than 4M
13:39 🔗 ersi ivan`: That'd be great, feel free to throw it up somewhere when you got time. I guess extraction may take a little while
13:39 🔗 SmileyG omf_: got any?
13:42 🔗 ivan` June 30th? nice advance warning there
13:42 🔗 SketchCow Please, we need to get moving on it
13:45 🔗 ersi Cogs have started to grind into rollin'
13:46 🔗 TrojanEel It's not a big change of policy though, is it? They still allow adult content, and as far back as 2007 their content policy prohibited having a 'significant number of ads/links to commercial sites'. The change is that they now completely ban such ads.
13:46 🔗 Baljem no, the change is that they say they're going to delete blogs with such ads
13:47 🔗 Baljem although it doesn't say whether they're doing it proactively or reactively, hmm
13:47 🔗 TrojanEel the change is that they say they will delete blogs that don't follow the terms of use?
13:47 🔗 ivan` blogspot is much bigger than anyone imagines
13:47 🔗 * ivan` starts exporting
13:51 🔗 SmileyG lots and lots and lots of spam on there...
13:53 🔗 SmileyG So.... #pornspot ?
13:56 🔗 ivan` 2,580,425 blogspot subdomains so far up to ciadosgansos.blogspot.com
14:06 🔗 Baljem how are we planning on identifying which ones need grabbing? (it does not seem particularly feasible to grab everything in three days...)
14:08 🔗 ivan` I'm sure the text can be grabbed in three days if one really tries
14:09 🔗 ivan` I have a tracker and upload target ready if someone wants to write the pipeline
14:11 🔗 SmileyG I can provide GLaDOS's box as upload target too
14:11 🔗 SmileyG and we can poke UnrealInc
14:11 🔗 SmileyG errr underscor
14:28 🔗 ivan` uploading 13M blogspot subdomains
14:29 🔗 ersi ivan`: Thanks!
14:42 🔗 ivan` https://ludios.org/tmp/blogspot.com-subdomains.txt.bz2
14:42 🔗 ivan` there's a lot of spam in there and most of them are extracted from hrefs and hence not verified in any way
14:49 🔗 ersi Way better than nothing
15:35 🔗 omf_ SmileyG, mine are unique domain names, no sub-domains to speak of
15:46 🔗 SmileyG Ah k
21:24 🔗 underscor hmm, how do we easily know whether a blog is "monetizing adult content"?
21:25 🔗 ivan` start with the longest subdomains first ;)
21:28 🔗 underscor haha
21:28 🔗 underscor also ones that contain "porn", "sex", "adult", "fuck", etc
21:40 🔗 Baljem the best I could come up with is "load page w/o any cookies, if you don't get the 'adult content' warning page it's not adult so ignore it; if you do, do the appropriate request to accept the warning and look for any ad brokers in the resulting blog page"
21:40 🔗 Baljem but the last part of that may be tricky. not sure.
21:41 🔗 Baljem and presumes I remember Blogger correctly; I seem to recall a "here be dragons" page, possibly with an orange "yeah, let me at the porns" button
21:41 🔗 underscor hahaha
21:42 🔗 underscor I now wish that was the actual UI pathway
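
The two subdomain heuristics floated above (longest names first, and names containing words like "porn", "sex", "adult", "fuck") can be sketched as a sort key over a list of blogspot hostnames. This is only a rough illustration, not anything that was actually run for the project: the function names `subdomain` and `prioritize` are made up here, and the approach only looks at the hostname, so it says nothing about whether a blog really carries ads.

```python
# Keywords taken from the chat; the list is illustrative, not exhaustive.
KEYWORDS = ("porn", "sex", "adult", "fuck")
SUFFIX = ".blogspot.com"


def subdomain(host):
    """Return the part of the hostname before .blogspot.com, lowercased."""
    host = host.strip().lower()
    return host[:-len(SUFFIX)] if host.endswith(SUFFIX) else host


def prioritize(hosts, keywords=KEYWORDS):
    """Order hosts: keyword matches first, then longest subdomain first."""
    def key(host):
        name = subdomain(host)
        has_keyword = any(k in name for k in keywords)
        # False sorts before True, so negate the match flag;
        # negative length puts longer names first; name breaks ties.
        return (not has_keyword, -len(name), name)
    return sorted(hosts, key=key)
```

Feeding it the lines of a subdomain list would give a crawl order with the likeliest candidates at the front; plain substring matching will produce false positives, so it is a prioritization aid rather than a classifier.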
21:57 🔗 graysparr Hello! Anyone willing to help with an error with un-megawarc-ing a megawarc into its warcs? (AKA How many warcs can my megawarc warc if my megawarc could warc into warcs)
22:00 🔗 ivan` 1) why do you want to do that 2) what's the error?
22:02 🔗 graysparr 1 - I need to in order to access the archive, right? 2 - File "megawarc", line 128, in copy_to_stream raise Exception("End of file: %d bytes expected, but %d bytes read." % (buf_size, l)) Exception: End of file: 4096 bytes expected, but 236 bytes read.
22:02 🔗 ivan` 1) no, you can use the .cdx to seek to any part of the megawarc
22:03 🔗 ivan` alard might know what's up with the error
22:03 🔗 ivan` you sure you got the whole megawarc?
22:04 🔗 graysparr Pretty sure. I got all the files from http://archive.org/details/archiveteam-fanfiction-warc-11
22:04 🔗 graysparr How would I use the .cdx to seek parts of it? Is there any type of guide anywhere?
22:06 🔗 ivan` I was looking at the WARC ISO spec but it doesn't actually specify .cdx
22:10 🔗 ivan` https://archive.org/web/researcher/cdx_file_format.php https://archive.org/web/researcher/cdx_legend.php
22:10 🔗 ivan` http://www.archiveteam.org/index.php?title=The_WARC_Ecosystem there might be a tool somewhere to extract a particular file
22:13 🔗 graysparr I've been banging my head against this for a couple weeks now. I've got apache tomcat and i've tried both warcmanager and the wayback archive, neither recognizes the megawarc as a warc. I thought I needed to unmegawarc it back into its smaller warcs in order to get either to work.
22:14 🔗 ivan` a megawarc is just a bunch of concatenated warcs
22:15 🔗 ivan` something that reads a warc can read a megawarc (assuming it does not get confused by repeated metadata? not sure)
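
As a rough sketch of the "use the .cdx to seek" idea: if the megawarc is a .warc.gz in which every record is its own gzip member (an assumption, not something confirmed in this log), then a byte offset taken from the CDX file is enough to pull out one record without unpacking the rest. The function name `read_record` is hypothetical, and which CDX column holds the compressed offset depends on the header line of the particular .cdx file.

```python
import zlib


def read_record(path, offset, chunk_size=65536):
    """Decompress the single gzip member that starts at byte `offset`.

    Assumes `offset` points at the start of a gzip member, e.g. a
    compressed-file offset read from a CDX line.
    """
    decomp = zlib.decompressobj(wbits=31)  # 31 = expect gzip framing
    parts = []
    with open(path, "rb") as f:
        f.seek(offset)
        # decomp.eof becomes True once this member's gzip trailer is read;
        # any bytes after it belong to the next record and are ignored.
        while not decomp.eof:
            chunk = f.read(chunk_size)
            if not chunk:
                raise EOFError("file ended inside a gzip member")
            parts.append(decomp.decompress(chunk))
    return b"".join(parts)
```

The returned bytes would be one WARC record (headers plus payload), which can then be parsed or written out on its own.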
22:19 🔗 graysparr Well, how do you access megawarcs?
22:20 🔗 ivan` I don't, I just make terabytes of them ;)
22:20 🔗 ivan` I use zless to inspect them
22:20 🔗 ivan` someone else here may have better ideas
22:21 🔗 graysparr Hopefully. Thank you though! :)
22:40 🔗 * graysparr sits and waits and prays for someone that can help.
22:44 🔗 ivan` indeed, stick around and maybe grab a real IRC client
22:45 🔗 ivan` it's possible that megawarc needs to be repaired if it was created before megawarc started checking for gzip validity
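
One way to check gzip validity independently of the megawarc tool is to walk the file member by member and note where decoding stops. The sketch below is not how megawarc itself does it; it assumes the file is a plain concatenation of gzip members, and `check_members` is a made-up name.

```python
import zlib


def check_members(path, chunk_size=65536):
    """Return (offsets, ok): start offsets of the gzip members found, and
    whether the whole file decoded cleanly. If ok is False, the last
    offset is the member that is corrupt or truncated."""
    offsets = []
    with open(path, "rb") as f:
        pos = 0                      # file offset of the start of `buf`
        buf = f.read(chunk_size)
        while buf:
            offsets.append(pos)
            decomp = zlib.decompressobj(wbits=31)  # gzip framing
            while True:
                try:
                    decomp.decompress(buf)  # output is discarded
                except zlib.error:
                    return offsets, False   # corrupt data or bad CRC
                if decomp.eof:
                    break
                pos += len(buf)
                buf = f.read(chunk_size)
                if not buf:
                    return offsets, False   # file ends mid-member
            # Bytes left over after the trailer start the next member.
            pos += len(buf) - len(decomp.unused_data)
            buf = decomp.unused_data or f.read(chunk_size)
    return offsets, True
```

If the result is `ok == False`, the last offset shows roughly where the damage begins, which can help decide whether re-downloading the file is worth trying before attempting a repair.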
22:46 🔗 graysparr Sorry, used to use mIRC years and years ago. Don't have a need for a client except for this one problem.
22:50 🔗 graysparr Any pointers on how, if possible, that repair could be done?
23:05 🔗 graysparr There. Just for you ivan` I got 'a real IRC client'. :)
23:25 🔗 * graysparr sighs
23:26 🔗 ivan` supposedly megawarc/megawarc-fix can fix a megawarc
23:38 🔗 ivan` I've looked at a few tools but they don't appear to use the .cdx file to jump to what you need
23:38 🔗 graysparr well that found one invalid warc and removed it from the megawarc, but I still get the "Exception: End of file: 4096 bytes expected, but 236 bytes read." message when trying to unmegawarc it.
23:39 🔗 * ivan` looks
23:40 🔗 * graysparr praises your name and awaits with bated breath
23:41 🔗 ivan` I was just checking if the code is not complete nonsense, and it does not appear to be
23:45 🔗 graysparr well poo.
23:46 🔗 * graysparr goes back to sitting and waiting.
