#archiveteam 2013-06-27,Thu

Time	Nickname	Message
02:15 ^🔗	SketchCow	I think old ringtones are an excellent situation
02:20 ^🔗	SketchCow	The idea with archive.org usage is to put piles of items together, and then a curated version can be had.
06:19 ^🔗	menacespb	I now get the "we don't want to overload the site" message on five out of 6 threads for xanga. I really miss a distributed mode in the warrior, so that the "warrior decides" mode would just distribute efforts evenly across all the active projects, so it didn't need to just sit idle waiting for slots to appear.
11:31 ^🔗	Deewiant	Please be advised that on June 30th 2013, we will be updating our Content Policy to strictly prohibit the monetization of Adult content on Blogger. After June 30th 2013, we will be enforcing this policy and will remove blogs which are adult in nature and are displaying advertisements to adult websites.
11:50 ^🔗	ersi	The fuck
11:54 ^🔗	Deewiant	Sent sometime in the past 14 hours (as far as I can tell) to Blogger-hosted blogs containing "Adult content"
11:54 ^🔗	ersi	So, which Adult content-blog do you have? smiles
11:55 ^🔗	Deewiant	I don't, I just happened to read the front page of one today and saw that
11:56 ^🔗	Deewiant	And Googled it and found it to be all over the place so it seems legit
11:58 ^🔗	ersi	Aw, disappointed
12:00 ^🔗	Deewiant	And I figured I should post it here right away, due to the sweet time limit
12:04 ^🔗	winr4r	that's kind of short notice
12:10 ^🔗	SmileyG	a month?
12:10 ^🔗	SmileyG	it's more than some things
12:11 ^🔗	Deewiant	June, not July
12:12 ^🔗	SmileyG	http://www.itworld.com/security/362522/buy-matthew-broderick-s-old-movie-computer-possibly-impress-ally-sheedy
12:12 ^🔗	SmileyG	I feel we should have this :D
12:16 ^🔗	winr4r	SmileyG: i make it three days
12:19 ^🔗	SmileyG	Wait, June
12:19 ^🔗	SmileyG	herp.
12:41 ^🔗	SketchCow	----------------------------------------------------
12:41 ^🔗	SketchCow	The Google Porn Blog Deletion is real, and we need to do something about it.
12:41 ^🔗	SketchCow	----------------------------------------------------
12:50 ^🔗	ersi	short related note: We have at least 4,179,274 blogspot.com blog-names from the Google Reader grab project (Might be more, if ivan` has found more somewhere).
12:50 ^🔗	ersi	Is blogspot and blogger blogs the same kind of deal? I assume the policy will be the same
13:01 ^🔗	winr4r	ersi: blogspot.com = blogger
13:02 ^🔗	ersi	Good.
13:02 ^🔗	ersi	So finding more Blogger/Blogspot URLs is a top priority
13:13 ^🔗	SketchCow	Yes
13:38 ^🔗	ivan`	I can give you all my blogspot domains
13:38 ^🔗	ivan`	I have way more than 4M
13:39 ^🔗	ersi	ivan`: That'd be great, feel free to throw it up somewhere when you got time. I guess extraction may take a little while
13:39 ^🔗	SmileyG	omf_: got any?
13:42 ^🔗	ivan`	June 30th? nice advance warning there
13:42 ^🔗	SketchCow	Please, we need to get moving on it
13:45 ^🔗	ersi	Cogs have started to grind into rollin'
13:46 ^🔗	TrojanEel	It's not a big change of policy though, is it? They still allow adult content, and as far back as 2007 their content policy prohibited having a 'significant number of ads/links to commercial sites'. The change is that they now completely ban such ads.
13:46 ^🔗	Baljem	no, the change is that they say they're going to delete blogs with such ads
13:47 ^🔗	Baljem	although it doesn't say whether they're doing it proactively or reactively, hmm
13:47 ^🔗	TrojanEel	the change is that they say they will delete blogs that don't follow the terms of use?
13:47 ^🔗	ivan`	blogspot is much bigger than anyone imagines
13:47 ^🔗	*	ivan` starts exporting
13:51 ^🔗	SmileyG	lots and lots and lots of spam on there...
13:53 ^🔗	SmileyG	So.... #pornspot ?
13:56 ^🔗	ivan`	2,580,425 blogspot subdomains so far up to ciadosgansos.blogspot.com
14:06 ^🔗	Baljem	how are we planning on identifying which ones need grabbing? (it does not seem particularly feasible to grab everything in three days...)
14:08 ^🔗	ivan`	I'm sure the text can be grabbed in three days if one really tries
14:09 ^🔗	ivan`	I have a tracker and upload target ready if someone wants to write the pipeline
14:11 ^🔗	SmileyG	I can provide GLaDOS's box as upload target too
14:11 ^🔗	SmileyG	and we can poke UnrealInc
14:11 ^🔗	SmileyG	errr underscor
14:28 ^🔗	ivan`	uploading 13M blogspot subdomains
14:29 ^🔗	ersi	ivan`: Thanks!
14:42 ^🔗	ivan`	https://ludios.org/tmp/blogspot.com-subdomains.txt.bz2
14:42 ^🔗	ivan`	there's a lot of spam in there and most of them are extracted from hrefs and hence not verified in any way
14:49 ^🔗	ersi	Way better than nothing
15:35 ^🔗	omf_	SmileyG, mine are unique domain names, no sub-domains to speak of
15:46 ^🔗	SmileyG	Ah k
21:24 ^🔗	underscor	hmm, how do we easily know whether a blog is "monetizing adult content"?
21:25 ^🔗	ivan`	start with the longest subdomains first ;)
21:28 ^🔗	underscor	haha
21:28 ^🔗	underscor	also ones that contain "porn", "sex", "adult", "fuck", etc
21:40 ^🔗	Baljem	the best I could come up with is "load page w/o any cookies, if you don't get the 'adult content' warning page it's not adult so ignore it; if you do, do the appropriate request to accept the warning and look for any ad brokers in the resulting blog page"
21:40 ^🔗	Baljem	but the last part of that may be tricky. not sure.
21:41 ^🔗	Baljem	and presumes I remember Blogger correctly; I seem to recall a "here be dragons" page, possibly with an orange "yeah, let me at the porns" button
21:41 ^🔗	underscor	hahaha
21:42 ^🔗	underscor	I now wish that was the actual UI pathway
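The subdomain heuristics floated at 21:25 and 21:28 (longest names first, plus keyword matches) can be sketched roughly as below. This is only an illustration of the idea, not anything the channel actually ran: the function name `prioritize` and the `KEYWORDS` tuple are hypothetical, and plain substring matching will produce false positives (for example "essex" contains "sex").

```python
# Hypothetical keyword list, taken from the terms mentioned in the channel.
KEYWORDS = ("porn", "sex", "adult", "fuck")
SUFFIX = ".blogspot.com"


def prioritize(hostnames, keywords=KEYWORDS):
    """Return blogspot hostnames whose subdomain contains a keyword, longest first."""
    hits = []
    for host in hostnames:
        host = host.strip().lower()
        if not host.endswith(SUFFIX):
            continue  # not a blogspot subdomain, skip it
        subdomain = host[: -len(SUFFIX)]
        if any(word in subdomain for word in keywords):
            hits.append(host)
    # Longest hostnames first; ties broken alphabetically for stable output.
    return sorted(hits, key=lambda h: (-len(h), h))
```

A list like this would only order the grab queue; deciding whether a blog actually "monetizes adult content" would still need the page-level check Baljem describes.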
21:57 ^🔗	graysparr	Hello! Anyone willing to help with an error with un-megawarc-ing a megawarc into its warcs? (AKA How many warcs can my megawarc warc if my megawarc could warc into warcs)
22:00 ^🔗	ivan`	1) why do you want to do that 2) what's the error?
22:02 ^🔗	graysparr	1 - I need to in order to access the archive, right? 2 - File "megawarc", line 128, in copy_to_stream raise Exception("End of file: %d bytes expected, but %d bytes read." % (buf_size, l)) Exception: End of file: 4096 bytes expected, but 236 bytes read.
22:02 ^🔗	ivan`	1) no, you can use the .cdx to seek to any part of the megawarc
22:03 ^🔗	ivan`	alard might know what's up with the error
22:03 ^🔗	ivan`	you sure you got the whole megawarc?
22:04 ^🔗	graysparr	Pretty sure. I got all the files from http://archive.org/details/archiveteam-fanfiction-warc-11
22:04 ^🔗	graysparr	How would I use the .cdx to seek parts of it? Is there any type of guide anywhere?
22:06 ^🔗	ivan`	I was looking at the WARC ISO spec but it doesn't actually specify .cdx
22:10 ^🔗	ivan`	https://archive.org/web/researcher/cdx_file_format.php https://archive.org/web/researcher/cdx_legend.php
22:10 ^🔗	ivan`	http://www.archiveteam.org/index.php?title=The_WARC_Ecosystem there might be a tool somewhere to extract a particular file
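A minimal sketch of the "seek with the .cdx" idea, assuming the offset column in the .cdx is the byte offset at which a record's own gzip member starts inside the compressed megawarc. The function name `read_record_at` is hypothetical, and parsing the .cdx line itself is left out because the column layout depends on the header line of that particular file.

```python
import zlib


def read_record_at(path, offset, chunk_size=65536):
    """Decompress the single gzip member that starts at byte `offset` of `path`."""
    # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
    parts = []
    with open(path, "rb") as f:
        f.seek(offset)
        # Stop as soon as this member ends; later members are ignored.
        while not decomp.eof:
            chunk = f.read(chunk_size)
            if not chunk:
                raise EOFError("file ended inside a gzip member")
            parts.append(decomp.decompress(chunk))
    return b"".join(parts)
```

The returned bytes would be one WARC record (headers plus payload). If the offset does not point at the start of a gzip member, `zlib.error` is raised, which is a quick way to notice that the offset column was misread.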
22:13 ^🔗	graysparr	I've been banging my head against this for a couple weeks now. I've got apache tomcat and i've tried both warcmanager and the wayback archive, neither recognizes the megawarc as a warc. I thought I needed to unmegawarc it back into its smaller warcs in order to get either to work.
22:14 ^🔗	ivan`	a megawarc is just a bunch of concatenated warcs
22:15 ^🔗	ivan`	something that reads a warc can read a megawarc (assuming it does not get confused by repeated metadata? not sure)
22:19 ^🔗	graysparr	Well, how do you access megawarcs?
22:20 ^🔗	ivan`	I don't, I just make terabytes of them ;)
22:20 ^🔗	ivan`	I use zless to inspect them
22:20 ^🔗	ivan`	someone else here may have better ideas
22:21 ^🔗	graysparr	Hopefully. Thank you though! :)
22:40 ^🔗	*	graysparr sits and waits and prays for someone that can help.
22:44 ^🔗	ivan`	indeed, stick around and maybe grab a real IRC client
22:45 ^🔗	ivan`	it's possible that megawarc needs to be repaired if it was created before megawarc started checking for gzip validity
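A rough way to check for the kind of gzip damage mentioned here, assuming a megawarc is a plain concatenation of gzip members. This is an independent sketch and not what megawarc or megawarc-fix does internally; `scan_gzip_members` is a hypothetical name. It walks the file member by member and reports where the first truncated or invalid member begins.

```python
import zlib


def scan_gzip_members(path, chunk_size=65536):
    """Return (complete_members, offset_of_first_bad_member or None)."""
    complete = 0
    member_start = 0   # offset where the member being decoded began
    pos = 0            # offset of the first byte of `data`
    pending = b""      # bytes left over after the previous member ended
    in_member = False  # True while a member has started but not finished
    decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
    with open(path, "rb") as f:
        while True:
            data = pending or f.read(chunk_size)
            pending = b""
            if not data:
                break
            try:
                decomp.decompress(data)  # output is discarded, we only validate
            except zlib.error:
                return complete, member_start
            if decomp.eof:
                # Member finished; whatever follows belongs to the next one.
                pending = decomp.unused_data
                pos += len(data) - len(pending)
                member_start = pos
                complete += 1
                in_member = False
                decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
            else:
                pos += len(data)
                in_member = True
    # A member still open at end of file means the file is truncated.
    return complete, (member_start if in_member else None)
```

A non-`None` offset near the end of the file would be consistent with an incomplete download, which is one possible reading of the "4096 bytes expected, but 236 bytes read" error.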
22:46 ^🔗	graysparr	Sorry, used to use mIRC years and years ago. Don't have a need for a client except for this one problem.
22:50 ^🔗	graysparr	Any pointers on how, if possible, that repair could be done?
23:05 ^🔗	graysparr	There. Just for you ivan` I got 'a real IRC client'. :)
23:25 ^🔗	*	graysparr sighs
23:26 ^🔗	ivan`	supposedly megawarc/megawarc-fix can fix a megawarc
23:38 ^🔗	ivan`	I've looked at a few tools but they don't appear to use the .cdx file to jump to what you need
23:38 ^🔗	graysparr	well that found one invalid warc and removed it from the megawarc, but I still get the "Exception: End of file: 4096 bytes expected, but 236 bytes read." message when trying to unmegawarc it.
23:39 ^🔗	*	ivan` looks
23:40 ^🔗	*	graysparr praises your name and awaits with bated breath
23:41 ^🔗	ivan`	I was just checking if the code is not complete nonsense, and it does not appear to be
23:45 ^🔗	graysparr	well poo.
23:46 ^🔗	*	graysparr goes back to sitting and waiting.