#archiveteam 2012-10-09,Tue


Time Nickname Message
05:09 🔗 SketchCow http://archive.org/details/webshots-freeze-frame
05:19 🔗 DFJustin parking some more cars in the IA driveway eh
05:57 🔗 godane SketchCow: I'm uploading 132 ISOs of Linux Format
05:58 🔗 godane using your naming for the ISOs so they're standard and easy to find
14:38 🔗 SmileyG Jason, can you pull my new version of the front page plz, http://archiveteam.org/index.php?title=Djsmiley2k/main_page
14:39 🔗 SmileyG And possibly tweet to followers about Warrior? - ( I tweeted Oi followers - webshots are going to delete all member photos - were backing them up - HELP http://archiveteam.org/index.php?title=ArchiveTeam_Warrior & Download, VM - PLZ RT ) I'm sure you can come up with something more witty.
17:13 🔗 Zym_ can anyone tell me how the archiveteam warrior works? I'm not sure if mine's doing anything productive...
17:14 🔗 soultcer Zym_: It has a webinterface, try http://localhost:8001/
17:18 🔗 Zym_ got that. I'm still not sure what happens under the surface because all I get is a GUI but no output
18:43 🔗 SketchCow http://bt.custhelp.com/app/answers/detail/a_id/39105/~/the-free-bt-web-hosting-service-is-closing
18:43 🔗 SketchCow IMMEDIATELY.
18:47 🔗 joepie91 uhoh
18:47 🔗 C-Keen To access our FTP site using a web browser, type the following into your browser:
18:47 🔗 C-Keen ftp://username:password@ftp.btinternet.com
18:47 🔗 C-Keen Once you're logged in you'll see an FTP listing of directories.
18:47 🔗 C-Keen OREALLY?
18:47 🔗 C-Keen FTP listing of directories
18:47 🔗 C-Keen You may have to select the "pub" directory to view your files. Once you have found your files, right click on the ones you want to save, then choose "Save Target as".
18:57 🔗 soultcer How do we usually archive such small webhosting providers? wget-warc? heritrix?
19:06 🔗 SmileyG root:*:0:1:Operator:/:
19:06 🔗 SmileyG bin:*:2:2::/:/bin/csh
19:06 🔗 SmileyG daemon:*:1:1::/:
19:06 🔗 SmileyG ftp:*:99:99:ftp user:/var/ftp-anon:nosuchshell
19:06 🔗 SmileyG sys:*:3:3::/bin:
19:06 🔗 SmileyG nobody:*:65534:65534::/:
19:06 🔗 SmileyG lulz?
19:06 🔗 C-Keen that's the ftp server?
19:06 🔗 SmileyG yes.
19:06 🔗 C-Keen lolwut?
19:07 🔗 SmileyG they have hidden directory listings but not actually blocked them
19:07 🔗 C-Keen should make backup easier :p
19:07 🔗 SmileyG it's a chroot though, I think.
19:10 🔗 SmileyG https://www.google.co.uk/#q=site:btinternet.com&hl=en&safe=off&prmd=imvns&ei=cnV0ULvMKcvK0AW1_oDwDg&start=10&sa=N&bav=on.2,or.r_gc.r_pw.r_cp.r_qf.&fp=70cfab5615ed83a6&biw=888&bih=625
19:10 🔗 SmileyG plenty of sites :S
19:11 🔗 C-Keen indeed
19:11 🔗 C-Keen how can we get a list of them all?
19:13 🔗 C-Keen http://www.btinternet.com/~memoriesofmax/ <-
19:13 🔗 C-Keen that's what we do it for...
19:17 🔗 SketchCow Really?
19:17 🔗 SketchCow I do it for http://www.btinternet.com/~carol.stirling2/New_Folder/whips_and_crops.htm
19:17 🔗 C-Keen kinky
19:18 🔗 SketchCow http://www.btinternet.com/~shawater/warning.html
19:18 🔗 C-Keen this looks like a tiny geocities
19:18 🔗 SketchCow It IS a tiny Geocities.
19:19 🔗 SmileyG SketchCow: that's "Does it for me" ;)
19:19 🔗 SmileyG Do we have anything to rip google results yet?
19:19 🔗 alard We could put it on the warrior.
19:19 🔗 alard SmileyG: https://gist.github.com/2788197d2db2779cb7b0
19:19 🔗 SmileyG ipv6? :S
19:20 🔗 alard Is faster.
19:20 🔗 SmileyG not if I don't already have IPv6 setup :S
19:20 🔗 alard No. Do you want to be blocked by Google?
19:20 🔗 SmileyG For how long :S
19:21 🔗 alard Well, the trick is: if you use ipv6 you can query at a much higher rate.
19:21 🔗 SmileyG D:
19:21 🔗 C-Keen I can do it and hand out results...
19:21 🔗 SmileyG Looks like there's only 10 pages of google results before they turn into random pages on the sites...
19:21 🔗 C-Keen pestilenz.org has got a native connection and not really a bandwidth limit as our hoster does not have a billing plan for v6 in place
19:22 🔗 alard C-Keen: You do need an ipv6-over-ipv4 tunnel from tunnelbroker.net.
19:22 🔗 C-Keen alard: aawwww
19:22 🔗 alard Because you need a /48 subnet.
19:23 🔗 C-Keen it's only a /64
19:23 🔗 joepie91 <Migs>LEGAL torrenting is legal
19:23 🔗 joepie91 <Migs>torrenting is not
19:23 🔗 joepie91 <joepie91>torrenting IS legal
19:23 🔗 joepie91 prepare to laugh
19:24 🔗 SmileyG lols
19:24 🔗 SmileyG WHere is this? I want to point and laugh
19:24 🔗 SmileyG car analogy
19:24 🔗 SmileyG "Driving is legal"
19:24 🔗 SmileyG "Legal driving is legal" "Driving is not"
19:24 🔗 SmileyG :D
19:24 🔗 joepie91 SmileyG: #webtech on freenode
19:28 🔗 SmileyG Not worth wasting your time over joepie91
19:31 🔗 C-Keen hm, using the google search api limits you to 100 search results per day. how dumb is that
19:31 🔗 C-Keen has anyone tried and actually *asked* btinternet for a list of all the sites?
19:34 🔗 SmileyG so I'm setting up ipv6 tunnel for future usage :<
19:35 🔗 joepie91 seriously
19:35 🔗 joepie91 these people are sad
19:35 🔗 SmileyG joepie91: haters, don't waste time on them
19:35 🔗 joepie91 no, not haters
19:35 🔗 joepie91 just straight out idiots
19:38 🔗 * SketchCow is desperately trying to upload everything off FOS while this insane webshots thing is going down.
19:38 🔗 SketchCow Just in case you're wondering what has my attention.
19:39 🔗 SmileyG http://www.btinternet.com/new/content/custhome/listings/A.shtml
19:39 🔗 SmileyG BINGO
19:39 🔗 SmileyG This isn't all sites
19:39 🔗 SmileyG but a large chunk :D
19:40 🔗 SmileyG NB Sites are not listed automatically however the listing submission process is currently unavailable while the system is being redesigned. Apologies for the inconvenience. << not sure what happened there.
19:41 🔗 soultcer I bet they actually add all sites manually
19:41 🔗 SmileyG although _a lot_ of sites are dead
19:41 🔗 C-Keen the first two entries on that site are broken
19:42 🔗 C-Keen maybe it is already out of date
19:42 🔗 SmileyG http://www.accordions.btinternet.co.uk/ yeaaaah
19:42 🔗 C-Keen it seems to always redirect me to that yahoo login page
19:42 🔗 SmileyG yeah the last one works :S
19:43 🔗 soultcer Why is it anytime something about deleting data comes up, it's Yahoo that is the culprit?
19:43 🔗 primus yahoo is starting to be annoying, do these guys actually keep anything they buy?
19:43 🔗 SmileyG lol
19:43 🔗 SmileyG "streamline the business"
19:43 🔗 SmileyG Remember, they want to go 482mph.
19:44 🔗 primus maybe redbull can sponsor the balloon for them too ....
19:45 🔗 C-Keen SmileyG: seems like that index is all the way broken
19:45 🔗 SmileyG C-Keen: the last link worked :(
19:45 🔗 SmileyG hence why I was celebrating, only to realise the rest didn't. :(
19:45 🔗 SmileyG Is there any index that doesn't ban you?
19:45 🔗 SmileyG duckduckgo for example?
19:46 🔗 C-Keen the duckduckgo folks might even contribute the stuff in a sane way
19:46 🔗 SmileyG prepacked list?
19:47 🔗 C-Keen for example!
19:47 🔗 * SmileyG asks some guys who love ddg.
19:49 🔗 joepie91 god
19:49 🔗 joepie91 SmileyG: these guys are complete retards, seriously... I give them a techdirt link after they ask me for a legal source saying that copyright infringement =/= theft
19:49 🔗 joepie91 and they are so busy bitching about me and stroking their own ego that they don't even notice that the legal document itself is embedded on the page
19:50 🔗 SmileyG joepie91: ¬_¬ I deal with people like this all day. Not worth the hassle
19:50 🔗 joepie91 and keep claiming that I should 'show a legal document' and that 'techdirt is not a legal source'
19:50 🔗 SmileyG then again I get involved too
19:50 🔗 C-Keen SmileyG: I am asking on their "official IRC channel"
19:50 🔗 SmileyG some guy was telling me earlier how " " == ""
19:50 🔗 SmileyG i.e. space == null
19:51 🔗 C-Keen 21:51 < yano> duckduckgo doesn't provide more than the 0-click and the first result via it's API
19:51 🔗 C-Keen 21:51 < crazedpsyc> we don't and can't allow that due to licensing on the results
19:52 🔗 C-Keen SmileyG: ^
19:53 🔗 SmileyG ghay.
19:54 🔗 SmileyG they like to see themselves as rebels "ooo we block tracking"
19:54 🔗 C-Keen but then if you want to shove the results up theirs they all go "I am only thirteen!"
19:54 🔗 C-Keen or something
19:58 🔗 SmileyG lol;
20:04 🔗 Zym what about simple wget?
20:05 🔗 Zym and a picture of rob? http://www.asallen.btinternet.co.uk/Files/Rob.JPG
20:05 🔗 SketchCow Wget no way.
20:05 🔗 SketchCow We'll be doing WARCs.
20:05 🔗 SketchCow We want these in wayback.archive.org
20:06 🔗 SketchCow joepie, what the fuck are you arguing about, and why is it not in -bs?
20:06 🔗 SketchCow Because it really should be there.
20:06 🔗 SketchCow Damn, Save Rob
20:06 🔗 SketchCow Because 10-1 Rob ain't making it another year
20:08 🔗 joepie91 SketchCow: not really an argument, more a quick remark :P
20:08 🔗 joepie91 but yeah, I'll take it to bs
20:08 🔗 alard Something to be aware of: www.btinternet.com/~$u == www.$u.btinternet.co.uk
20:08 🔗 SmileyG Oh hey its big rob.
20:09 🔗 SmileyG alard: urgh I should have said that, it seemed obvious to me ¬_¬
20:09 🔗 alard Well, it means: 1. any searches (google etc.) for usernames have to be done for both types of urls, and 2. we need to decide which type to download.
20:10 🔗 joepie91 it gets more complicated:
20:10 🔗 joepie91 http://www.sam.gamble.btinternet.co.uk/languages/english/english.inc.php
20:10 🔗 alard Both versions, perhaps.
20:10 🔗 joepie91 note the sam.gamble
20:10 🔗 SmileyG yes
20:10 🔗 alard http://www.btinternet.com/~sam.gamble/
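A minimal sketch of the mapping being described here, assuming the two forms are strictly interchangeable: every username, dots and all (as with sam.gamble), corresponds to one URL of each shape, and the hostname shape can be inverted back to the username.

    import re

    def urls_for_user(username):
        # the two equivalent locations for the same free-hosting account
        return [
            "http://www.btinternet.com/~%s/" % username,
            "http://www.%s.btinternet.co.uk/" % username,
        ]

    def username_from_host(host):
        # invert the hostname form: www.sam.gamble.btinternet.co.uk -> sam.gamble
        m = re.match(r"www\.(.+)\.btinternet\.co\.uk$", host.lower())
        return m.group(1) if m else None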
20:10 🔗 SketchCow My attitude is download both
20:10 🔗 SmileyG so btinternet.com/~sam.gamble?
20:11 🔗 SketchCow All of them.
20:11 🔗 SketchCow We'll keep them separate in dump
20:11 🔗 SketchCow Do some compares
20:12 🔗 joepie91 et presto!
20:12 🔗 joepie91 SketchCow, alard, SmileyG, http://www.btinternet.com/new/content/custhome/
20:12 🔗 joepie91 directory :)
20:12 🔗 joepie91 oh wait
20:12 🔗 joepie91 ah yes
20:12 🔗 joepie91 .com == .co.uk
20:13 🔗 joepie91 so that should have a listing
20:13 🔗 SmileyG [20:39:41] < SmileyG> http://www.btinternet.com/new/content/custhome/listings/A.shtml
20:13 🔗 SmileyG joepie91: those indexes are outta date it seems :(
20:13 🔗 joepie91 :(
20:14 🔗 alard This is a nice username: http://www.c.h.f.btinternet.co.uk/
20:14 🔗 joepie91 mmm
20:15 🔗 joepie91 SmileyG, what are you basing it on that it's outdated?
20:15 🔗 primus lots of them redirect to yahoo login
20:15 🔗 SmileyG the first 10 or so links didn't work for me :<
20:17 🔗 joepie91 mmm
20:17 🔗 Zym well a lot redirect to yahoo login and a lot just 404
20:17 🔗 Zym or rather 500
20:18 🔗 SketchCow I'd like this to be a warrior project, and I want us to fucking DEMOLISH this site
20:18 🔗 SketchCow This is a chance to get a geocities "right"
20:19 🔗 joepie91 mmmm
20:19 🔗 joepie91 anyone up for running a google crawler?
20:19 🔗 alard I'm doing a little bit of google crawling now, I'll stop later. 2200 usernames so far.
20:19 🔗 alard My ipv6 trick no longer works, so it's going slow.
20:19 🔗 joepie91 alard: and you're not getting banned?
20:19 🔗 joepie91 ipv6 trick..?
20:20 🔗 joepie91 also, in case it's useful, http://git.cryto.net/cgit/crytobooks/tree/crawler/crawler.py has some code that should work again with a few fixes
20:20 🔗 joepie91 to crawl google (as well as calibres)
20:20 🔗 joepie91 does pretty well in the not-getting-banned department :P
20:20 🔗 alard It used to be that with a /48 ipv6 subnet you could switch ip addresses between /64 subnets.
20:20 🔗 joepie91 ahh
20:20 🔗 alard So as long as you picked a different /64 subnet each time, you could keep on searching.
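A rough sketch of the trick alard describes (which, as he notes above, no longer worked by this point): assuming a whole /48 (e.g. from tunnelbroker.net) is routed to the querying machine and the OS will let you bind to arbitrary addresses inside it, pick a fresh /64, and a fresh host within it, for each batch of searches, so whatever per-address or per-/64 limit Google applies never accumulates against one subnet. The prefix below is a documentation placeholder, not a real allocation.

    import random
    import ipaddress

    PREFIX = ipaddress.IPv6Network("2001:db8::/48")  # placeholder /48

    def random_source_address(prefix=PREFIX):
        # 16 random bits pick one of the 65536 /64s inside the /48,
        # 64 more bits pick a host inside that /64
        subnet = random.getrandbits(16)
        host = random.getrandbits(64)
        return ipaddress.IPv6Address(
            int(prefix.network_address) | (subnet << 64) | host)

    # bind each outgoing search request's socket to str(random_source_address())
    # so successive queries leave from different /64 subnets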
20:21 🔗 joepie91 hmm... alard: what if you spread the task between warriors?
20:21 🔗 joepie91 the googling
20:21 🔗 alard It would be nice if we could find this newsgroup: btinternet.homepages.announce
20:22 🔗 C-Keen probably on their nntp server
20:22 🔗 alard 1. I don't have a googling warrior task, no infra to handle the results. 2. You'd have a lot of unused time for the warrior.
20:22 🔗 joepie91 also, alard, another option although it may be a bit strange, is downloading database dumps from hacks etc, extracting btinternet e-mail addresses, and checking if a site exists for the username used for the email
20:22 🔗 joepie91 hmm
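A hedged sketch of joepie91's suggestion, not anything the project actually ran: grep whatever text dumps are at hand for @btinternet.com addresses and probe whether a homepage exists for that local part before queueing it.

    import re
    import urllib.request

    EMAIL_RE = re.compile(r"\b([A-Za-z0-9._-]+)@btinternet\.com\b", re.IGNORECASE)

    def candidate_usernames(text):
        return {m.group(1).lower() for m in EMAIL_RE.finditer(text)}

    def site_exists(username):
        req = urllib.request.Request(
            "http://www.btinternet.com/~%s/" % username, method="HEAD")
        try:
            with urllib.request.urlopen(req, timeout=10) as resp:
                # redirects (e.g. to the Yahoo login page) are followed,
                # so also require that we ended up where we started
                return resp.status == 200 and "btinternet.com" in resp.geturl()
        except Exception:
            return False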
20:23 🔗 alard So to make this a warrior project, we'd need to: 1. figure out a good wget command that downloads everything for a user 2. make a warrior project (with rsync upload) 3. have a list of users
20:23 🔗 SmileyG wget -m with it limited to the ~user won't work?
20:24 🔗 alard Yes, something like wget -m.
20:25 🔗 C-Keen list of users is the main thing
20:26 🔗 alard and page-requisites, and something to prevent infinite recursion, and adjust-extension, and ... ?
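A rough sketch of the kind of wget invocation this converges on, with placeholder values rather than the project's real settings; the option list that was actually used ended up in btinternet-grab's pipeline.py, which alard links a little further down.

    import subprocess

    def grab_user(username):
        url = "http://www.btinternet.com/~%s/" % username
        subprocess.check_call([
            "wget",
            "--mirror",                  # recursive download with timestamping
            "--page-requisites",         # also fetch images, CSS, JS the pages use
            "--adjust-extension",
            "--level=20",                # recursion cap, guards against loops
            "--no-parent",               # stay inside this user's ~ directory
            "--warc-file=btinternet-%s" % username,
            url,
        ])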
20:27 🔗 alard SketchCow: Can you make a btinternet thing on fos?
20:27 🔗 alard (We'll need it at some point.)
20:32 🔗 DFJustin I'm sure there's lots of users whose index pages don't link to everything on the account
20:32 🔗 SmileyG DFJustin: in those cases we are screwed afaik
20:33 🔗 DFJustin just grabbing a list of sub-urls from google would probably improve it substantially
20:33 🔗 SmileyG unless google has randomly crawled them due to linking at some point
20:33 🔗 soultcer I'll generate a list of all shorturls that link to btinternet if that helps
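A small sketch of the harvesting soultcer describes: scan any pile of URLs (urlteam dumps, search results, DNS names) for btinternet usernames in either of the two forms.

    import re

    USER_RE = re.compile(
        r"btinternet\.com/~([A-Za-z0-9._-]+)"            # path form
        r"|www\.([A-Za-z0-9._-]+)\.btinternet\.co\.uk",  # hostname form
        re.IGNORECASE)

    def usernames(lines):
        seen = set()
        for line in lines:
            for m in USER_RE.finditer(line):
                user = (m.group(1) or m.group(2)).lower()
                if user not in seen:
                    seen.add(user)
                    yield user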
20:34 🔗 DFJustin there's also http://wayback.archive.org/web/*/http://www.btinternet.com/~shawater/*
20:35 🔗 joepie91 well, yeah... when crawling google for usernames, why not save all the URLs along the way?
20:35 🔗 joepie91 :P
20:36 🔗 SmileyG well you're gonna know the username, but yeah
20:36 🔗 SmileyG add the pages to the list for that username...
20:36 🔗 joepie91 thing is... google seems to pick up URLs from a lot of places
20:37 🔗 joepie91 not just crawling
20:37 🔗 joepie91 I've more than once seen URLs show up in the search engine index simply because it was pasted in google docs, sent via gmail, opened in a browser that uses google safe browsing, ...
20:37 🔗 joepie91 even if we can't find it, maybe google has
20:38 🔗 chronomex google safe browsing ought not submit urls, they use a bloom filter to vastly reduce queries
20:39 🔗 chronomex in most cases the url doesn't leave your machine
20:45 🔗 joepie91 ... I have an idea, one moment
20:46 🔗 joepie91 http://ip.robtex.com/213.123.20.90.html#shared
20:46 🔗 joepie91 well, that's a start
20:48 🔗 joepie91 yeah, looks like all free hosted sites are on 213.123.20.90
20:48 🔗 soultcer Oh, good idea
20:48 🔗 joepie91 literally all of them
20:48 🔗 joepie91 the IPs around it host the FTP server, real domains, and internal servers
20:48 🔗 joepie91 lol @ them being blacklisted
20:49 🔗 joepie91 this has a few as well: http://support.clean-mx.de/clean-mx/viruses.php?ip=213.123.20.90&sort=email%20asc
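A quick sketch of the filter that observation allows, assuming the shared free-hosting IP joepie91 found stays stable: a candidate www.<user>.btinternet.co.uk name is only worth queueing if it resolves to 213.123.20.90.

    import socket

    FREE_HOSTING_IP = "213.123.20.90"

    def is_free_hosted(username):
        host = "www.%s.btinternet.co.uk" % username
        try:
            addrs = {info[4][0] for info in
                     socket.getaddrinfo(host, 80, socket.AF_INET)}
        except socket.gaierror:
            return False
        return FREE_HOSTING_IP in addrs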
21:02 🔗 Wack0 what's going on
21:03 🔗 DFJustin http://bt.custhelp.com/app/answers/detail/a_id/39105/~/the-free-bt-web-hosting-service-is-closing
21:04 🔗 Wack0 oh
21:08 🔗 alard https://github.com/ArchiveTeam/btinternet-grab
21:08 🔗 alard More specifically, what about these wget options? https://github.com/ArchiveTeam/btinternet-grab/blob/master/pipeline.py#L65-86
21:12 🔗 soultcer alard: Seems fine, but wouldn't that fetch everything twice, using both URLs?
21:12 🔗 alard Yes.
21:12 🔗 soultcer Guess that's better if it goes into the wayback machine.
21:12 🔗 soultcer Does wget-warc support dedup?
21:13 🔗 alard A little bit, but not for items with different urls.
21:14 🔗 soultcer How many users do you already have?
21:14 🔗 alard 2250
21:15 🔗 SmileyG :/
21:15 🔗 SmileyG one or two people should be able to bang this out...
21:15 🔗 alard We're going to find more, right?
21:16 🔗 SmileyG I hope so :D
21:16 🔗 soultcer I hope so, I am searching both urlteam data and the domain name system for usernames
21:20 🔗 SmileyG sry guys I'm utterly out of it atm, this cold is really hammering me
21:35 🔗 soultcer alard: http://helo.nodes.soultcer.com/btinternet-dns.txt (usernames for btinternet)
21:38 🔗 alard Thanks. Processed items: 1297
21:38 🔗 alard http://tracker.archiveteam.org/btinternet/
21:45 🔗 soultcer alard: http://helo.nodes.soultcer.com/btinternet-urlteam1.txt (from URLteam this time), I'll have a second list where I scan for the co.uk version of the URL, but I have to go to bed now so I can only send that last one to you tomorrow
21:48 🔗 alard soultcer: Added. Processed items: 960, added to main queue: 621
21:48 🔗 alard (And going to bed is a sensible idea. I might copy that.)
21:49 🔗 soultcer Well, extracting the URLs would take a couple of hours. I'm not going to bed right now, but I will be before the processing is done
21:50 🔗 alard I can give you an account on the tracker so you can add them yourself, if you'd like.
21:50 🔗 alard (Not sure how many more usernames you'll be trying to find.)
21:51 🔗 soultcer Only the ones from the second run over the URLteam data
21:51 🔗 soultcer I have no other data sources available
21:51 🔗 soultcer I am actually quite surprised I even found any hostnames via DNS at all
22:15 🔗 godane underscor: I added another: http://archive.org/details/cdrom-linuxformatmagazine-154
22:16 🔗 godane before you say it's too early to upload: http://linuxformat.com/archives?issue=154
23:27 🔗 SketchCow alard: You have alardland to make new subworlds
23:35 🔗 chronomex I wonder how big soundcloud is
