Time |
Nickname |
Message |
00:26
π
|
SketchCow |
Huh. |
00:31
π
|
godane |
hey SketchCow |
00:33
π
|
BlueMax |
I assume we know about Webshots already? |
00:36
π
|
BlueMax |
http://www.webshots.com/ turning into another service, deleting all user photos that do not conform to said service |
00:56
π
|
SketchCow |
Yeah. I'm thinking we might need to go after this. Maybe. |
00:57
π
|
godane |
closes on December 1, 2012. |
00:57
π
|
godane |
so that may give us some time to decide to grab it |
00:59
π
|
godane |
it uses flash to display images |
01:01
π
|
godane |
i'm grabbing theregister.co.uk by year |
01:02
π
|
godane |
there is also wget.log file |
01:07
π
|
BlueMax |
how many photos does it have, 600 million? |
01:14
π
|
SketchCow |
It says that. |
01:19
π
|
joepie91 |
godane: where do you see it use flash? |
01:20
π
|
godane |
http://www.webshots.com/pro/photo/3334729?path=/artist-kevin-mcneal |
01:21
π
|
godane |
the picture is blocked by my flash blocker |
01:21
π
|
joepie91 |
oh indeed |
01:21
π
|
joepie91 |
that's weird |
01:21
π
|
joepie91 |
ooooo |
01:21
π
|
joepie91 |
that's for the pro part of the site |
01:21
π
|
joepie91 |
the members part doesn't use flash it seems |
01:22
π
|
godane |
ok |
01:22
π
|
joepie91 |
also, the picture URLs are very easy to extract from the page source so that's not a problem |
01:22
π
|
godane |
ok |
01:23
π
|
joepie91 |
SketchCow: want me to have a run and collect as many usernames as possible? |
01:23
π
|
joepie91 |
from webshots |
01:24
π
|
joepie91 |
they seem to have a fairly parse-able index, but it seems limited to showing 10k users per category |
02:15
π
|
joepie91 |
hm. webshots is pretty big. |
02:20
π
|
Nintendud |
how i shot web |
02:34
π
|
joepie91 |
hmhm: http://i.imgur.com/X7qwe.png |
02:38
π
|
joepie91 |
well, looks like it started fetching users |
02:38
π
|
joepie91 |
http://i.imgur.com/zlFdz.png |
02:41
π
|
swagstaff |
Y'all on top of this? http://www.buzzfeed.com/katienotopoulos/your-internet-photos-are-already-starting-to-die |
02:43
π
|
swagstaff |
"... However, buried deep within their http://help.getsmileapp.com/customer/portal/articles/708519-what-if-i-don%E2%80%99t-do-anything- is the bad news. The bad news is that if you don't log into your old Webshots account and confirm it as a new Smile account, all your photos will be deleted. ...." |
02:44
π
|
joepie91 |
swagstaff: http://i.imgur.com/zlFdz.png :) |
02:44
π
|
joepie91 |
I'm not sure if there are any plans to archive everything |
02:44
π
|
joepie91 |
but I'm already generating a list of all users, just in case |
02:47
π
|
swagstaff |
Glad to see the Team is Ever Alert. Good luck if you decide to archive. |
02:49
π
|
joepie91 |
alright, time to sleep... with a bit of luck it's done gathering usernames by tomorrow :D |
02:50
π
|
Nintendud |
joepie91: nice. |
02:50
π
|
Nintendud |
my warrior has been bored recently. |
02:52
π
|
joepie91 |
heh |
02:52
π
|
joepie91 |
also, not that it's much use since it's not really distributed, but if anyone wants the source of said script, git clone http://git.cryto.net/repo/projects/joepie91/webshots |
02:53
π
|
joepie91 |
very hacky and simple, but it works :P |
02:57
π
|
arkhive |
690 million photos apparently. |
02:58
π
|
joepie91 |
yup |
02:59
π
|
joepie91 |
time for sleep |
02:59
π
|
joepie91 |
goodnight all |
03:28
π
|
* |
DFJustin anticipates the cathedral of butthurt as photographers realize that flash doesn't keep you from saving the photos |
05:18
π
|
SketchCow |
I was wrong. |
05:18
π
|
SketchCow |
By the way. |
05:18
π
|
SketchCow |
Wayback machine has indexed 186 billion webpages. |
05:18
π
|
SketchCow |
And is expecting to do 240 billion. |
05:18
π
|
SketchCow |
Billion. |
05:27
π
|
DFJustin |
(Γ’ΒΒΓ¡ãΒΒΓ’ΒΒΓ―ΒΎΒ)Γ―ΒΎΒ |
05:40
π
|
BlueMax |
I think I just crapped my pants at that number |
06:04
π
|
Nemo_bis |
too bad Google no longer has that childish count of indexed pages |
06:07
π
|
Nemo_bis |
hmm "In 2012, Google has indexed over 30 trillion web pages, 100 billion queries per month" |
09:35
π
|
alard |
I've started a Webshots page on the wiki: http://archiveteam.org/index.php?title=Webshots |
09:36
π
|
alard |
(There are some nice comments on this "photo of the day": http://travel.webshots.com/photo/2248078140105543869vCJpvs ) |
11:28
π
|
alard |
https://github.com/ArchiveTeam/webshots-grab/ |
11:29
π
|
SmileyG |
oooo code |
11:30
π
|
* |
SmileyG looks and attempts to learn lua in 10 minutes before giving up |
11:33
π
|
SmileyG |
yup, I have no clue wtf that does :< |
11:36
π
|
joepie91 |
All photos will be removed by December 1. Until then, you may use message boards and you may search, browse and view images, but you won't be able to upload or download images. |
11:36
π
|
joepie91 |
nasty. |
11:38
π
|
joepie91 |
"hi, we're going to close down the site and tell you in advance but you can't download anything anymore by the time you're aware of it" |
11:38
π
|
joepie91 |
not that flash is a terribly good protection, but ok :P |
11:38
π
|
SmileyG |
CHALLENGE ACCEPTED! |
11:42
π
|
joepie91 |
lol |
11:44
π
|
SmileyG |
hmmm |
11:44
π
|
SmileyG |
is it a sign that I accidentally named the script webshits.sh? |
11:45
π
|
joepie91 |
hahaha |
11:46
π
|
joepie91 |
mmm |
11:46
π
|
joepie91 |
AT wiki frontpage really needs an update |
12:35
π
|
joepie91 |
http://i.imgur.com/EMv89.png |
12:36
π
|
joepie91 |
found about 950k users so far |
12:36
π
|
joepie91 |
re: webshots |
12:36
π
|
SmileyG |
joepie91: how? :o |
12:37
π
|
joepie91 |
SmileyG: I started out by crawling all 'top members' for all categories |
12:37
π
|
joepie91 |
script just parses all usernames out of the page, deduplicates |
12:37
π
|
SmileyG |
hmmm |
12:37
π
|
joepie91 |
and writes the whole list to a file |
12:37
π
|
joepie91 |
it's been running for a night or so now |
12:37
π
|
SmileyG |
nice, where are these scripts :S |
12:37
π
|
joepie91 |
I estimate it's through about half of the users now |
12:38
π
|
joepie91 |
(it just collects usernames, nothing else, though) |
12:38
π
|
joepie91 |
git clone http://git.cryto.net/repo/projects/joepie91/webshots |
12:38
π
|
SmileyG |
ahhh |
12:38
π
|
joepie91 |
it's a stupidly simple script though :P |
12:38
π
|
joepie91 |
(regex is fun!) |
12:39
π
|
SmileyG |
:/ |
12:40
π
|
SmileyG |
I'm trying to understand this and i just urgh. |
12:41
π
|
joepie91 |
SmileyG: what part are you having trouble understanding? |
12:41
π
|
SmileyG |
actually your code kind of makes sense |
12:41
π
|
SmileyG |
but I just don't know how I'd ever write it :< |
12:41
π
|
joepie91 |
what it basically does is this: |
12:42
π
|
joepie91 |
retrieve community index, find all "top members" links, retrieve all those links, and for each of them find all pagination links |
12:42
π
|
joepie91 |
then for every page of every category ('top members' link), it finds all user-page URLs in the page |
12:42
π
|
joepie91 |
and extracts the username from that |
12:42
π
|
joepie91 |
that's pretty much it |
12:43
π
|
joepie91 |
structure may seem a bit odd because I'm trying to prevent it from loading the first page of a category twice |
12:43
π
|
joepie91 |
so the first page (which is the destination of the 'top members' link) is added separately |
12:43
π
|
joepie91 |
to the 'queue' |
12:43
π
|
joepie91 |
if I hadn't done that, it would've sufficed to just use a few nested foreach loops and I'd be done |
12:43
π
|
joepie91 |
:P |
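The crawl joepie91 describes above can be sketched roughly as follows. This is a minimal illustration, not his script: the URL patterns in the three regular expressions, the function name `collect_usernames` and the injected `fetch` callable are all assumptions, since the log shows neither the Webshots markup nor the script's source. The category link is treated as page one and parsed as soon as it is fetched, so only the remaining pagination links are queued and no page is loaded twice.

```python
import re
from typing import Callable, List, Set

# Hypothetical patterns: the real Webshots markup is not shown in the log.
TOP_MEMBERS_RE = re.compile(r'href="(/top-members/[^"?]+)"')
PAGINATION_RE = re.compile(r'href="(/top-members/[^"?]+\?page=\d+)"')
USER_RE = re.compile(r'href="/user/([A-Za-z0-9_-]+)"')


def collect_usernames(fetch: Callable[[str], str], index_url: str) -> List[str]:
    """Crawl every 'top members' category; return sorted, deduplicated usernames."""
    usernames: Set[str] = set()
    seen_urls: Set[str] = {index_url}

    index_html = fetch(index_url)
    for category_url in TOP_MEMBERS_RE.findall(index_html):
        if category_url in seen_urls:
            continue
        seen_urls.add(category_url)

        # The 'top members' link is itself page one of the category, so it is
        # parsed here and only the other pages go into the queue.
        first_page = fetch(category_url)
        usernames.update(USER_RE.findall(first_page))

        queue = [u for u in PAGINATION_RE.findall(first_page) if u not in seen_urls]
        for page_url in queue:
            if page_url in seen_urls:
                continue
            seen_urls.add(page_url)
            usernames.update(USER_RE.findall(fetch(page_url)))

    return sorted(usernames)


def write_usernames(usernames: List[str], path: str) -> None:
    """Write one username per line, as in the users.txt file mentioned in the log."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(usernames) + "\n")
```

Passing `fetch` in as a parameter keeps the parsing logic testable without network access; in a real run it would wrap an HTTP client of your choice.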
12:56
π
|
SmileyG |
:/ archive.org down :? |
13:10
π
|
joepie91 |
works for me |
13:59
π
|
joepie91 |
whoop |
13:59
π
|
joepie91 |
986098 |
13:59
π
|
joepie91 |
root@aarnist:~/webshots/webshots# cat users.txt | wc -l |
14:00
π
|
joepie91 |
seems it's done :) |
14:00
π
|
joepie91 |
a remark though: |
14:00
π
|
joepie91 |
there were *very* few duplicates |
14:00
π
|
joepie91 |
that makes me think that this is really only a portion of the webshots users |
14:00
π
|
joepie91 |
every category's "top users" listing only shows 100 pages max |
14:01
π
|
joepie91 |
SmileyG, SketchCow, any suggestions as to how to get more usernames? |
14:04
π
|
joepie91 |
yeah, I was afraid of this: https://www.google.nl/search?sugexp=chrome,mod=8&sourceid=chrome&ie=UTF-8&q=site%3Acommunity.webshots.com+inurl%3Auser |
14:04
π
|
joepie91 |
about 11.400.000 results |
14:04
π
|
joepie91 |
that's about 11 times as much as I have now |
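The comparison above is simple arithmetic on the two numbers quoted in the log: 986098 collected usernames against roughly 11,400,000 search results. A small sketch of that calculation follows; the helper name `coverage` is hypothetical, and a search engine's result count is only a rough estimate, so the outcome is an order-of-magnitude guide at best.

```python
from typing import Tuple


def coverage(found: int, estimated_total: int) -> Tuple[float, float]:
    """Return (how many times larger the estimate is, fraction already collected)."""
    if found <= 0 or estimated_total <= 0:
        raise ValueError("both counts must be positive")
    return estimated_total / found, found / estimated_total


# Numbers taken from the log: line count of users.txt and the search result count.
ratio, fraction = coverage(986_098, 11_400_000)
print(f"estimate is about {ratio:.1f}x the list; the list covers about {fraction:.1%}")
```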
14:07
π
|
jiphex |
Safari is quite clever. If you open a link like google.com/search?q=something%20something - it parses the query and puts the query into the google search address bar thing as if you'd typed it yourself |
14:15
π
|
SmileyG |
hmmm |
14:15
π
|
SmileyG |
not really |
14:15
π
|
SmileyG |
POST. |
14:15
π
|
SmileyG |
:D |
14:17
π
|
jiphex |
erm, wrong channel :/ |
14:22
π
|
joepie91 |
SmileyG: http://aarnist.cryto.net:81/webshots/users.txt |
14:22
π
|
joepie91 |
also, SmileyG, #webshots |
14:22
π
|
alard |
(and everyone else is welcome too, of course) |
14:23
π
|
joepie91 |
:P |
17:34
π
|
alard |
SketchCow: Can you make a webshots rsync area on fos? |
17:46
π
|
SketchCow |
Yes. |
17:46
π
|
SketchCow |
How big do we think this is going to be? Probably big, huh. |
17:48
π
|
alard |
Big, yes. |
17:49
π
|
SketchCow |
I'm in that channel, let's discuss it there. |
19:53
π
|
SketchCow |
Cool thing. |
19:53
π
|
SketchCow |
Someone has donated a massive pile of recorded-off-vcr news programs from the 1980s. |
19:53
π
|
SketchCow |
So yeah |
19:55
π
|
SketchCow |
I'm packing up the first range of Cinch and putting it on the archive. |
19:55
π
|
SketchCow |
446gb of audio! |
19:55
π
|
joepie91 |
Cinch? |
19:56
π
|
chronomex |
three pounds of flax! |
20:05
π
|
SketchCow |
Cinch.FM, one of the sillier shutdowns. |
20:11
π
|
alard |
SketchCow: Keep in mind that this is not the main website, but the discussion boards: http://archive.org/details/archiveteam-city-of-heroes-main |
20:12
π
|
alard |
I have a copy of the City of Heroes website (the documentation), but I haven't uploaded that yet. |
20:41
π
|
joepie91 |
in other news: minus is basically dead, they only allow multimedia uploads now and want to move away from general purpose file storage |
20:42
π
|
chronomex |
pirated movies ONLY |
20:43
π
|
joepie91 |
lol |
20:43
π
|
Aranje |
truth |
20:43
π
|
chronomex |
worked for megaupload, right? |
20:43
π
|
chronomex |
that guy made -piles- of money |
20:44
π
|
Aranje |
It's sounding like he'll get to keep it too, pretty shortly |
20:44
π
|
Aranje |
on a long enough (or short enough!) timescale, minus will be a raging success :D |
20:45
π
|
godane |
i'm still trying to find old webuser magazines for my "collection" |
20:45
π
|
godane |
some of the missing issues only link to dead file servers now |
20:46
π
|
godane |
or they're not on their servers anymore |
20:47
π
|
joepie91 |
hai Aranje |
20:47
π
|
joepie91 |
long time no speak |
20:47
π
|
joepie91 |
also #webshots |
20:47
π
|
joepie91 |
:P |
20:57
π
|
SketchCow |
So, to explain - I just call it the "Main" city of heroes grab because we want to do one last supplemental grab later. |
20:57
π
|
SketchCow |
That's what I meant. |
20:59
π
|
alard |
SketchCow: Yes, I thought so. Shall I rsync you the warcs of the main website so you can add them to the item? (The 'alard' rsync space on fos is gone.) |
21:18
π
|
SketchCow |
How big? |
21:26
π
|
alard |
3.6GB |
21:28
π
|
SketchCow |
I guess I need an alard place. |
21:28
π
|
SketchCow |
Let me get that going. |
21:57
π
|
oli |
joepie91: ... |
21:57
π
|
joepie91 |
lol |
21:57
π
|
joepie91 |
hai |
21:57
π
|
oli |
i got lots of capacity in LA and the hetzner box |
21:57
π
|
joepie91 |
see #webshots |
21:57
π
|
joepie91 |
:P |