Time |
Nickname |
Message |
00:10
🔗
|
|
Sokar has quit IRC (Ping timeout: 258 seconds) |
00:11
🔗
|
|
BlueMax has joined #archiveteam-bs |
00:26
🔗
|
|
Sokar has joined #archiveteam-bs |
00:39
🔗
|
|
wp494 has quit IRC (Quit: LOUD UNNECESSARY QUIT MESSAGES) |
01:05
🔗
|
JAA |
"9th Circuit holds that scraping a public website likely does not violate the CFAA, even after website owner prohibits with a cease-and-desist letter; language strongly suggests CFAA only applies to bypassing authentication." |
01:05
🔗
|
JAA |
https://twitter.com/OrinKerr/status/1171116153948626944 |
01:06
🔗
|
Ryz |
Yes, all the loot, all of it~ |
01:25
🔗
|
Raccoon |
FREE AARON SWARTZ |
01:35
🔗
|
arkiver |
JAA: wooooh awesome! |
01:36
🔗
|
arkiver |
we'll get everything now |
01:37
🔗
|
Raccoon |
everyone's everything. |
02:30
🔗
|
|
Zebranky_ is now known as Zebranky |
03:31
🔗
|
|
qw3rty has joined #archiveteam-bs |
03:39
🔗
|
|
jognsmith has joined #archiveteam-bs |
03:40
🔗
|
|
qw3rty2 has quit IRC (Ping timeout: 745 seconds) |
03:44
🔗
|
|
odemgi_ has joined #archiveteam-bs |
03:45
🔗
|
|
odemg has quit IRC (Read error: Operation timed out) |
03:48
🔗
|
|
odemgi has quit IRC (Read error: Operation timed out) |
04:00
🔗
|
|
odemg has joined #archiveteam-bs |
04:39
🔗
|
|
Quirk8 has quit IRC (END OF LINE) |
04:41
🔗
|
|
Quirk8 has joined #archiveteam-bs |
04:53
🔗
|
|
tuluu has quit IRC (Quit: tuluu) |
04:56
🔗
|
|
tuluu has joined #archiveteam-bs |
05:03
🔗
|
|
larryv has quit IRC (Quit: larryv) |
06:04
🔗
|
|
killsushi has quit IRC (Ping timeout: 255 seconds) |
06:13
🔗
|
|
killsushi has joined #archiveteam-bs |
07:59
🔗
|
Fusl_ |
SketchCow: can you pull those items with warcs out of open source and put them into a separate collection + make sure they are indexed into wbm? |
07:59
🔗
|
Fusl_ |
https://archive.org/search.php?query=archiveteam_sonysketchimg_ |
08:41
🔗
|
|
killsushi has quit IRC (Quit: Leaving) |
08:52
🔗
|
godane |
SketchCow: just to let you know the new SD Times are not in the SD Times Collection you made years ago : https://archive.org/details/sdtimes |
08:53
🔗
|
godane |
example : https://archive.org/details/sdtimes287 |
08:57
🔗
|
|
deevious has quit IRC (Quit: deevious) |
09:06
🔗
|
|
godane has quit IRC (Leaving.) |
09:09
🔗
|
|
Raccoon has quit IRC (Remote host closed the connection) |
10:05
🔗
|
|
godane has joined #archiveteam-bs |
10:35
🔗
|
|
deevious has joined #archiveteam-bs |
11:18
🔗
|
|
VerifiedJ has joined #archiveteam-bs |
11:28
🔗
|
|
BlueMax has quit IRC (Read error: Connection reset by peer) |
12:12
🔗
|
|
ave_ has joined #archiveteam-bs |
12:17
🔗
|
|
DogsRNice has joined #archiveteam-bs |
12:26
🔗
|
|
Dallas has quit IRC (Quit: The Lounge - https://thelounge.chat) |
12:26
🔗
|
|
Dallas has joined #archiveteam-bs |
12:28
🔗
|
|
qw3rty has quit IRC (Ping timeout: 745 seconds) |
12:50
🔗
|
|
qw3rty has joined #archiveteam-bs |
13:09
🔗
|
|
Raccoon has joined #archiveteam-bs |
14:21
🔗
|
|
kiska1 has quit IRC (Remote host closed the connection) |
14:21
🔗
|
|
Ryz has quit IRC (Remote host closed the connection) |
14:22
🔗
|
|
Ryz has joined #archiveteam-bs |
14:22
🔗
|
|
Fusl sets mode: +o Ryz |
14:22
🔗
|
|
kiska1 has joined #archiveteam-bs |
14:22
🔗
|
|
Fusl_ sets mode: +o Ryz |
14:22
🔗
|
|
Fusl__ sets mode: +o Ryz |
14:22
🔗
|
|
Fusl__ sets mode: +o kiska1 |
14:22
🔗
|
|
svchfoo1 sets mode: +o kiska1 |
14:22
🔗
|
|
Fusl sets mode: +o kiska1 |
14:22
🔗
|
|
Fusl_ sets mode: +o kiska1 |
14:27
🔗
|
|
ave_ has quit IRC (Quit: Connection closed for inactivity) |
14:55
🔗
|
|
Raccoon` has joined #archiveteam-bs |
14:58
🔗
|
|
Raccoon has quit IRC (Ping timeout: 360 seconds) |
14:59
🔗
|
|
Raccoon` is now known as Raccoon |
15:44
🔗
|
|
larryv has joined #archiveteam-bs |
15:46
🔗
|
SketchCow |
Fusl_: Why are those not going into archiveteam_inbox |
15:48
🔗
|
SketchCow |
I've gone ahead and moved them. They'll probably go into WBM but I don't know how they do things anymore. |
15:48
🔗
|
SketchCow |
!a http://www.fyfz.cn/ |
15:52
🔗
|
Raccoon |
they should appoint you as president god emperor of WBM |
15:53
🔗
|
|
VADemon has joined #archiveteam-bs |
16:01
🔗
|
SketchCow |
Oh I do not want that job. |
16:03
🔗
|
Raccoon |
isn't that how you got into this |
16:07
🔗
|
ivan_ |
I think SketchCow does computing history archiving, not 'ingest the whole web lol' |
16:07
🔗
|
SketchCow |
I got pulled in to 'preserve software' and it turns out I did a bunch of other shit |
16:08
🔗
|
pnJay |
I thought his primary focus was soy sauce :) |
16:09
🔗
|
Raccoon |
WBM should offer proper search engine results, but under the guise of being a public record archive to bypass european law |
16:10
🔗
|
Raccoon |
all search results are at least 30 seconds old, making it right and proper. |
16:11
🔗
|
ivan_ |
Raccoon is going to donate the petabyte and the expertise to make it happen |
16:12
🔗
|
arkiver |
Awesome :) |
16:12
🔗
|
arkiver |
:P |
16:12
🔗
|
|
RichardG has quit IRC (Ping timeout: 246 seconds) |
16:13
🔗
|
|
K4k has joined #archiveteam-bs |
16:13
🔗
|
Raccoon |
even just old school 2003 google or 1998 Altavista would be nice |
16:14
🔗
|
Raccoon |
as long as I can get search results that aren't pre-filtered for my protection, de-ribbed for my pleasure. |
16:14
🔗
|
arkiver |
ivan_: he doesn't get it :P |
16:16
🔗
|
Raccoon |
WBM is supposed to already have all the page content. just how large would the index be to make it searchable? |
16:17
🔗
|
Raccoon |
and while bsing about this, is there any way to search WBM for all page titles beginning with "Index of" like I used to be able to do with Google up until the last few years |
16:18
🔗
|
Raccoon |
I miss the glory days of 6 to 10 years ago when I was wget'ing every open directory for the sake of filling harddrives |
16:18
🔗
|
ivan_ |
377 billion pages * 10KB of actual text = 3.77PB |
16:19
🔗
|
Raccoon |
can it be indexed? |
16:19
🔗
|
ivan_ |
what do you think indexes are made of, Raccoon |
16:20
🔗
|
* |
Raccoon searches for any ex-HOTBOT employees who might still be alive in 2019 |
16:20
🔗
|
ivan_ |
unless you've got exotic compression schemes it's something like a giant KV of (normalized word) -> a list of pointers to every document that has that word |
16:22
🔗
|
Raccoon |
if 90% of words in a book are structural connector words, we can probably shave it down to just indexing 10% of a page's content. Those words that rank with low popularity |
16:22
🔗
|
Raccoon |
words like 'bukake' or 'palin nudes' |
16:23
🔗
|
Fusl_ |
SketchCow: they were uploaded prior to the existence of the inbox |
16:24
🔗
|
SketchCow |
What a lame excuse |
16:24
🔗
|
SketchCow |
Anyway, all set |
16:24
🔗
|
Raccoon |
also betting a good chunk of that 3.7PB is html tags, tables, scripts, and now css |
16:25
🔗
|
Raccoon |
each page could probably be assigned with just 10 to 100 english index words. |
16:29
🔗
|
SketchCow |
Speaking of BS |
16:30
🔗
|
SketchCow |
So, years ago I made that cute thing that would look at an archivebot item, take out a nice pleasant set of screenshots of the pages, and then post them as .jpg files just so the things looked good. |
16:30
🔗
|
SketchCow |
I'd love to do that again - my concern is I could get my mega-hacky thing working again, but it's probably stupid easy now. |
16:30
🔗
|
SketchCow |
Maybe someone has something lying around - otherwise, I can go find my scripts and get them going. |
16:31
🔗
|
SketchCow |
Example of what I mean: https://archive.org/details/archiveteam_archivebot_go_20150107190002 |
16:33
🔗
|
DogsRNice |
that's really neat |
16:34
🔗
|
PurpleSym |
SketchCow: Something like: google-chrome-stable --headless --disable-gpu --screenshot --window-size=1920,1080 <url> |
16:34
🔗
|
* |
Raccoon thinks he just got Cow shatted #bs :) |
16:34
🔗
|
* |
phillipsj got the soy sauce reference. |
16:34
🔗
|
DogsRNice |
https://ia902302.us.archive.org/7/items/archiveteam_archivebot_go_082/www.nc911truth.org-inf-20140728-030309-p4pky-00000.warc.gz.png |
16:34
🔗
|
DogsRNice |
oh no... |
16:35
🔗
|
SketchCow |
That's why we save them |
16:36
🔗
|
DogsRNice |
yeah i get it |
16:40
🔗
|
VADemon |
> 1MB "preview" screenshot in .png |
16:43
🔗
|
DogsRNice |
i just found one of chipotle's twitter with a swastika on it |
16:43
🔗
|
DogsRNice |
https://ia802603.us.archive.org/35/items/archiveteam_archivebot_go_20150209010002/twitter.com-inf-20150208-022652-a8aok-00000.warc.gz.png |
16:44
🔗
|
DogsRNice |
someone really doesn't like burritos |
16:46
🔗
|
phillipsj |
VADemon, jpg would probably make the text hard to read. |
16:47
🔗
|
|
systwiALT has joined #archiveteam-bs |
16:49
🔗
|
VADemon |
I appreciate the lossless quality but it's a preview. It's not supposed to be larger than the item. (+unoptimized png) |
16:51
🔗
|
|
systwiAL_ has quit IRC (Read error: Operation timed out) |
16:54
🔗
|
|
systwiALT has quit IRC (Read error: Operation timed out) |
17:18
🔗
|
|
VerifiedJ has quit IRC (Quit: Leaving) |
17:32
🔗
|
|
RichardG has joined #archiveteam-bs |
18:26
🔗
|
SketchCow |
PurpleSym: Let me try it |
18:28
🔗
|
SketchCow |
http://teamarchive1.fnf.archive.org/screenshot.png |
18:28
🔗
|
SketchCow |
No fuckin' complaints |
18:41
🔗
|
|
jognsmith has quit IRC (Remote host closed the connection) |
18:45
🔗
|
SketchCow |
I found my warc screenshotter (It's called WEBBERGRABBER) and will now do the work, and thanks to you it'll do screenshots REALLY fast. |
18:45
🔗
|
SketchCow |
So that's appreciated. |
18:54
🔗
|
Sanqui |
I'm excited for more screenshots. |
19:14
🔗
|
|
Jens has quit IRC (Remote host closed the connection) |
19:14
🔗
|
|
Jens has joined #archiveteam-bs |
19:27
🔗
|
SketchCow |
Oops, wiped a script out |
19:27
🔗
|
SketchCow |
Well, luckily it doesn't do much |
19:40
🔗
|
|
ndiddy has quit IRC (Quit: WeeChat 1.4) |
19:41
🔗
|
|
ndiddy has joined #archiveteam-bs |
19:58
🔗
|
SketchCow |
OK, screenshotter's back in business. |
19:58
🔗
|
SketchCow |
http://teamarchive1.fnf.archive.org/WEBGRAB/ |
20:27
🔗
|
|
Stiletto has quit IRC (Read error: Operation timed out) |
20:30
🔗
|
|
Stiletto has joined #archiveteam-bs |
20:58
🔗
|
|
katocala has quit IRC () |
21:06
🔗
|
|
Raccoon has quit IRC (Read error: Connection reset by peer) |
21:09
🔗
|
|
katocala has joined #archiveteam-bs |
21:32
🔗
|
|
kiskabak has quit IRC (Remote host closed the connection) |
21:32
🔗
|
|
kiskabak has joined #archiveteam-bs |
21:32
🔗
|
|
Fusl sets mode: +o kiskabak |
21:32
🔗
|
|
Fusl__ sets mode: +o kiskabak |
21:32
🔗
|
|
Fusl_ sets mode: +o kiskabak |
22:06
🔗
|
|
killsushi has joined #archiveteam-bs |
22:37
🔗
|
|
jognsmith has joined #archiveteam-bs |
22:44
🔗
|
jognsmith |
Hello arkiver :) |
22:44
🔗
|
arkiver |
you said fotolog? |
22:44
🔗
|
arkiver |
https://www.archiveteam.org/index.php?title=Fotolog this? |
22:44
🔗
|
jognsmith |
sorry i meant live spaces, my bad |
22:44
🔗
|
JAA |
https://www.archiveteam.org/index.php?title=Spaces_of_Windows_Live_Spaces_pending_for_download |
22:44
🔗
|
|
Smiley has quit IRC (Read error: Operation timed out) |
22:44
🔗
|
JAA |
(Link in the main chan is broken) |
22:44
🔗
|
arkiver |
alright |
22:44
🔗
|
arkiver |
yeah I saw the page |
22:44
🔗
|
JAA |
The IRC logs don't go back that far. |
22:44
🔗
|
arkiver |
i was confused since he said fotolog |
22:45
🔗
|
arkiver |
I'm not sure if it was saved, which one was yours? |
22:45
🔗
|
arkiver |
(I was not involved in this project) |
22:45
🔗
|
arkiver |
jognsmith: ^ |
22:45
🔗
|
jognsmith |
photosoffmycats.spaces.live.com |
22:46
🔗
|
jognsmith |
(i'd like to ask later about fotolog as well) |
22:46
🔗
|
arkiver |
alright and which one for fotolog? |
22:47
🔗
|
arkiver |
for fotolog.com |
22:48
🔗
|
jognsmith |
wolf_alex |
22:48
🔗
|
arkiver |
ok |
22:48
🔗
|
arkiver |
If we have http://photosoffmycats.spaces.live.com/, I'm not sure where it is |
22:48
🔗
|
arkiver |
perhaps chfoo knows something |
22:48
🔗
|
jognsmith |
oh :c does that mean it's lost? |
22:49
🔗
|
arkiver |
could be |
22:49
🔗
|
arkiver |
looking into the fotolog one now |
22:49
🔗
|
jognsmith |
thank you |
22:49
🔗
|
JAA |
So apparently that list is also part of these grabs: https://www.archiveteam.org/index.php?title=Talk:Windows_Live_Spaces#Phase_2:_Downloading_Hotlists |
22:49
🔗
|
JAA |
Which are "Uploaded, awaiting verification", so at least they were grabbed at some point. |
22:50
🔗
|
JAA |
underscor: According to the wiki page, you were running an FTP server for that project at the time. Do you know anything? |
22:52
🔗
|
jognsmith |
oh! if they were grabbed it could mean good news i guess |
22:52
🔗
|
arkiver |
are you sure it was wolf_alex? |
22:52
🔗
|
arkiver |
so fotolog.com/wolf_alex ? |
22:53
🔗
|
arkiver |
because I can't find it, and it's also not in the list of accounts we archived from fotolog. |
22:53
🔗
|
jognsmith |
yes, it was http://www.fotolog.com/wolf_alex |
22:54
🔗
|
jognsmith |
it probably wouldn't have been grabbed though, i didn't have that many followers |
22:55
🔗
|
arkiver |
I think we discovered users by checking followers, etc. |
22:55
🔗
|
arkiver |
yeah I don't see it in the lists of accounts we archived :/ |
22:55
🔗
|
arkiver |
hopefully there will still be good news on spaces |
22:55
🔗
|
jognsmith |
yeah i guessed so :/ it was a small site |
22:56
🔗
|
jognsmith |
yes! the fact that the link is in the wiki gives me hope |
23:20
🔗
|
SketchCow |
Happy to say the archivebot screenshotter works. |
23:20
🔗
|
SketchCow |
(Just did a full-run.) |
23:21
🔗
|
SketchCow |
Now I'm running it against archivebot in general. |
23:21
🔗
|
JAA |
Nice |
23:26
🔗
|
godane |
so that gaming computer i wanted to get is now back up to $700 |
23:26
🔗
|
godane |
was on sale for $580 |
23:31
🔗
|
|
coderobe has quit IRC (Remote host closed the connection) |
23:38
🔗
|
godane |
SketchCow: what pre-built computer would you get for $600? |
23:38
🔗
|
godane |
i was looking at this but it went back up to $700 : https://www.amazon.com/Dell-Inspiron-Desktop-Processor-Graphics/dp/B07Q3G3B67/ |
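
The 16:18 to 16:25 exchange about making WBM searchable can be illustrated with a rough sketch. It checks ivan_'s size estimate (377 billion pages at 10 KB of text each, using decimal units) and builds the kind of index he describes: a key-value map from normalized word to the set of documents containing it. It also drops common connector words, as Raccoon suggests. Everything here is an assumption for illustration: the function names, the tiny stop-word list, and the sample documents are hypothetical and not taken from any real WBM tooling.

```python
import re
from collections import defaultdict

# Hypothetical, deliberately tiny stop-word list; a real one would be much longer.
STOP_WORDS = frozenset({"the", "a", "an", "of", "and", "to", "in", "is", "it", "that"})


def estimate_text_bytes(pages, bytes_per_page):
    """Back-of-the-envelope size of the raw text: pages times bytes per page."""
    return pages * bytes_per_page


def tokenize(text):
    """Lowercase the text, split it into alphanumeric words, drop stop words."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS]


def build_index(docs):
    """Map each normalized word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents containing every non-stop word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


if __name__ == "__main__":
    # 377 billion pages * 10 KB (decimal) of text each, in petabytes (1 PB = 10**15 bytes).
    print(estimate_text_bytes(377_000_000_000, 10_000) / 10**15, "PB")

    # Made-up sample documents, purely for demonstration.
    docs = {1: "Index of /pub/files", 2: "The index is large", 3: "Soy sauce history"}
    index = build_index(docs)
    print(search(index, "index of"))
```

Dropping stop words shrinks the posting lists, but it also means a phrase such as "Index of" can only be matched on its remaining word, so a title-prefix search like the one Raccoon asks about would need a separate mechanism.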