Time |
Nickname |
Message |
01:50
🔗
|
|
wessel151 has quit IRC (Ping timeout: 260 seconds) |
06:28
🔗
|
|
wessel152 has joined #webroasting |
06:32
🔗
|
wessel152 |
next list https://transfer.notkiska.pw/UJ5HU/ziggodump2.txt |
07:27
🔗
|
|
wessel151 has joined #webroasting |
08:50
🔗
|
|
Ryz has quit IRC (Remote host closed the connection) |
08:50
🔗
|
|
Ryz has joined #webroasting |
08:51
🔗
|
|
kiska18 has joined #webroasting |
09:36
🔗
|
|
kiska has quit IRC (Ping timeout (120 seconds)) |
09:39
🔗
|
|
kiska has joined #webroasting |
12:42
🔗
|
|
Jeroen has joined #webroasting |
12:44
🔗
|
Jeroen |
I can see that the Ziggo Netherlands hosting shuts down on 2020-04-01. What would be the best way to archive all of it? You can find plenty of results if you google site:"members.ziggo.nl" |
12:59
🔗
|
Jeroen |
I'll be going right now, not sure when I'll come back online. If anyone knows the best method to scrape and archive the pages into a WARC file, contact me. |
12:59
🔗
|
Jeroen |
Discord: Jeroen#2716 |
12:59
🔗
|
Jeroen |
Email: jeroendeneef@gmail.com |
12:59
🔗
|
eythian |
Jeroen: you can use the tool that does that |
12:59
🔗
|
Jeroen |
What tool? |
13:00
🔗
|
eythian |
the name of which escapes me but give me a moment |
13:00
🔗
|
eythian |
https://github.com/ArchiveTeam/grab-site |
13:00
🔗
|
eythian |
that one |
13:00
🔗
|
eythian |
I've used it with URL lists and stuff |
13:01
🔗
|
Jeroen |
Alright I will, but now I need to get a list of URLs from Google but I'll get that later. |
13:01
🔗
|
Jeroen |
Thanks for the help! |
13:02
🔗
|
eythian |
np |
13:06
🔗
|
|
Jeroen has quit IRC (Ping timeout: 262 seconds) |
13:25
🔗
|
wessel151 |
Jeroen, I have got a list |
19:27
🔗
|
wessel152 |
next list https://transfer.notkiska.pw/rnWfO/ziggodump3.txt |
19:58
🔗
|
wessel152 |
compiled a userlist with help from Jeroen https://transfer.notkiska.pw/c0wHS/ziggousers-3-5-2020.txt |
20:03
🔗
|
wessel152 |
JAA do you have any tips for me |
20:05
🔗
|
wessel152 |
i am thinking of splitting the load between us too |
20:05
🔗
|
wessel152 |
and using https://github.com/ArchiveTeam/grab-site for that |
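Splitting the load could look roughly like this: a minimal Python sketch that deals one list file out into equal shares, one per person. The helper names `split_round_robin` and `write_shares` and the `.partN` file naming are hypothetical and not part of grab-site.

```python
from typing import List


def split_round_robin(items: List[str], parts: int) -> List[List[str]]:
    """Deal items into `parts` lists so each share is about the same size."""
    if parts < 1:
        raise ValueError("parts must be at least 1")
    shares: List[List[str]] = [[] for _ in range(parts)]
    for index, item in enumerate(items):
        shares[index % parts].append(item)
    return shares


def write_shares(list_path: str, parts: int) -> List[str]:
    """Read one entry per line from list_path and write numbered share files."""
    with open(list_path, encoding="utf-8") as handle:
        entries = [line.strip() for line in handle if line.strip()]
    out_paths = []
    for number, share in enumerate(split_round_robin(entries, parts), start=1):
        # Hypothetical naming scheme: <original name>.part1, .part2, ...
        out_path = f"{list_path}.part{number}"
        with open(out_path, "w", encoding="utf-8") as handle:
            handle.write("\n".join(share) + "\n")
        out_paths.append(out_path)
    return out_paths
```

Each person would then feed their own part file to grab-site as a URL list. Dealing entries out round-robin rather than in contiguous blocks keeps the shares similar in size even if the list is sorted.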
20:06
🔗
|
JAA |
10k users should be doable with AB. |
20:08
🔗
|
JAA |
I still think we need a word list like the one I mentioned to get more results. |
20:22
🔗
|
wessel152 |
if you want to try it, here is the small list https://transfer.notkiska.pw/baXEC/dutchwordlistsmall.txt |
20:25
🔗
|
wessel152 |
and here is the big list https://transfer.notkiska.pw/11vB53/dutchwordlistsbig.txt |
20:42
🔗
|
JAA |
The small one might be useful, but it's still way too large to be useful for Bing scraping at least. |
21:00
🔗
|
|
Craigle has joined #webroasting |
21:38
🔗
|
|
Datechnom has joined #webroasting |
21:38
🔗
|
|
Datechnom has left The Lounge - https://thelounge.chat |
21:41
🔗
|
|
Craigle has quit IRC (hub.efnet.us ny.us.hub) |
21:41
🔗
|
|
yano has quit IRC (hub.efnet.us ny.us.hub) |
21:41
🔗
|
|
t3 has quit IRC (hub.efnet.us ny.us.hub) |
21:41
🔗
|
|
hook54321 has quit IRC (hub.efnet.us ny.us.hub) |
21:41
🔗
|
|
Ctrl-S___ has quit IRC (hub.efnet.us ny.us.hub) |
21:41
🔗
|
|
tech234a has quit IRC (hub.efnet.us ny.us.hub) |
21:41
🔗
|
|
AlsoJAA has quit IRC (hub.efnet.us ny.us.hub) |
21:41
🔗
|
|
mr_archiv has quit IRC (hub.efnet.us ny.us.hub) |
21:44
🔗
|
|
Craigle has joined #webroasting |
21:44
🔗
|
|
yano has joined #webroasting |
21:44
🔗
|
|
t3 has joined #webroasting |
21:44
🔗
|
|
hook54321 has joined #webroasting |
21:44
🔗
|
|
Ctrl-S___ has joined #webroasting |
21:44
🔗
|
|
tech234a has joined #webroasting |
21:44
🔗
|
|
AlsoJAA has joined #webroasting |
21:44
🔗
|
|
mr_archiv has joined #webroasting |
21:44
🔗
|
|
se.hub sets mode: +oo hook54321 AlsoJAA |
21:44
🔗
|
|
AlsoJAA sets mode: +o JAA |
21:45
🔗
|
|
JAA sets mode: +o AlsoJAA |
21:55
🔗
|
wessel152 |
I'm about halfway done with the big one |
21:57
🔗
|
|
hook54321 sets mode: +o kiska |
21:57
🔗
|
|
hook54321 sets mode: +o chfoo |
21:57
🔗
|
|
hook54321 sets mode: +o kiska18 |
21:58
🔗
|
|
Jeroen has joined #webroasting |
21:58
🔗
|
Jeroen |
I'm back. Does grab-site use the host's DNS, or does it have static DNS server IP addresses built in? |
22:05
🔗
|
JAA |
Host DNS. So don't run it if you're on a shitty connection where the provider hijacks DNS. |
22:06
🔗
|
JAA |
wessel152: Is that with Google? |
22:12
🔗
|
wessel152 |
yes |
22:40
🔗
|
Jeroen |
@JAA What about a pihole, and is there a way to change the default DNS server? |
22:47
🔗
|
JAA |
Jeroen: Hmm, yeah, also better not since the archives would be incomplete then. And what I really mean here, by the way, is: feel free to archive with whatever settings you want for your own personal use, but uploading such data to the Internet Archive is a big no-no. |
22:49
🔗
|
JAA |
I don't think there's a way to directly change the nameserver. I believe it can be done with LD_PRELOAD. Otherwise, you'd have to isolate it in a VM or similar where you can change the nameserver settings freely. |
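Since grab-site relies on the host's DNS, it may be worth checking the resolver before a crawl. The following is a minimal Python sketch, not a grab-site feature: it looks up a random name that should not exist and reports whether an answer came back anyway. The function names are hypothetical, and the check assumes that a random 32-character `.com` label is unregistered. It only catches rewritten "no such domain" answers, not other kinds of filtering such as a blocklist.

```python
import socket
import uuid
from typing import Callable, Optional


def random_missing_hostname(suffix: str = "com") -> str:
    """Build a random hostname that is assumed (not guaranteed) to be unregistered."""
    return f"{uuid.uuid4().hex}.{suffix}"


def resolver_looks_hijacked(
    resolve: Optional[Callable[[str], object]] = None,
) -> bool:
    """Return True if a name that should not exist still resolves to something."""
    if resolve is None:
        # Default: ask the host's own resolver, the same one grab-site would use.
        resolve = lambda name: socket.getaddrinfo(name, 80)
    try:
        resolve(random_missing_hostname())
    except socket.gaierror:
        return False  # the lookup failed, as it should
    return True  # an answer for a nonexistent name suggests rewritten DNS


if __name__ == "__main__":
    if resolver_looks_hijacked():
        print("Resolver answered for a nonexistent name; DNS may be hijacked.")
    else:
        print("Nonexistent name failed to resolve, as expected.")
```

The `resolve` parameter exists only so the check can be exercised without network access.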
23:01
🔗
|
wessel152 |
@JAA is there a way to increase the performative of grab-site |
23:17
🔗
|
JAA |
wessel152: s/performative/performance/ I guess? Otherwise I have no idea what you mean. I don't think so, except having loads of RAM so that there's less disk I/O. Also, SSDs to shred. |
23:19
🔗
|
JAA |
Jeroen: You're from NL? Could you translate this list of search terms for me? Specifically the words that would typically be used on such personal websites hosted for free at the ISP. 'family' 'genealogy' 'club' 'society' 'clan' 'company' 'ltd' 'home' 'index' 'wedding' 'school' 'college' 'archive' 'history' 'document' 'church' 'band' 'manual' 'product' |
23:21
🔗
|
JAA |
(This is a random list of search keywords I came up with while archiving similar websites at TalkTalk, the UK provider.) |
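Turning such a keyword list into search queries could be sketched as follows in Python. The function name `build_site_queries` is hypothetical, and the `site:host "keyword"` query shape is an assumption about what the search engine accepts. The host and the sample keywords come from this log.

```python
from typing import Iterable, List


def build_site_queries(host: str, keywords: Iterable[str]) -> List[str]:
    """Return one 'site:' query per distinct keyword, keeping input order."""
    seen = set()
    queries = []
    for keyword in keywords:
        term = keyword.strip().lower()
        if not term or term in seen:
            continue  # skip blanks and case-insensitive duplicates
        seen.add(term)
        queries.append(f'site:{host} "{term}"')
    return queries


if __name__ == "__main__":
    # English placeholders; the translated terms would go here instead.
    english_terms = ["family", "genealogy", "club", "wedding", "school"]
    for query in build_site_queries("members.ziggo.nl", english_terms):
        print(query)
```

Deduplicating up front keeps the number of queries down, which matters when the search engine rate-limits scraping.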
23:32
🔗
|
Jeroen |
Sure. |
23:34
🔗
|
Jeroen |
@JAA what would the context for "society" and "clan" be? |
23:39
🔗
|
|
atphoenix has joined #webroasting |
23:42
🔗
|
Jeroen |
Also, "college" here refers more to the way something is taught at university. |
23:44
🔗
|
Jeroen |
https://pastebin.com/DG22KEQ6 |
23:46
🔗
|
Jeroen |
@wessel152 Can you check the translations for accuracy? |
23:54
🔗
|
Jeroen |
I should probably get in this channel with my znc. |
23:58
🔗
|
JAA |
Jeroen: Society = book clubs and similar, clan = video gaming groups. And I guess the equivalent to college would be gymnasium or so. |
23:59
🔗
|
JAA |
This is by no means a complete list of course, just some keywords of sites I saw a fair bit on the TalkTalk scraping. |