#webroasting 2020-03-05,Thu


Time Nickname Message
01:50 🔗 wessel151 has quit IRC (Ping timeout: 260 seconds)
06:28 🔗 wessel152 has joined #webroasting
06:32 🔗 wessel152 next list https://transfer.notkiska.pw/UJ5HU/ziggodump2.txt
07:27 🔗 wessel151 has joined #webroasting
08:50 🔗 Ryz has quit IRC (Remote host closed the connection)
08:50 🔗 Ryz has joined #webroasting
08:51 🔗 kiska18 has joined #webroasting
09:36 🔗 kiska has quit IRC (Ping timeout (120 seconds))
09:39 🔗 kiska has joined #webroasting
12:42 🔗 Jeroen has joined #webroasting
12:44 🔗 Jeroen I can see that the Ziggo Netherlands hosting shuts down on 2020-04-01. What would be the best way to archive all of it? You can find plenty of results if you google site:"members.ziggo.nl"
12:59 🔗 Jeroen I'll be going right now, not sure when I'll come back online. If anyone knows the best method to scrape and archive the pages into a WARC file, contact me.
12:59 🔗 Jeroen Discord: Jeroen#2716
12:59 🔗 Jeroen Email: jeroendeneef@gmail.com
12:59 🔗 eythian Jeroen: you can use the tool that does that
12:59 🔗 Jeroen What tool?
13:00 🔗 eythian the name of which escapes me but give me a moment
13:00 🔗 eythian https://github.com/ArchiveTeam/grab-site
13:00 🔗 eythian that one
13:00 🔗 eythian I've used it with URL lists and stuff
13:01 🔗 Jeroen Alright I will, but now I need to get a list of URLs from Google but I'll get that later.
13:01 🔗 Jeroen Thanks for the help!
13:02 🔗 eythian np
13:06 🔗 Jeroen has quit IRC (Ping timeout: 262 seconds)
13:25 🔗 wessel151 Jeroen: I have got a list
19:27 🔗 wessel152 next list https://transfer.notkiska.pw/rnWfO/ziggodump3.txt
19:58 🔗 wessel152 compiled a userlist with help from Jeroen https://transfer.notkiska.pw/c0wHS/ziggousers-3-5-2020.txt
20:03 🔗 wessel152 JAA do you have any tips for me
20:05 🔗 wessel152 i am thinking of splitting the load between us too
20:05 🔗 wessel152 and using https://github.com/ArchiveTeam/grab-site for that
20:06 🔗 JAA 10k users should be doable with AB.
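
The idea of splitting the load of a user list between several people can be sketched roughly as below. This is a minimal illustration, not what anyone in the channel actually ran: the function names `split_round_robin` and `user_url` are hypothetical, and the URL layout (`https://members.ziggo.nl/<username>/`) is an assumption that would need checking against the real site before use.

```python
def split_round_robin(lines, n_parts):
    """Deal lines into n_parts lists in turn, skipping blanks and duplicates."""
    if n_parts < 1:
        raise ValueError("n_parts must be at least 1")
    seen = set()
    parts = [[] for _ in range(n_parts)]
    kept = 0  # number of unique, non-blank entries accepted so far
    for raw in lines:
        item = raw.strip()
        if not item or item in seen:
            continue
        seen.add(item)
        parts[kept % n_parts].append(item)
        kept += 1
    return parts


def user_url(username, base="https://members.ziggo.nl/"):
    """Build a start URL for one user. The base and path layout are assumptions."""
    return base + username + "/"


if __name__ == "__main__":
    # Tiny in-memory stand-in for a user list file, one name per line.
    sample = ["alice\n", "bob\n", "alice\n", "\n", "carol\n"]
    for number, part in enumerate(split_round_robin(sample, 2), start=1):
        print("part", number, [user_url(name) for name in part])
```

Dealing entries out in turn keeps the parts within one entry of each other in size, and the duplicate check means merged lists from several sources do not get crawled twice.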
20:08 🔗 JAA I still think we need a word list like the one I mentioned to get more results.
20:22 🔗 wessel152 if you want to try it here is the small list https://transfer.notkiska.pw/baXEC/dutchwordlistsmall.txt
20:25 🔗 wessel152 and here is the big list https://transfer.notkiska.pw/11vB53/dutchwordlistsbig.txt
20:42 🔗 JAA The small one might be useful, but it's still way too large to be useful for Bing scraping at least.
21:00 🔗 Craigle has joined #webroasting
21:38 🔗 Datechnom has joined #webroasting
21:38 🔗 Datechnom has left The Lounge - https://thelounge.chat
21:41 🔗 Craigle has quit IRC (hub.efnet.us ny.us.hub)
21:41 🔗 yano has quit IRC (hub.efnet.us ny.us.hub)
21:41 🔗 t3 has quit IRC (hub.efnet.us ny.us.hub)
21:41 🔗 hook54321 has quit IRC (hub.efnet.us ny.us.hub)
21:41 🔗 Ctrl-S___ has quit IRC (hub.efnet.us ny.us.hub)
21:41 🔗 tech234a has quit IRC (hub.efnet.us ny.us.hub)
21:41 🔗 AlsoJAA has quit IRC (hub.efnet.us ny.us.hub)
21:41 🔗 mr_archiv has quit IRC (hub.efnet.us ny.us.hub)
21:44 🔗 Craigle has joined #webroasting
21:44 🔗 yano has joined #webroasting
21:44 🔗 t3 has joined #webroasting
21:44 🔗 hook54321 has joined #webroasting
21:44 🔗 Ctrl-S___ has joined #webroasting
21:44 🔗 tech234a has joined #webroasting
21:44 🔗 AlsoJAA has joined #webroasting
21:44 🔗 mr_archiv has joined #webroasting
21:44 🔗 se.hub sets mode: +oo hook54321 AlsoJAA
21:44 🔗 AlsoJAA sets mode: +o JAA
21:45 🔗 JAA sets mode: +o AlsoJAA
21:55 🔗 wessel152 I'm about halfway done with the big one
21:57 🔗 hook54321 sets mode: +o kiska
21:57 🔗 hook54321 sets mode: +o chfoo
21:57 🔗 hook54321 sets mode: +o kiska18
21:58 🔗 Jeroen has joined #webroasting
21:58 🔗 Jeroen I'm back, does grab-site use the hosts' DNS or does it have static IP addresses to DNS servers built-in?
22:05 🔗 JAA Host DNS. So don't run it if you're on a shitty connection where the provider hijacks DNS.
22:06 🔗 JAA wessel152: Is that with Google?
22:12 🔗 wessel152 yes
22:40 🔗 Jeroen @JAA What about a pihole, and is there a way to change the default DNS server?
22:47 🔗 JAA Jeroen: Hmm, yeah, also better not since the archives would be incomplete then. And what I really mean here, by the way, is: feel free to archive with whatever settings you want for your own personal use, but uploading such data to the Internet Archive is a big no-no.
22:49 🔗 JAA I don't think there's a way to directly change the nameserver. I believe it can be done with LD_PRELOAD. Otherwise, you'd have to isolate it in a VM or similar where you can change the nameserver settings freely.
23:01 🔗 wessel152 @JAA is there a way to increase the performative of grab-site
23:17 🔗 JAA wessel152: s/performative/performance/ I guess? Otherwise I have no idea what you mean. I don't think so, except having loads of RAM so that there's less disk I/O. Also, SSDs to shred.
23:19 🔗 JAA Jeroen: You're from NL? Could you translate this list of search terms for me? Specifically the words that would typically be used on such personal websites hosted for free at the ISP. 'family' 'genealogy' 'club' 'society' 'clan' 'company' 'ltd' 'home' 'index' 'wedding' 'school' 'college' 'archive' 'history' 'document' 'church' 'band' 'manual' 'product'
23:21 🔗 JAA (This is a random list of search keywords I came up with while archiving similar websites at TalkTalk, the UK provider.)
23:32 🔗 Jeroen Sure.
23:34 🔗 Jeroen @JAA what would the context for "society" and "clan" be?
23:39 🔗 atphoenix has joined #webroasting
23:42 🔗 Jeroen Also "college" here is more the way that something is taught in university.
23:44 🔗 Jeroen https://pastebin.com/DG22KEQ6
23:46 🔗 Jeroen @wessel152 Can you check the translations for accuracy?
23:54 🔗 Jeroen I should probably get in this channel with my znc.
23:58 🔗 JAA Jeroen: Society = book clubs and similar, clan = video gaming groups. And I guess the equivalent to college would be gymnasium or so.
23:59 🔗 JAA This is by no means a complete list of course, just some keywords of sites I saw a fair bit on the TalkTalk scraping.
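
Combining a keyword list with a `site:` restriction, as discussed above, could look something like the following sketch. The keywords are the English ones JAA listed; the Dutch translations would be substituted in practice. The function name `build_queries` and the exact query format (`site:<host> "<keyword>"`) are assumptions for illustration, and nothing here sends requests to a search engine.

```python
# English keywords from the discussion; Dutch translations would replace these.
KEYWORDS = [
    "family", "genealogy", "club", "society", "clan", "company", "ltd",
    "home", "index", "wedding", "school", "college", "archive", "history",
    "document", "church", "band", "manual", "product",
]


def build_queries(keywords, host="members.ziggo.nl"):
    """Return one search query string per unique keyword, restricted to host."""
    # Normalise case and whitespace, then drop blanks and duplicates
    # while keeping the original order.
    cleaned = (kw.strip().lower() for kw in keywords)
    unique = dict.fromkeys(kw for kw in cleaned if kw)
    return ['site:{} "{}"'.format(host, kw) for kw in unique]


if __name__ == "__main__":
    for query in build_queries(KEYWORDS):
        print(query)
```

A short keyword list like this keeps the number of queries small, which matters because the full word lists were judged too large for search engine scraping.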
