#archiveteam-bs 2018-11-09,Fri


Time Nickname Message
00:00 πŸ”— godane has quit IRC (Quit: Leaving.)
00:16 πŸ”— SketchCow Glad to see all the editing going on on the Wiki
00:19 πŸ”— twoTBHetz has joined #archiveteam-bs
00:19 πŸ”— kyounko has joined #archiveteam-bs
00:22 πŸ”— astrid twoTBHetz: it is in the /topic
00:24 πŸ”— twoTBHetz I found my discussion not "long" or "lengthy", so I did not think I was in the wrong place
00:25 πŸ”— twoTBHetz If the consensus is that i was wrong (which it seems) i will do better in the future but for somebody who is new here it is not easy to infer those rules.
00:26 πŸ”— twoTBHetz astrid have i overlooked anything?
00:27 πŸ”— * astrid sighs
00:27 πŸ”— astrid there are 230 people in that channel
00:27 πŸ”— astrid we try to keep discussions to a half dozen lines or so
00:28 πŸ”— astrid don't take it personally
00:28 πŸ”— astrid make sense?
00:30 πŸ”— twoTBHetz Thanks for the explanation. I am used to smaller channels.
00:31 πŸ”— Sk1d has quit IRC (Read error: Operation timed out)
00:34 πŸ”— Sk1d has joined #archiveteam-bs
01:06 πŸ”— godane has joined #archiveteam-bs
01:08 πŸ”— Kaz -bs alarm goes off in my head at around 3 lines
01:10 πŸ”— twoTBHetz Kaz what do you mean by that?
01:13 πŸ”— Kaz More than 3 lines in #archiveteam is when conversation needs to move to -bs, imo
01:14 πŸ”— Kaz Context is everything, though
01:18 πŸ”— Ryz Woo, my efforts have been paying off! o:
01:21 πŸ”— Sk1d has quit IRC (Read error: Operation timed out)
01:26 πŸ”— Sk1d has joined #archiveteam-bs
01:29 πŸ”— SketchCow Kudos to twoTBHetz for using -bs as an appeals court for the #archiveteam channel
01:37 πŸ”— Martle has joined #archiveteam-bs
01:40 πŸ”— Sk1d has quit IRC (Read error: Operation timed out)
01:42 πŸ”— Sk1d has joined #archiveteam-bs
01:43 πŸ”— VADemon has joined #archiveteam-bs
01:43 πŸ”— VerifiedJ has quit IRC (Quit: Leaving)
01:56 πŸ”— Sk1d has quit IRC (Read error: Operation timed out)
01:59 πŸ”— Sk1d has joined #archiveteam-bs
02:00 πŸ”— Ryz has quit IRC (Quit: ChatZilla 0.9.92-rdmsoft [XULRunner 35.0.1/20150122214805])
02:42 πŸ”— BlueMax has joined #archiveteam-bs
03:08 πŸ”— Sk1d has quit IRC (Read error: Operation timed out)
03:09 πŸ”— dashcloud has quit IRC (Remote host closed the connection)
03:10 πŸ”— dashcloud has joined #archiveteam-bs
03:11 πŸ”— Sk1d has joined #archiveteam-bs
03:19 πŸ”— bithippo has joined #archiveteam-bs
03:23 πŸ”— Sk1d has quit IRC (Read error: Operation timed out)
03:26 πŸ”— Sk1d has joined #archiveteam-bs
04:20 πŸ”— Sk1d has quit IRC (Read error: Operation timed out)
04:23 πŸ”— Sk1d has joined #archiveteam-bs
04:39 πŸ”— Sk1d has quit IRC (Read error: Operation timed out)
04:42 πŸ”— qw3rty115 has joined #archiveteam-bs
04:43 πŸ”— Sk1d has joined #archiveteam-bs
04:48 πŸ”— qw3rty114 has quit IRC (Ping timeout: 600 seconds)
04:57 πŸ”— odemg has quit IRC (Ping timeout: 246 seconds)
05:05 πŸ”— Ryz has joined #archiveteam-bs
05:10 πŸ”— odemg has joined #archiveteam-bs
05:11 πŸ”— Sk1d has quit IRC (Read error: Operation timed out)
05:14 πŸ”— Sk1d has joined #archiveteam-bs
05:16 πŸ”— kyounko has quit IRC (Read error: Connection reset by peer)
05:26 πŸ”— mgrytbak_ has quit IRC (Read error: Connection reset by peer)
05:30 πŸ”— Sk1d has quit IRC (Read error: Operation timed out)
05:35 πŸ”— Sk1d has joined #archiveteam-bs
05:42 πŸ”— adinbied has quit IRC (Read error: Operation timed out)
05:46 πŸ”— Sk1d has quit IRC (Read error: Operation timed out)
05:48 πŸ”— mgrytbak^ has joined #archiveteam-bs
05:50 πŸ”— adinbied has joined #archiveteam-bs
05:52 πŸ”— Sk1d has joined #archiveteam-bs
06:04 πŸ”— Sk1d has quit IRC (Read error: Operation timed out)
06:06 πŸ”— Ryz has quit IRC (hub.efnet.us west.us.hub)
06:06 πŸ”— VADemon has quit IRC (hub.efnet.us west.us.hub)
06:06 πŸ”— Mateon1 has quit IRC (hub.efnet.us west.us.hub)
06:06 πŸ”— twoTBHetz has quit IRC (hub.efnet.us west.us.hub)
06:06 πŸ”— achip has quit IRC (hub.efnet.us west.us.hub)
06:06 πŸ”— me has quit IRC (Read error: Operation timed out)
06:09 πŸ”— me has joined #archiveteam-bs
06:10 πŸ”— Sk1d has joined #archiveteam-bs
06:11 πŸ”— Ryz has joined #archiveteam-bs
06:11 πŸ”— VADemon has joined #archiveteam-bs
06:11 πŸ”— twoTBHetz has joined #archiveteam-bs
06:11 πŸ”— Mateon1 has joined #archiveteam-bs
06:11 πŸ”— achip has joined #archiveteam-bs
06:13 πŸ”— DFJustin has quit IRC (Ping timeout: 260 seconds)
06:15 πŸ”— wp494 has quit IRC (Ping timeout: 265 seconds)
06:15 πŸ”— wp494 has joined #archiveteam-bs
06:20 πŸ”— icedice has quit IRC (Quit: Leaving)
06:25 πŸ”— DFJustin has joined #archiveteam-bs
06:25 πŸ”— swebb sets mode: +o DFJustin
06:56 πŸ”— Stiletto has joined #archiveteam-bs
06:59 πŸ”— Sk1d has quit IRC (Read error: Operation timed out)
07:00 πŸ”— Stilett0 has quit IRC (Read error: Operation timed out)
07:04 πŸ”— Sk1d has joined #archiveteam-bs
07:33 πŸ”— Sk1d has quit IRC (Read error: Operation timed out)
07:35 πŸ”— twoTBHetz has quit IRC (Read error: Operation timed out)
07:37 πŸ”— Sk1d has joined #archiveteam-bs
07:51 πŸ”— Sk1d has quit IRC (Read error: Operation timed out)
07:55 πŸ”— Sk1d has joined #archiveteam-bs
08:04 πŸ”— twoTBHetz has joined #archiveteam-bs
08:54 πŸ”— hiroi has joined #archiveteam-bs
09:46 πŸ”— twoTBHetz has quit IRC (Read error: Operation timed out)
09:57 πŸ”— twoTBHetz has joined #archiveteam-bs
10:36 πŸ”— Ryz has quit IRC (Quit: ChatZilla 0.9.92-rdmsoft [XULRunner 35.0.1/20150122214805])
11:18 πŸ”— betamax OK, so I have some lists of videos from campaign sites
11:18 πŸ”— betamax more specifically, lists of URLs that contain the word 'youtube'
11:18 πŸ”— betamax so quite a few might not actually be youtube
11:18 πŸ”— betamax https://transfer.sh/xjK12/midterm-2018-youtube.txt
11:19 πŸ”— betamax note that some of the urls start with '//' because the 'https:' was filtered off but not the '//'
11:19 πŸ”— betamax sorry about that
11:19 πŸ”— betamax oh, and here's one for vimeo:
11:19 πŸ”— betamax https://transfer.sh/WQYAs/midterm-2018-vimeo.txt
11:24 πŸ”— betamax cc ivan ^^
11:29 πŸ”— betamax oh, you may want to dedupe those files
11:29 πŸ”— betamax just checked, and a lot are duplicates
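The two cleanup steps mentioned above (restoring the scheme on URLs that start with '//' and removing duplicates) could be sketched roughly as follows. This is only an illustration: the function name `normalize_urls` is hypothetical, and it assumes the stripped scheme was always `https:`, as betamax describes.

```python
def normalize_urls(lines):
    """Return unique URLs in first-seen order, restoring a scheme on '//' URLs."""
    seen = set()
    result = []
    for line in lines:
        url = line.strip()
        if not url:
            continue  # skip blank lines
        if url.startswith("//"):
            # Assumption: the filtered-off prefix was "https:".
            url = "https:" + url
        if url not in seen:
            seen.add(url)
            result.append(url)
    return result


if __name__ == "__main__":
    sample = [
        "//www.youtube.com/watch?v=a",
        "https://www.youtube.com/watch?v=a",
        "https://vimeo.com/1",
    ]
    for url in normalize_urls(sample):
        print(url)
```

Normalizing before deduplicating matters here, since the same video may appear once with and once without the scheme.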
11:29 πŸ”— PurpleSym betamax: The chromebot grab for Facebook is done. Working on Twitter right now, but it'll take a few days.
11:30 πŸ”— betamax great! I got all the tweets into archivebot as well
11:33 πŸ”— Flashfire I may have footage of the stabbing that happened in Melbourne but the person who has it won't send it to me so it will be screen recorded
11:33 πŸ”— betamax here's some more urls containing 'twitter' (note ~75MB) https://transfer.sh/b2Vco/midterm-2018-twitter-extract.txt
11:33 πŸ”— Sk1d has quit IRC (Read error: Operation timed out)
11:37 πŸ”— Sk1d has joined #archiveteam-bs
11:51 πŸ”— Sk1d has quit IRC (Read error: Operation timed out)
11:52 πŸ”— twoTBHetz has quit IRC (Read error: Operation timed out)
11:54 πŸ”— Sk1d has joined #archiveteam-bs
12:03 πŸ”— twoTBHetz has joined #archiveteam-bs
12:06 πŸ”— Sk1d has quit IRC (Read error: Operation timed out)
12:12 πŸ”— Sk1d has joined #archiveteam-bs
12:16 πŸ”— twoTBHetz has quit IRC (Read error: Operation timed out)
12:23 πŸ”— BlueMax has quit IRC (Read error: Connection reset by peer)
12:26 πŸ”— twoTBHetz has joined #archiveteam-bs
12:26 πŸ”— Sk1d has quit IRC (Read error: Operation timed out)
12:29 πŸ”— Sk1d has joined #archiveteam-bs
13:15 πŸ”— Gtyy has joined #archiveteam-bs
13:15 πŸ”— Gtyy has quit IRC (Client Quit)
13:49 πŸ”— godane SketchCow: latest scan : https://archive.org/details/pc-computing-magazine-v9i2
14:00 πŸ”— vitzli has joined #archiveteam-bs
14:29 πŸ”— Mateon1 has quit IRC (Read error: Connection reset by peer)
14:29 πŸ”— Mateon1 has joined #archiveteam-bs
14:44 πŸ”— Sk1d has quit IRC (Read error: Operation timed out)
14:46 πŸ”— Sk1d has joined #archiveteam-bs
15:01 πŸ”— Sk1d has quit IRC (Read error: Operation timed out)
15:04 πŸ”— Sk1d has joined #archiveteam-bs
15:28 πŸ”— bithippo has quit IRC (Textual IRC Client: www.textualapp.com)
15:47 πŸ”— DFJustin has quit IRC (Ping timeout: 260 seconds)
15:48 πŸ”— Sk1d has quit IRC (Read error: Operation timed out)
15:50 πŸ”— schbirid has joined #archiveteam-bs
15:50 πŸ”— Sk1d has joined #archiveteam-bs
15:51 πŸ”— ivan thanks for the new list of youtube, I am processing it
15:52 πŸ”— ivan betamax: someone else can handle vimeo better than I can
15:59 πŸ”— DFJustin has joined #archiveteam-bs
15:59 πŸ”— swebb sets mode: +o DFJustin
16:03 πŸ”— Sk1d has quit IRC (Read error: Operation timed out)
16:04 πŸ”— balrog has anyone done any looking into mail-archive.com?
16:07 πŸ”— Sk1d has joined #archiveteam-bs
16:12 πŸ”— twoTBHetz has quit IRC (Read error: Operation timed out)
16:14 πŸ”— LFlare has quit IRC (Quit: The Lounge - https://thelounge.chat)
16:22 πŸ”— twoTBHetz has joined #archiveteam-bs
16:24 πŸ”— VerifiedJ has joined #archiveteam-bs
16:32 πŸ”— vitzli has quit IRC (Quit: Leaving)
16:32 πŸ”— godane i'm reuploading v9i2 of pc computing because one of the pages was not flipped
16:33 πŸ”— godane i also deleted the derived files because new derived files were not being made for some reason, from what i can tell
16:37 πŸ”— jspiros has quit IRC (leaving)
16:48 πŸ”— chfoo has quit IRC (Read error: Operation timed out)
16:50 πŸ”— chfoo has joined #archiveteam-bs
16:55 πŸ”— Martle has quit IRC (Quit: Leaving)
17:05 πŸ”— Sk1d has quit IRC (Read error: Operation timed out)
17:08 πŸ”— Sk1d has joined #archiveteam-bs
17:12 πŸ”— jspiros has joined #archiveteam-bs
17:37 πŸ”— Martle has joined #archiveteam-bs
17:37 πŸ”— vectr0n has quit IRC (Quit: ZNC - https://znc.in)
17:43 πŸ”— Martle has quit IRC (Read error: Connection reset by peer)
17:44 πŸ”— Martle has joined #archiveteam-bs
17:51 πŸ”— vectr0n has joined #archiveteam-bs
18:05 πŸ”— Ryz has joined #archiveteam-bs
18:07 πŸ”— Sk1d has quit IRC (Read error: Operation timed out)
18:09 πŸ”— Sk1d has joined #archiveteam-bs
18:12 πŸ”— twoTBHetz Ryz, what can be done about that?
18:13 πŸ”— Ryz Archive the website it has?
18:13 πŸ”— Ryz Prima Games has a website - https://www.primagames.com/
18:14 πŸ”— Ryz It looks different than the last time I even bothered to look at the website
18:18 πŸ”— godane i'm reuploading v9i3 of pc computing cause i forgot to black out the address on cover
18:21 πŸ”— Sk1d has quit IRC (Read error: Operation timed out)
18:26 πŸ”— Sk1d has joined #archiveteam-bs
18:32 πŸ”— Ryz Also, maybe archiving Prima Games guides
18:32 πŸ”— Ryz Those strategy guides specifically
18:40 πŸ”— Kenshin has quit IRC (Ping timeout: 260 seconds)
18:40 πŸ”— Kenshin has joined #archiveteam-bs
18:48 πŸ”— Sk1d has quit IRC (Read error: Operation timed out)
18:51 πŸ”— twoTBHetz It seems to be quite well archived ... the starting page at least
18:54 πŸ”— Sk1d has joined #archiveteam-bs
19:07 πŸ”— Sk1d has quit IRC (Read error: Operation timed out)
19:12 πŸ”— Sk1d has joined #archiveteam-bs
19:28 πŸ”— Sk1d has quit IRC (Read error: Operation timed out)
19:31 πŸ”— betamax so I've been thinking about Hostinger
19:32 πŸ”— betamax I'm wondering about the possibility of doing a warrior project
19:32 πŸ”— Sk1d has joined #archiveteam-bs
19:35 πŸ”— betamax this would probably depend on discovering a sufficient number of subdomains
19:35 πŸ”— betamax twoTBHetz: how did you go about the discovery for the subdomains you've found? (Don't want to repeat anything!)
19:36 πŸ”— kiska What are hostinger's base domains? I can try and zcat a 500GiB DNS file
19:37 πŸ”— betamax they use several
19:38 πŸ”— kiska Can you list them?
19:38 πŸ”— betamax twoTBHetz ^^
19:38 πŸ”— betamax (I can't remember them all)
19:38 πŸ”— twoTBHetz i used a tool called Sublist3r, which uses search engines, anti-virus sites and so on to look for subdomains in use. Sadly google banned me quite early (since i used the brute force option in the beginning); with more ips to spare you can get better results
19:38 πŸ”— kiska right, I'll look in the logs then
19:39 πŸ”— twoTBHetz 16mb.com.on.txt ahol.es.on.txt besaba.com.on.txt pe.hu.on.txt
19:39 πŸ”— twoTBHetz 890m.com.on.txt all_hostinger.txt esy.es.on.txt zz.vc.on.txt
19:39 πŸ”— twoTBHetz 96.lt.on.txt azz.vc.on.txt hol.es.on.txt
19:39 πŸ”— twoTBHetz drop the .on.txt
19:39 πŸ”— kiska Right, give me about 15 mins per domain
19:39 πŸ”— Martle has quit IRC (Ping timeout: 252 seconds)
19:39 πŸ”— betamax looking online, I see: zz.vc wc.lt pe.hu 890m.com 16mb.com vv.si
19:40 πŸ”— kiska Actually give me 30 mins to download the new dns file, mine is somewhat outdated, since its the September 2018 edition
19:40 πŸ”— twoTBHetz betamax, was faster
19:42 πŸ”— kiska .... PM spam...
19:44 πŸ”— twoTBHetz better than channel spam?
19:44 πŸ”— Sk1d has quit IRC (Read error: Operation timed out)
19:45 πŸ”— betamax perhaps someone wants to look into discovery using the wayback CDX?
19:45 πŸ”— betamax for instance: http://web.archive.org/cdx/search/cdx?url=16mb.com&matchType=domain
19:45 πŸ”— betamax (note: big page, my browser just crashed)
19:45 πŸ”— betamax although that's limited to a certain number of results, there are ways round this
19:46 πŸ”— betamax see: https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server
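As a rough sketch of the CDX-based discovery betamax suggests, the snippet below builds a query URL for the endpoint quoted above and pulls hostnames out of the returned lines. The `fl`, `collapse` and `limit` parameters are described in the wayback-cdx-server README linked above; the exact combination used here is an untested assumption, and the helper names are hypothetical. No network request is made in this sketch.

```python
from urllib.parse import urlencode, urlsplit

CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def cdx_query_url(domain, limit=None):
    """Build a CDX query that asks only for the original URL of each capture."""
    params = {
        "url": domain,
        "matchType": "domain",   # include subdomains, as in the URL quoted above
        "fl": "original",        # return only the original URL field
        "collapse": "urlkey",    # assumption: collapse repeated captures of one URL
    }
    if limit is not None:
        params["limit"] = str(limit)
    return CDX_ENDPOINT + "?" + urlencode(params)


def hosts_from_cdx_lines(lines, base_domain):
    """Extract the unique subdomain hostnames of base_domain from CDX output lines."""
    suffix = "." + base_domain
    hosts = set()
    for line in lines:
        url = line.strip()
        if not url:
            continue
        host = urlsplit(url).hostname  # lowercased, port removed
        if host and host.endswith(suffix):
            hosts.add(host)
    return sorted(hosts)


if __name__ == "__main__":
    print(cdx_query_url("16mb.com", limit=1000))
```

Restricting the output to one field keeps the response far smaller than the full-record page that crashed the browser, though large domains would still need the paging options described in the README.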
19:48 πŸ”— kiska Right downloading new data will take 55 mins, so I'll use the old data for now
19:49 πŸ”— jodizzle ivan: I can grab the vimeo, but why do you have trouble handling it, out of curiosity?
19:49 πŸ”— Martle has joined #archiveteam-bs
19:50 πŸ”— Sk1d has joined #archiveteam-bs
19:53 πŸ”— kiska I used sublist3r earlier for the tian crawl, it crashed my network
19:56 πŸ”— kiska twoTBHetz how are the subdomains organised? Are they "*.16mb.com" or are they "16mb.com/*"
19:56 πŸ”— kiska Where * = the user
19:56 πŸ”— twoTBHetz *.16mb.com
19:56 πŸ”— kiska Excellent!
19:57 πŸ”— kiska Right I have ~500GiB of fdns data to grep xD
19:57 πŸ”— twoTBHetz potentially *.*.16mb.com
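To illustrate the `*.16mb.com` versus `*.*.16mb.com` distinction, here is a small sketch that splits a hostname into the labels sitting in front of a base domain. The function name `split_host` is hypothetical; the second example hostname is one that appears later in this log.

```python
def split_host(host, base_domain):
    """Return the labels in front of base_domain, or None if host is not a subdomain."""
    host = host.lower().rstrip(".")
    suffix = "." + base_domain
    if not host.endswith(suffix):
        return None  # the bare domain and lookalikes such as 016mb.com land here
    prefix = host[: -len(suffix)]
    return prefix.split(".")


if __name__ == "__main__":
    print(split_host("foo.16mb.com", "16mb.com"))        # one label
    print(split_host("zelluloid.chs.hol.es", "hol.es"))  # two labels
    print(split_host("016mb.com", "16mb.com"))           # not a subdomain
```

Hosts that come back with more than one label are the nested case, which may or may not be separate sites.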
19:57 πŸ”— kiska O_O
19:57 πŸ”— twoTBHetz kiska how did you get that dataset? i searched for something like that but did not find much
19:58 πŸ”— kiska I am using the sonar dataset
19:58 πŸ”— twoTBHetz ?
19:59 πŸ”— kiska Using the sonar dataset*
20:00 πŸ”— kiska The server I am running the grep from doesn't like me currently, load average: 10.51, 10.07, 7.33
20:01 πŸ”— Sk1d has quit IRC (Read error: Operation timed out)
20:06 πŸ”— ivan jodizzle: I just don't have the tools set up to copy vimeo over to IA
20:06 πŸ”— Sk1d has joined #archiveteam-bs
20:06 πŸ”— ivan and my time is already accounted for doing youtube and twitter
20:08 πŸ”— betamax ivan: where are you putting the videos? Is there a midterm 2018 collection?
20:08 πŸ”— ivan I can handle more youtube easily and am especially interested in notable non-English content and also channels that have started up in the last year or two
20:08 πŸ”— betamax (thinking of where to put the ~5,800 warc files I've got)
20:08 πŸ”— ivan betamax: I'll PM you
20:10 πŸ”— jodizzle Oh of course, I don't have any tooling set up either.
20:11 πŸ”— balrog ivan: how are you ingesting youtube into IA?
20:11 πŸ”— jodizzle But I can download the videos for now at least, I have the space (and it doesn't look like that much in the first place).
20:11 πŸ”— balrog manually?
20:11 πŸ”— balrog (since archivebot's youtube-dl doesn't work anymore)
20:11 πŸ”— wp494 has quit IRC (Ping timeout: 268 seconds)
20:12 πŸ”— wp494 has joined #archiveteam-bs
20:14 πŸ”— ivan balrog: a lot of custom software
20:16 πŸ”— JAA balrog: Which wpull version, and is it reproducible?
20:17 πŸ”— balrog JAA: current develop, and yes using wpull and curl
20:18 πŸ”— balrog but only on that machine
20:18 πŸ”— kiska twoTBHetz here is 16mb.com data: https://pastebin.com/JKesVkRA
20:20 πŸ”— twoTBHetz i will try to parse the json and then sort | uniq them; only the name attribute is interesting
20:21 πŸ”— twoTBHetz right?
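The "parse the JSON, keep only the name, then sort | uniq" step could look something like the sketch below. It assumes one JSON object per line with a `name` field, which matches the sample record kiska pastes later in this log; the function name `unique_names` is hypothetical.

```python
import json


def unique_names(lines):
    """Return the sorted, de-duplicated 'name' values from JSON-lines input."""
    names = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        name = record.get("name") if isinstance(record, dict) else None
        if isinstance(name, str):
            names.add(name.lower())
    return sorted(names)


if __name__ == "__main__":
    sample = [
        '{"timestamp":"1541149565","name":"zelluloid.chs.hol.es","type":"a","value":"198.252.107.62"}',
        '{"timestamp":"1541149565","name":"zelluloid.chs.hol.es","type":"a","value":"198.252.107.62"}',
    ]
    print(unique_names(sample))
```

Parsing each line as JSON rather than cutting on quotes avoids the kind of mis-parse mentioned a little later in the conversation.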
20:21 πŸ”— JAA balrog: Hmm, that's annoying. I really need a properly reproducible example of these things.
20:22 πŸ”— kiska JAA: How did you get the data from my fdns stuff with the tian crawl we did earlier?
20:23 πŸ”— Sk1d has quit IRC (Read error: Operation timed out)
20:23 πŸ”— kiska twoTBHetz The ahol.es data: https://pastebin.com/TJP5WyHg
20:26 πŸ”— Sk1d has joined #archiveteam-bs
20:26 πŸ”— kiska The besaba data: https://pastebin.com/ByxUWzLH
20:26 πŸ”— kiska Have fun!
20:30 πŸ”— twoTBHetz I am having fun i have collected more 16mb domains than you
20:31 πŸ”— kiska D
20:31 πŸ”— kiska xD*
20:31 πŸ”— twoTBHetz i will merge them for profit
20:32 πŸ”— kiska The dataset I am using isn't up to date, and the other data set I have access to, is well 4x as large and my server is currently out of disk space
20:33 πŸ”— twoTBHetz i parsed the json wrong ...
20:33 πŸ”— twoTBHetz I got enough diskspace for the dataset where would i get it?
20:35 πŸ”— twoTBHetz mhh i parsed the json wrong ... i will see if i was really better
20:36 πŸ”— kiska You don't, its my university dns server
20:36 πŸ”— twoTBHetz ah ... and since you are not in my time zone ...
20:38 πŸ”— kiska And besides the data file is >1TiB
20:39 πŸ”— twoTBHetz i got <3TB
20:40 πŸ”— twoTBHetz but yeah transfer would not be fun
20:40 πŸ”— kiska Here is the 890m.com data: https://pastebin.com/WXbDJLq0
20:40 πŸ”— Sk1d has quit IRC (Read error: Operation timed out)
20:45 πŸ”— Sk1d has joined #archiveteam-bs
20:45 πŸ”— betamax there is the question of what happens after discovery
20:45 πŸ”— betamax I do think that provided that there is a significant number of URLs, an archivebot job would be best
20:45 πŸ”— twoTBHetz betamax you mean with hostinger?
20:45 πŸ”— betamax *warrior job
20:46 πŸ”— betamax not archivebot
20:46 πŸ”— twoTBHetz whatever it is it needs a crawler/spider element to it
20:46 πŸ”— betamax but that is more the speciality of arkiver ... (who is awesome at writing the scripts used by seesaw)
20:46 πŸ”— betamax cc arkiver ^^
20:47 πŸ”— kiska twoTBHetz this is the dataset: https://ant.isi.edu/datasets/ if you want to request it, by all means request it
20:47 πŸ”— kiska There is a 99% chance they'll decline your request
20:48 πŸ”— twoTBHetz nicee stuff
20:49 πŸ”— kiska And here is the programme: https://www.impactcybertrust.org/
20:54 πŸ”— Mateon1 has quit IRC (Ping timeout: 492 seconds)
20:54 πŸ”— Mateon1 has joined #archiveteam-bs
20:55 πŸ”— twoTBHetz nice ...
20:57 πŸ”— Sk1d has quit IRC (Read error: Operation timed out)
21:01 πŸ”— Sk1d has joined #archiveteam-bs
21:12 πŸ”— twoTBHetz kiska your set contains stuff which i have not found
21:12 πŸ”— kiska Yay!
21:13 πŸ”— Sk1d has quit IRC (Read error: Operation timed out)
21:14 πŸ”— kiska A rescan with the new data set on 16mb.com: https://pastebin.com/cTsYR7rb
21:16 πŸ”— twoTBHetz of course ;)
21:17 πŸ”— twoTBHetz now that i finished preprocessing 16mb.com
21:17 πŸ”— twoTBHetz do you know why it is called 16mb.com
21:17 πŸ”— kiska 890m.com data reprocessed: https://pastebin.com/JrMwsaMx
21:18 πŸ”— Sk1d has joined #archiveteam-bs
21:18 πŸ”— twoTBHetz because the smallest possible drupal host runs just under 16 mb and that's how much ram you get with hostinger free hosting
21:24 πŸ”— twoTBHetz kiska any ETAs on more files?
21:25 πŸ”— kiska the 96.lt is grabbing more than I expected and its grepping useless data from not hostinger
21:26 πŸ”— twoTBHetz you had some false positives in 16mb.com too, namely 016mb.com
21:26 πŸ”— kiska Yes I am refining my regex
21:27 πŸ”— kiska How would you like 100MB of json data?
21:27 πŸ”— twoTBHetz no major problem
21:27 πŸ”— twoTBHetz but i might stop working on my netbook then
21:28 πŸ”— kiska ...
21:28 πŸ”— kiska I may have just crashed pastebin
21:29 πŸ”— twoTBHetz how long you plan to stay awake?
21:29 πŸ”— kiska Hahaha "413 Request Entity Too Large"
21:29 πŸ”— kiska Size is 140895KiB
21:30 πŸ”— kiska There is 1.3m lines...
21:31 πŸ”— twoTBHetz use another way of transportation
21:34 πŸ”— Sk1d has quit IRC (Read error: Operation timed out)
21:34 πŸ”— twoTBHetz what is in the 100MB json
21:35 πŸ”— kiska Those custom domains
21:35 πŸ”— kiska https://pastebin.com/cieAFEd3
21:35 πŸ”— kiska part 1 ^
21:36 πŸ”— twoTBHetz of ?
21:37 πŸ”— Sk1d has joined #archiveteam-bs
21:38 πŸ”— kiska https://transfer.sh/C4pJL/hostinger-any-data
21:38 πŸ”— kiska Here is all of the data in 1 big download xD
21:38 πŸ”— kiska Have fun!
21:38 πŸ”— twoTBHetz awesome
21:38 πŸ”— twoTBHetz what about esy.es and stuff, is it in there too?
21:39 πŸ”— kiska probably
21:40 πŸ”— twoTBHetz mhh i will need to verify that those are ok
21:40 πŸ”— twoTBHetz scrolling over does not help here anymore
21:40 πŸ”— twoTBHetz i assume thats everything?
21:40 πŸ”— kiska Most likely
21:40 πŸ”— kiska I used hostinger as my grep
21:41 πŸ”— kiska Notepad++ is hanging
21:41 πŸ”— kiska xD
21:41 πŸ”— twoTBHetz i never managed that
21:42 πŸ”— kiska Looks like it doesn't have any of the ahol.es stuff
21:43 πŸ”— twoTBHetz mhh i will sort it out. nobody will miss some hostinger sites
21:43 πŸ”— kiska Can you give me a sample ahol.es site? My ahol.es grep didn't find any either
21:44 πŸ”— twoTBHetz please
21:44 πŸ”— kiska Oh... there is no a in ahol.es, its just hol.es
21:45 πŸ”— twoTBHetz oh my bad
21:45 πŸ”— twoTBHetz check if it is in hostinger now
21:46 πŸ”— twoTBHetz weird i always read ahol.es ...
21:46 πŸ”— kiska Yep there is a ton
21:46 πŸ”— twoTBHetz good
21:46 πŸ”— twoTBHetz that should be helpful
21:47 πŸ”— kiska I am guessing that "azz.vc" doesn't exist?
21:47 πŸ”— kiska And its zz.vc?
21:49 πŸ”— twoTBHetz right zz.vc
21:49 πŸ”— twoTBHetz no idea how that happened
21:50 πŸ”— Sk1d has quit IRC (Read error: Operation timed out)
21:50 πŸ”— kiska Here is the besaba data: https://pastebin.com/67jLzDjn
21:51 πŸ”— twoTBHetz isn't that in hostinger too?
21:52 πŸ”— kiska Probably
21:52 πŸ”— kiska Its good to get it anyway
21:53 πŸ”— twoTBHetz i am so glad we are in "data set fits in ram" terrain
21:53 πŸ”— kiska doing zcat on the data, it definitely DOES NOT fit in RAM
21:53 πŸ”— twoTBHetz 200 MB should fit in RAM
21:53 πŸ”— kiska Compressed the dataset is 26219779386 bytes
21:54 πŸ”— kiska Oh you mean that dataset...
21:54 πŸ”— kiska xD
21:54 πŸ”— Sk1d has joined #archiveteam-bs
21:58 πŸ”— twoTBHetz ahh i mean once we are down to hostinger
21:58 πŸ”— twoTBHetz 26 gig fit in ram too :P
22:00 πŸ”— kiska Not when Archivebot is being run on the same machine xD
22:00 πŸ”— kiska And certainly not when its uncompressed
22:00 πŸ”— twoTBHetz the latter yes
22:01 πŸ”— kiska Cause this dataset is ~24GiB compressed, ~500GiB uncompressed
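Because the dataset is far larger uncompressed than compressed, the zcat-and-grep approach reads it as a stream instead of loading it into memory. A minimal Python equivalent, assuming a gzip-compressed text file with one record per line (the path and function name are hypothetical), might be:

```python
import gzip


def iter_matching_lines(path, needle):
    """Yield lines of a gzip-compressed text file that contain needle.

    The file is decompressed incrementally, so memory use stays small
    regardless of the uncompressed size.
    """
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if needle in line:
                yield line.rstrip("\n")


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python scan.py fdns_any.json.gz .hol.es
    if len(sys.argv) == 3:
        for match in iter_matching_lines(sys.argv[1], sys.argv[2]):
            print(match)
```

A plain substring test like this is only a coarse first pass; it still lets through lookalike names that a stricter hostname check would reject.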
22:01 πŸ”— twoTBHetz you told me
22:07 πŸ”— twoTBHetz kiska anyway thanks a lot
22:08 πŸ”— twoTBHetz JAA i got a lot more links i can process, will announce once done.
22:13 πŸ”— kiska twoTBHetz: Here is the esy.es data: https://pastebin.com/SJ3VksPJ
22:14 πŸ”— twoTBHetz kiska thanks
22:15 πŸ”— kiska I've just seen some data I think that isn't included in the hostinger file
22:15 πŸ”— twoTBHetz ok i will merge them all
22:15 πŸ”— kiska I am grepping the data again to make sure my eyes aren't deceiving me
22:15 πŸ”— twoTBHetz sort will cry :)
22:16 πŸ”— kiska xD
22:19 πŸ”— kiska Nah sort will be fine!
22:24 πŸ”— m007a83_ has joined #archiveteam-bs
22:25 πŸ”— m007a83 has quit IRC (Ping timeout: 252 seconds)
22:27 πŸ”— Sk1d has quit IRC (Read error: Operation timed out)
22:30 πŸ”— Sk1d has joined #archiveteam-bs
22:37 πŸ”— godane has quit IRC (Read error: Operation timed out)
22:41 πŸ”— balrog JAA: did my PM go through?
22:43 πŸ”— kiska I hate regex....
22:44 πŸ”— Sk1d has quit IRC (Read error: Operation timed out)
22:45 πŸ”— tuluu has quit IRC (Quit: No Ping reply in 180 seconds.)
22:46 πŸ”— twoTBHetz i hate not working NATs
22:46 πŸ”— Sk1d has joined #archiveteam-bs
22:46 πŸ”— kiska xD
22:48 πŸ”— kiska Unique data in hol.es with "{"timestamp":"1541149565","name":"zelluloid.chs.hol.es","type":"a","value":"198.252.107.62"}"
22:48 πŸ”— kiska So here is hol.es: https://pastebin.com/xw3bvJ4A
22:48 πŸ”— kiska Sort through it, I've given up on regex
22:49 πŸ”— tuluu has joined #archiveteam-bs
22:49 πŸ”— twoTBHetz what pre processing did you do and have now stopped doing?
22:49 πŸ”— kiska I used grep .hol.es so it has removed most of the useless data
22:49 πŸ”— twoTBHetz useful, thanks
22:50 πŸ”— kiska But it still gets stuff like "zuqiubisaijuesha.holmesmail.com"
22:50 πŸ”— twoTBHetz HOW?
22:50 πŸ”— kiska Which is understandable since .hol.es matches the regex
22:50 πŸ”— twoTBHetz is there no way to escape the .
22:50 πŸ”— kiska I thought that ".\bhol.es" would remove it, but instead it includes it
22:51 πŸ”— kiska Pretty sure I can do "\." to escape it, but going through the 24GiB file takes 20 mins to do
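The false positive comes from the unescaped dots: in a regular expression `.` matches any character, so `.hol.es` also matches the `.holmes` inside `holmesmail.com`. A small sketch of the difference, applied to already-extracted hostnames (the helper name `subdomain_regex` is hypothetical):

```python
import re


def subdomain_regex(base_domain):
    """Compile a pattern matching hostnames that are strict subdomains of base_domain."""
    # re.escape turns each '.' into '\.', so it only matches a literal dot;
    # '$' anchors the match to the end of the hostname.
    return re.compile(r"\." + re.escape(base_domain) + r"$")


loose = re.compile(".hol.es")        # unescaped: '.' matches any character
strict = subdomain_regex("hol.es")   # equivalent to r"\.hol\.es$"

if __name__ == "__main__":
    for host in ("zelluloid.chs.hol.es", "zuqiubisaijuesha.holmesmail.com"):
        print(host, bool(loose.search(host)), bool(strict.search(host)))
```

The same escaping also rules out the `016mb.com` lookalike mentioned earlier. When matching against raw JSON lines rather than extracted hostnames, the end anchor would need to be the closing quote of the field instead of `$`.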
22:51 πŸ”— kiska I started doing this at 7am my time(Australia, Sydney)
22:51 πŸ”— kiska Its now almost 10am
22:52 πŸ”— twoTBHetz get your well-earned sleep
22:53 πŸ”— godane has joined #archiveteam-bs
22:58 πŸ”— kiska Well I fixed my regex...
22:58 πŸ”— Sk1d has quit IRC (Read error: Operation timed out)
23:00 πŸ”— kiska My regex was correct. The command I used wasn't
23:01 πŸ”— Sk1d has joined #archiveteam-bs
23:03 πŸ”— BlueMax has joined #archiveteam-bs
23:16 πŸ”— Sk1d has quit IRC (Read error: Operation timed out)
23:20 πŸ”— Sk1d has joined #archiveteam-bs
23:33 πŸ”— Sk1d has quit IRC (Read error: Operation timed out)
23:36 πŸ”— pino_p has joined #archiveteam-bs
23:37 πŸ”— Sk1d has joined #archiveteam-bs
23:41 πŸ”— pino_p If I've found a hol.es subdomain I want to make sure is included, should I just suggest it here?
23:44 πŸ”— pino_p http://famicompo.hol.es/ has a lot of music composed or arranged by Nintendo Entertainment System fans
23:49 πŸ”— pino_p has quit IRC (Quit: Leaving)
23:51 πŸ”— Sk1d has quit IRC (Read error: Operation timed out)
23:56 πŸ”— Sk1d has joined #archiveteam-bs
23:58 πŸ”— twoTBHetz pino_p i will take a look
