Time |
Nickname |
Message |
00:00
π
|
|
godane has quit IRC (Quit: Leaving.) |
00:16
π
|
SketchCow |
Glad to see all the editing going on on the Wiki |
00:19
π
|
|
twoTBHetz has joined #archiveteam-bs |
00:19
π
|
|
kyounko has joined #archiveteam-bs |
00:22
π
|
astrid |
twoTBHetz: it is in the /topic |
00:24
π
|
twoTBHetz |
I my discussion not "long" or "lengthy" so i did not think i was in the wrong place |
00:24
π
|
twoTBHetz |
*I found my |
00:25
π
|
twoTBHetz |
If the consensus is that i was wrong (which it seems) i will do better in the future but for somebody who is new here it is not easy to infer those rules. |
00:26
π
|
twoTBHetz |
astrid have i overlooked anything? |
00:27
π
|
* |
astrid sighs |
00:27
π
|
astrid |
there are 230 people in that channel |
00:27
π
|
astrid |
we try to keep discussions to a half dozen lines or so |
00:28
π
|
astrid |
don't take it personally |
00:28
π
|
astrid |
make sense? |
00:30
π
|
twoTBHetz |
Thanks for the explanation. I am used to smaller channels. |
00:31
π
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
00:34
π
|
|
Sk1d has joined #archiveteam-bs |
01:06
π
|
|
godane has joined #archiveteam-bs |
01:08
π
|
Kaz |
-bs alarm goes off in my head at around 3 lines |
01:10
π
|
twoTBHetz |
Kaz what do you mean by that? |
01:13
π
|
Kaz |
More than 3 lines in #archiveteam is when conversation needs to move to -bs, imo |
01:14
π
|
Kaz |
Context is everything, though |
01:18
π
|
Ryz |
Woo, my efforts have been paying off! o: |
01:21
π
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
01:26
π
|
|
Sk1d has joined #archiveteam-bs |
01:29
π
|
SketchCow |
Kudos to twoTBHetz for using -bs as an appeals court for the #archiveteam channel |
01:37
π
|
|
Martle has joined #archiveteam-bs |
01:40
π
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
01:42
π
|
|
Sk1d has joined #archiveteam-bs |
01:43
π
|
|
VADemon has joined #archiveteam-bs |
01:43
π
|
|
VerifiedJ has quit IRC (Quit: Leaving) |
01:56
π
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
01:59
π
|
|
Sk1d has joined #archiveteam-bs |
02:00
π
|
|
Ryz has quit IRC (Quit: ChatZilla 0.9.92-rdmsoft [XULRunner 35.0.1/20150122214805]) |
02:42
π
|
|
BlueMax has joined #archiveteam-bs |
03:08
π
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
03:09
π
|
|
dashcloud has quit IRC (Remote host closed the connection) |
03:10
π
|
|
dashcloud has joined #archiveteam-bs |
03:11
π
|
|
Sk1d has joined #archiveteam-bs |
03:19
π
|
|
bithippo has joined #archiveteam-bs |
03:23
π
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
03:26
π
|
|
Sk1d has joined #archiveteam-bs |
04:20
π
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
04:23
π
|
|
Sk1d has joined #archiveteam-bs |
04:39
π
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
04:42
π
|
|
qw3rty115 has joined #archiveteam-bs |
04:43
π
|
|
Sk1d has joined #archiveteam-bs |
04:48
π
|
|
qw3rty114 has quit IRC (Ping timeout: 600 seconds) |
04:57
π
|
|
odemg has quit IRC (Ping timeout: 246 seconds) |
05:05
π
|
|
Ryz has joined #archiveteam-bs |
05:10
π
|
|
odemg has joined #archiveteam-bs |
05:11
π
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
05:14
π
|
|
Sk1d has joined #archiveteam-bs |
05:16
π
|
|
kyounko has quit IRC (Read error: Connection reset by peer) |
05:26
π
|
|
mgrytbak_ has quit IRC (Read error: Connection reset by peer) |
05:30
π
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
05:35
π
|
|
Sk1d has joined #archiveteam-bs |
05:42
π
|
|
adinbied has quit IRC (Read error: Operation timed out) |
05:46
π
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
05:48
π
|
|
mgrytbak^ has joined #archiveteam-bs |
05:50
π
|
|
adinbied has joined #archiveteam-bs |
05:52
π
|
|
Sk1d has joined #archiveteam-bs |
06:04
π
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
06:06
π
|
|
Ryz has quit IRC (hub.efnet.us west.us.hub) |
06:06
π
|
|
VADemon has quit IRC (hub.efnet.us west.us.hub) |
06:06
π
|
|
Mateon1 has quit IRC (hub.efnet.us west.us.hub) |
06:06
π
|
|
twoTBHetz has quit IRC (hub.efnet.us west.us.hub) |
06:06
π
|
|
achip has quit IRC (hub.efnet.us west.us.hub) |
06:06
π
|
|
me has quit IRC (Read error: Operation timed out) |
06:09
π
|
|
me has joined #archiveteam-bs |
06:10
π
|
|
Sk1d has joined #archiveteam-bs |
06:11
π
|
|
Ryz has joined #archiveteam-bs |
06:11
π
|
|
VADemon has joined #archiveteam-bs |
06:11
π
|
|
twoTBHetz has joined #archiveteam-bs |
06:11
π
|
|
Mateon1 has joined #archiveteam-bs |
06:11
π
|
|
achip has joined #archiveteam-bs |
06:13
π
|
|
DFJustin has quit IRC (Ping timeout: 260 seconds) |
06:15
π
|
|
wp494 has quit IRC (Ping timeout: 265 seconds) |
06:15
π
|
|
wp494 has joined #archiveteam-bs |
06:20
π
|
|
icedice has quit IRC (Quit: Leaving) |
06:25
π
|
|
DFJustin has joined #archiveteam-bs |
06:25
π
|
|
swebb sets mode: +o DFJustin |
06:56
π
|
|
Stiletto has joined #archiveteam-bs |
06:59
π
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
07:00
π
|
|
Stilett0 has quit IRC (Read error: Operation timed out) |
07:04
π
|
|
Sk1d has joined #archiveteam-bs |
07:33
π
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
07:35
π
|
|
twoTBHetz has quit IRC (Read error: Operation timed out) |
07:37
π
|
|
Sk1d has joined #archiveteam-bs |
07:51
π
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
07:55
π
|
|
Sk1d has joined #archiveteam-bs |
08:04
π
|
|
twoTBHetz has joined #archiveteam-bs |
08:54
π
|
|
hiroi has joined #archiveteam-bs |
09:46
π
|
|
twoTBHetz has quit IRC (Read error: Operation timed out) |
09:57
π
|
|
twoTBHetz has joined #archiveteam-bs |
10:36
π
|
|
Ryz has quit IRC (Quit: ChatZilla 0.9.92-rdmsoft [XULRunner 35.0.1/20150122214805]) |
11:18
π
|
betamax |
OK, so I have some lists of videos from campaign sites |
11:18
π
|
betamax |
more specifically, lists of URLs that contain the word 'youtube' |
11:18
π
|
betamax |
so quite a few might not actually be youtube |
11:18
π
|
betamax |
https://transfer.sh/xjK12/midterm-2018-youtube.txt |
11:19
π
|
betamax |
note that some of the urls start with '//' because the 'https:' was filtered off but not the '//' |
11:19
π
|
betamax |
sorry about that |
11:19
π
|
betamax |
oh, and here's one for vimeo: |
11:19
π
|
betamax |
https://transfer.sh/WQYAs/midterm-2018-vimeo.txt |
11:24
π
|
betamax |
cc ivan ^^ |
11:29
π
|
betamax |
oh, you may want to dedupe those files |
11:29
π
|
betamax |
just checked, and a lot are duplicates |
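The cleanup betamax describes for these lists (restoring the scheme on URLs that start with '//' and dropping duplicates) could be sketched roughly as below. This is only an illustration: the function name is hypothetical, and it assumes the stripped scheme was always 'https:', as betamax's note suggests.

```python
def normalize_and_dedupe(lines):
    """Return URLs in first-seen order, with the scheme restored and duplicates dropped."""
    seen = set()
    result = []
    for line in lines:
        url = line.strip()
        if not url:
            continue  # skip blank lines
        if url.startswith("//"):
            # Assumption: the missing scheme was "https:" (see betamax's note above).
            url = "https:" + url
        if url not in seen:
            seen.add(url)
            result.append(url)
    return result
```

Keeping first-seen order (rather than sorting) keeps the output easy to compare against the original file.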
11:29
π
|
PurpleSym |
betamax: The chromebot grab for Facebook is done. Working on Twitter right now, but it'll take a few days. |
11:30
π
|
betamax |
great! I got all the tweets into archivebot as well |
11:33
π
|
Flashfire |
I may have footage of the stabbing that happened in Melbourne but the person who has it won't send it to me so it will be screen recorded |
11:33
π
|
betamax |
here's some more urls containing 'twitter' (note ~75MB) https://transfer.sh/b2Vco/midterm-2018-twitter-extract.txt |
11:33
π
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
11:37
π
|
|
Sk1d has joined #archiveteam-bs |
11:51
π
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
11:52
π
|
|
twoTBHetz has quit IRC (Read error: Operation timed out) |
11:54
π
|
|
Sk1d has joined #archiveteam-bs |
12:03
π
|
|
twoTBHetz has joined #archiveteam-bs |
12:06
π
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
12:12
π
|
|
Sk1d has joined #archiveteam-bs |
12:16
π
|
|
twoTBHetz has quit IRC (Read error: Operation timed out) |
12:23
π
|
|
BlueMax has quit IRC (Read error: Connection reset by peer) |
12:26
π
|
|
twoTBHetz has joined #archiveteam-bs |
12:26
π
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
12:29
π
|
|
Sk1d has joined #archiveteam-bs |
13:15
π
|
|
Gtyy has joined #archiveteam-bs |
13:15
π
|
|
Gtyy has quit IRC (Client Quit) |
13:49
π
|
godane |
SketchCow: latest scan : https://archive.org/details/pc-computing-magazine-v9i2 |
14:00
π
|
|
vitzli has joined #archiveteam-bs |
14:29
π
|
|
Mateon1 has quit IRC (Read error: Connection reset by peer) |
14:29
π
|
|
Mateon1 has joined #archiveteam-bs |
14:44
π
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
14:46
π
|
|
Sk1d has joined #archiveteam-bs |
15:01
π
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
15:04
π
|
|
Sk1d has joined #archiveteam-bs |
15:28
π
|
|
bithippo has quit IRC (Textual IRC Client: www.textualapp.com) |
15:47
π
|
|
DFJustin has quit IRC (Ping timeout: 260 seconds) |
15:48
π
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
15:50
π
|
|
schbirid has joined #archiveteam-bs |
15:50
π
|
|
Sk1d has joined #archiveteam-bs |
15:51
π
|
ivan |
thanks for the new list of youtube, I am processing it |
15:52
π
|
ivan |
betamax: someone else can handle vimeo better than I can |
15:59
π
|
|
DFJustin has joined #archiveteam-bs |
15:59
π
|
|
swebb sets mode: +o DFJustin |
16:03
π
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
16:04
π
|
balrog |
has anyone done any looking into mail-archive.com? |
16:07
π
|
|
Sk1d has joined #archiveteam-bs |
16:12
π
|
|
twoTBHetz has quit IRC (Read error: Operation timed out) |
16:14
π
|
|
LFlare has quit IRC (Quit: The Lounge - https://thelounge.chat) |
16:22
π
|
|
twoTBHetz has joined #archiveteam-bs |
16:24
π
|
|
VerifiedJ has joined #archiveteam-bs |
16:32
π
|
|
vitzli has quit IRC (Quit: Leaving) |
16:32
π
|
godane |
i'm reuploading v9i2 of pc computing cause one of the pages was not flipped |
16:33
π
|
godane |
i also deleted the derived files cause new derived files were not being made for some reason, from what i can tell |
16:37
π
|
|
jspiros has quit IRC (leaving) |
16:48
π
|
|
chfoo has quit IRC (Read error: Operation timed out) |
16:50
π
|
|
chfoo has joined #archiveteam-bs |
16:55
π
|
|
Martle has quit IRC (Quit: Leaving) |
17:05
π
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
17:08
π
|
|
Sk1d has joined #archiveteam-bs |
17:12
π
|
|
jspiros has joined #archiveteam-bs |
17:37
π
|
|
Martle has joined #archiveteam-bs |
17:37
π
|
|
vectr0n has quit IRC (Quit: ZNC - https://znc.in) |
17:43
π
|
|
Martle has quit IRC (Read error: Connection reset by peer) |
17:44
π
|
|
Martle has joined #archiveteam-bs |
17:51
π
|
|
vectr0n has joined #archiveteam-bs |
18:05
π
|
|
Ryz has joined #archiveteam-bs |
18:07
π
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
18:09
π
|
|
Sk1d has joined #archiveteam-bs |
18:12
π
|
twoTBHetz |
Ryz, what can be done about that? |
18:13
π
|
Ryz |
Archive the website it has? |
18:13
π
|
Ryz |
Prima Games has a website - https://www.primagames.com/ |
18:14
π
|
Ryz |
It looks different than the last time I even bothered to look at the website |
18:18
π
|
godane |
i'm reuploading v9i3 of pc computing cause i forgot to black out the address on cover |
18:21
π
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
18:26
π
|
|
Sk1d has joined #archiveteam-bs |
18:32
π
|
Ryz |
Also, maybe archiving Prima Games guides |
18:32
π
|
Ryz |
Those strategy guides specifically |
18:40
π
|
|
Kenshin has quit IRC (Ping timeout: 260 seconds) |
18:40
π
|
|
Kenshin has joined #archiveteam-bs |
18:48
π
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
18:51
π
|
twoTBHetz |
It seems to be quite well archived ... the starting page at least |
18:54
π
|
|
Sk1d has joined #archiveteam-bs |
19:07
π
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
19:12
π
|
|
Sk1d has joined #archiveteam-bs |
19:28
π
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
19:31
π
|
betamax |
so I've been thinking about Hostinger |
19:32
π
|
betamax |
I'm wondering about the possibility of doing a warrior project |
19:32
π
|
|
Sk1d has joined #archiveteam-bs |
19:35
π
|
betamax |
this would probably depend on discovering a sufficient number of subdomains |
19:35
π
|
betamax |
twoTBHetz: how did you go about the discovery for the subdomains you've found? (Don't want to repeat anything!) |
19:36
π
|
kiska |
What are hostinger's base domains? I can try and zcat a 500GiB DNS file |
19:37
π
|
betamax |
they use several |
19:37
π
|
kiska |
Can you list one of them? |
19:38
π
|
kiska |
Can you list them? * |
19:38
π
|
betamax |
twoTBHetz ^^ |
19:38
π
|
betamax |
(I can't remember them all) |
19:38
π
|
twoTBHetz |
i used a tool called Sublist3r, which uses search engines, antivirus sites and the like to look for used subdomains. Sadly google banned me quite early (since i used the brute force option in the beginning); with more IPs to spare you can get better results |
19:38
π
|
kiska |
right, I'll look in the logs then |
19:39
π
|
twoTBHetz |
16mb.com.on.txt ahol.es.on.txt besaba.com.on.txt pe.hu.on.txt |
19:39
π
|
twoTBHetz |
890m.com.on.txt all_hostinger.txt esy.es.on.txt zz.vc.on.txt |
19:39
π
|
twoTBHetz |
96.lt.on.txt azz.vc.on.txt hol.es.on.txt |
19:39
π
|
twoTBHetz |
drop the .on.txt |
19:39
π
|
kiska |
Right, give me about 15 mins per domain |
19:39
π
|
|
Martle has quit IRC (Ping timeout: 252 seconds) |
19:39
π
|
betamax |
looking online, I see: zz.vc wc.lt pe.hu 890m.com 16mb.com vv.si |
19:40
π
|
kiska |
Actually give me 30 mins to download the new dns file, mine is somewhat outdated, since its the September 2018 edition |
19:40
π
|
twoTBHetz |
betamax, was faster |
19:42
π
|
kiska |
.... PM spam... |
19:44
π
|
twoTBHetz |
better than channel spam? |
19:44
π
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
19:45
π
|
betamax |
perhaps someone wants to look into discovery using the wayback CDX? |
19:45
π
|
betamax |
for instance: http://web.archive.org/cdx/search/cdx?url=16mb.com&matchType=domain |
19:45
π
|
betamax |
(note: big page, my browser just crashed) |
19:45
π
|
betamax |
although that's limited to a certain number of results, there are ways round this |
19:46
π
|
betamax |
see: https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server |
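A rough sketch of the CDX-based discovery betamax suggests: build a query for everything under one base domain, then reduce the returned URLs to a set of hostnames. The `url`, `matchType`, `fl`, `collapse` and `limit` parameters are documented for the wayback CDX server linked above; the function names, and the choice to request only the `original` field, are assumptions made for illustration. Nothing here performs the HTTP request, and working round the result cap is left out.

```python
from urllib.parse import urlencode, urlsplit

CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def build_cdx_query(base_domain, limit=None):
    """Build a CDX query URL listing captures for base_domain and its subdomains."""
    params = {
        "url": base_domain,
        "matchType": "domain",   # include all subdomains
        "fl": "original",        # return only the original URL column
        "collapse": "urlkey",    # at most one row per distinct URL
    }
    if limit is not None:
        params["limit"] = str(limit)
    return CDX_ENDPOINT + "?" + urlencode(params)


def hosts_from_cdx_lines(lines, base_domain):
    """Extract the unique subdomain hostnames from lines of `original` URLs."""
    suffix = "." + base_domain.lower()
    hosts = set()
    for line in lines:
        url = line.strip()
        if not url:
            continue
        host = (urlsplit(url).hostname or "").lower()
        if host.endswith(suffix):
            hosts.add(host)
    return sorted(hosts)
```

Collapsing on `urlkey` and asking for a single field should keep the response much smaller than the raw page betamax mentions, though large domains would still need paging.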
19:48
π
|
kiska |
Right downloading new data will take 55 mins, so I'll use the old data for now |
19:49
π
|
jodizzle |
ivan: I can grab the vimeo, but why do you have trouble handling it, out of curiosity? |
19:49
π
|
|
Martle has joined #archiveteam-bs |
19:50
π
|
|
Sk1d has joined #archiveteam-bs |
19:53
π
|
kiska |
I used sublist3r earlier for the tian crawl, it crashed my network |
19:56
π
|
kiska |
twoTBHetz how are the subdomains organised? Are they "*.16mb.com" or are they "16mb.com/*" |
19:56
π
|
kiska |
Where * = the user |
19:56
π
|
twoTBHetz |
*.16mb.com |
19:56
π
|
kiska |
Excellent! |
19:57
π
|
kiska |
Right I have ~500GiB of fdns data to grep xD |
19:57
π
|
twoTBHetz |
potentially *.*.16mb.com |
19:57
π
|
kiska |
O_O |
19:57
π
|
twoTBHetz |
kiska how did you get that dataset? i searched for something like that but did not find much |
19:58
π
|
kiska |
I am using my the sonar dataset |
19:58
π
|
twoTBHetz |
? |
19:59
π
|
kiska |
Using the sonar dataset* |
20:00
π
|
kiska |
The server I am running the grep from doesn't like me currently, load average: 10.51, 10.07, 7.33 |
20:01
π
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
20:06
π
|
ivan |
jodizzle: I just don't have the tools set up to copy vimeo over to IA |
20:06
π
|
|
Sk1d has joined #archiveteam-bs |
20:06
π
|
ivan |
and my time is already accounted for doing youtube and twitter |
20:08
π
|
betamax |
ivan: where are you putting the videos? Is there a midterm 2018 collection? |
20:08
π
|
ivan |
I can handle more youtube easily and am especially interested in notable non-English content and also channels that have started up in the last year or two |
20:08
π
|
betamax |
(thinking of where to put the ~5,800 warc files I've got) |
20:08
π
|
ivan |
betamax: I'll PM you |
20:10
π
|
jodizzle |
Oh of course, I don't have any tooling set up either. |
20:11
π
|
balrog |
ivan: how are you ingesting youtube into IA? |
20:11
π
|
jodizzle |
But I can download the videos for now at least, I have the space (and it doesn't look like that much in the first place). |
20:11
π
|
balrog |
manually? |
20:11
π
|
balrog |
(since archivebot's youtube-dl doesn't work anymore) |
20:11
π
|
|
wp494 has quit IRC (Ping timeout: 268 seconds) |
20:12
π
|
|
wp494 has joined #archiveteam-bs |
20:14
π
|
ivan |
balrog: a lot of custom software |
20:16
π
|
JAA |
balrog: Which wpull version, and is it reproducible? |
20:17
π
|
balrog |
JAA: current develop, and yes using wpull and curl |
20:18
π
|
balrog |
but only on that machine |
20:18
π
|
kiska |
twoTBHetz here is 16mb.com data: https://pastebin.com/JKesVkRA |
20:20
π
|
twoTBHetz |
i will try to parse the json and then sort | uniq them; only the name attribute is interesting |
20:21
π
|
twoTBHetz |
right? |
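The step twoTBHetz outlines (parse each JSON record, keep only the name attribute, then sort and de-duplicate) might look something like the sketch below. It assumes one JSON object per line with a "name" key, matching the record kiska quotes later in this log; the function name is hypothetical.

```python
import json


def unique_names(json_lines):
    """Return the sorted, de-duplicated "name" values from JSON-lines records."""
    names = set()
    for line in json_lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        name = record.get("name") if isinstance(record, dict) else None
        if isinstance(name, str) and name:
            # Hostnames are case-insensitive; also drop any trailing root dot.
            names.add(name.lower().rstrip("."))
    return sorted(names)
```

Using a set and a final `sorted()` is the in-memory equivalent of `sort | uniq`, which is fine for extracts of this size.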
20:21
π
|
JAA |
balrog: Hmm, that's annoying. I really need a properly reproducible example of these things. |
20:22
π
|
kiska |
JAA: How did you get the data from my fdns stuff with the tian crawl we did earlier? |
20:23
π
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
20:23
π
|
kiska |
twoTBHetz The ahol.es data: https://pastebin.com/TJP5WyHg |
20:26
π
|
|
Sk1d has joined #archiveteam-bs |
20:26
π
|
kiska |
The besaba data: https://pastebin.com/ByxUWzLH |
20:26
π
|
kiska |
Have fun! |
20:30
π
|
twoTBHetz |
I am having fun i have collected more 16mb domains than you |
20:31
π
|
kiska |
D |
20:31
π
|
kiska |
xD* |
20:31
π
|
twoTBHetz |
i will merge them for profit |
20:32
π
|
kiska |
The dataset I am using isn't up to date, and the other data set I have access to, is well 4x as large and my server is currently out of disk space |
20:33
π
|
twoTBHetz |
i parsed the json wrong ... |
20:33
π
|
twoTBHetz |
I got enough diskspace for the dataset where would i get it? |
20:35
π
|
twoTBHetz |
mhh i parsed the json wrong ... i will see if i was really better |
20:36
π
|
kiska |
You don't, its my university dns server |
20:36
π
|
twoTBHetz |
ah ... and since you are not in my time zone ... |
20:38
π
|
kiska |
And besides the data file is >1TiB |
20:39
π
|
twoTBHetz |
i got <3TB |
20:40
π
|
twoTBHetz |
but yeah transfer would not be fun |
20:40
π
|
kiska |
Here is the 890m.com data: https://pastebin.com/WXbDJLq0 |
20:40
π
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
20:45
π
|
|
Sk1d has joined #archiveteam-bs |
20:45
π
|
betamax |
there is the question of what happens after discovery |
20:45
π
|
betamax |
I do think that provided that there is a significant number of URLs, an archivebot job would be best |
20:45
π
|
twoTBHetz |
betamax you mean with hostinger? |
20:45
π
|
betamax |
*warrior job |
20:46
π
|
betamax |
not archivebot |
20:46
π
|
twoTBHetz |
whatever it is it needs a crawler/spider element to it |
20:46
π
|
betamax |
but that is more the speciality of arkiver ... (who is awesome at writing the scripts used by seesaw) |
20:46
π
|
betamax |
cc arkiver ^^ |
20:47
π
|
kiska |
twoTBHetz this is the dataset: https://ant.isi.edu/datasets/ if you want to request it, by all means request it |
20:47
π
|
kiska |
There is a 99% chance they'll decline your request |
20:48
π
|
twoTBHetz |
nicee stuff |
20:49
π
|
kiska |
And here is the programme: https://www.impactcybertrust.org/ |
20:54
π
|
|
Mateon1 has quit IRC (Ping timeout: 492 seconds) |
20:54
π
|
|
Mateon1 has joined #archiveteam-bs |
20:55
π
|
twoTBHetz |
nice ... |
20:57
π
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
21:01
π
|
|
Sk1d has joined #archiveteam-bs |
21:12
π
|
twoTBHetz |
kiska your set contains stuff which i have not found |
21:12
π
|
kiska |
Yay! |
21:13
π
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
21:14
π
|
kiska |
A rescan with the new data set on 16mb.com: https://pastebin.com/cTsYR7rb |
21:16
π
|
twoTBHetz |
of course ;) |
21:17
π
|
twoTBHetz |
now that i finished preprocessing 16mb.com |
21:17
π
|
twoTBHetz |
do you know why it is called 16mb.com |
21:17
π
|
kiska |
890m.com data reprocessed: https://pastebin.com/JrMwsaMx |
21:18
π
|
|
Sk1d has joined #archiveteam-bs |
21:18
π
|
twoTBHetz |
because the smallest possible drupal host runs in just under 16 MB, and that's how much RAM you get with hostinger free hosting |
21:24
π
|
twoTBHetz |
kiska any ETAs on more files? |
21:25
π
|
kiska |
the 96.lt is grabbing more than I expected and its grepping useless data from not hostinger |
21:26
π
|
twoTBHetz |
you had some false positives in 16mb.com too, namely 016mb.com |
21:26
π
|
kiska |
Yes I am refining my regex |
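One way to avoid false positives like 016mb.com without a regex at all is to require a dot immediately before the base domain. The sketch below is only an illustration: the function name is hypothetical, and the domain tuple is an assumption assembled from the lists posted earlier in this log (with hol.es and zz.vc in place of the misspelled ahol.es and azz.vc), not a complete list of Hostinger domains.

```python
# Assumption: base domains taken from the lists posted earlier in this log.
BASE_DOMAINS = (
    "16mb.com", "890m.com", "96.lt", "besaba.com",
    "esy.es", "hol.es", "pe.hu", "zz.vc",
)


def matching_base(hostname, base_domains=BASE_DOMAINS):
    """Return the base domain that hostname is a subdomain of, or None."""
    host = hostname.lower().rstrip(".")
    for base in base_domains:
        # The leading dot means "016mb.com" does not count as under "16mb.com".
        if host.endswith("." + base):
            return base
    return None
```

The bare base domain itself (for example "16mb.com") is deliberately not reported, since only user subdomains are of interest here.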
21:27
π
|
kiska |
How would you like 100MB of json data? |
21:27
π
|
twoTBHetz |
no major problem |
21:27
π
|
twoTBHetz |
but i might stop working on my netbook then |
21:28
π
|
kiska |
... |
21:28
π
|
kiska |
I may have just crashed pastebin |
21:29
π
|
twoTBHetz |
how long you plan to stay awake? |
21:29
π
|
kiska |
Hahaha "413 Request Entity Too Large" |
21:29
π
|
kiska |
Size is 140895KiB |
21:30
π
|
kiska |
There is 1.3m lines... |
21:31
π
|
twoTBHetz |
use another way of transportation |
21:34
π
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
21:34
π
|
twoTBHetz |
what is in the 100MB json? |
21:35
π
|
kiska |
Those custom domains |
21:35
π
|
kiska |
https://pastebin.com/cieAFEd3 |
21:35
π
|
kiska |
part 1 ^ |
21:36
π
|
twoTBHetz |
of ? |
21:37
π
|
|
Sk1d has joined #archiveteam-bs |
21:38
π
|
kiska |
https://transfer.sh/C4pJL/hostinger-any-data |
21:38
π
|
kiska |
Here is all of the data in 1 big download xD |
21:38
π
|
kiska |
Have fun! |
21:38
π
|
twoTBHetz |
awesome |
21:38
π
|
twoTBHetz |
what about asy.es and stuff is it in there too? |
21:39
π
|
kiska |
probably |
21:40
π
|
twoTBHetz |
mhh i will need to verify that those are ok |
21:40
π
|
twoTBHetz |
scrolling over does not help here anymore |
21:40
π
|
twoTBHetz |
i assume thats everything? |
21:40
π
|
kiska |
Most likely |
21:40
π
|
kiska |
I used hostinger as my grep |
21:41
π
|
kiska |
Notepad++ is hanging |
21:41
π
|
kiska |
xD |
21:41
π
|
twoTBHetz |
i never managed that |
21:42
π
|
kiska |
Looks like it doesn't have any of the ahol.es stuff |
21:43
π
|
twoTBHetz |
mhh i will sort it out. nobody will miss some hostinger sites |
21:43
π
|
kiska |
Can you give me a sample ahol.es site? My ahol.es grep didn't find any either |
21:44
π
|
twoTBHetz |
please |
21:44
π
|
kiska |
Oh... there is no a in ahol.es, its just hol.es |
21:45
π
|
twoTBHetz |
oh my bad |
21:45
π
|
twoTBHetz |
check if it is in hostinger now |
21:46
π
|
twoTBHetz |
weird i always read ahol.es ... |
21:46
π
|
kiska |
Yep there is a ton |
21:46
π
|
twoTBHetz |
good |
21:46
π
|
twoTBHetz |
that should be helpful |
21:47
π
|
kiska |
I am guessing that "azz.vc" doesn't exist? |
21:47
π
|
kiska |
And its zz.vc? |
21:49
π
|
twoTBHetz |
right zz.vc |
21:49
π
|
twoTBHetz |
no idea how that happened |
21:50
π
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
21:50
π
|
kiska |
Here is the besaba data: https://pastebin.com/67jLzDjn |
21:51
π
|
twoTBHetz |
isn't that in hostinger too? |
21:52
π
|
kiska |
Probably |
21:52
π
|
kiska |
Its good to get it anyway |
21:53
π
|
twoTBHetz |
i am so glad we are in "data set fits in ram" terrain |
21:53
π
|
kiska |
doing zcat on the data, it definitely DOES NOT fit in RAM |
21:53
π
|
twoTBHetz |
200 MB should fit in RAM |
21:53
π
|
kiska |
Compressed the dataset is 26219779386 bytes |
21:54
π
|
kiska |
Oh you mean that dataset... |
21:54
π
|
kiska |
xD |
21:54
π
|
|
Sk1d has joined #archiveteam-bs |
21:58
π
|
twoTBHetz |
ahh i mean once we are down to hostinger |
21:58
π
|
twoTBHetz |
26 gig fit in ram too :P |
22:00
π
|
kiska |
Not when Archivebot is being run on the same machine xD |
22:00
π
|
kiska |
And certainly not when its uncompressed |
22:00
π
|
twoTBHetz |
the latter yes |
22:01
π
|
kiska |
Cause this dataset is ~24GiB compressed, ~500GiB uncompressed |
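Since the uncompressed data does not fit in RAM, the usual approach is to stream the gzip file line by line, much as zcat piped into grep does. A minimal sketch, assuming a gzip-compressed text file with one record per line; the function name and the plain substring match are illustrative choices, not what kiska actually ran.

```python
import gzip


def grep_gzip(path, needle):
    """Yield lines from a gzip-compressed text file that contain needle.

    The file is decompressed as a stream, so memory use stays small
    regardless of the uncompressed size.
    """
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if needle in line:
                yield line.rstrip("\n")
```

Because this is a generator, matches can be written out as they are found instead of being collected in a list first.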
22:01
π
|
twoTBHetz |
you told me |
22:07
π
|
twoTBHetz |
kiska anyway thanks a lot |
22:08
π
|
twoTBHetz |
JAA i got a lot of more links i can process will announce once done. |
22:13
π
|
kiska |
twoTBHetz: Here is the esy.es data: https://pastebin.com/SJ3VksPJ |
22:14
π
|
twoTBHetz |
kiska thanks |
22:15
π
|
kiska |
I've just seen some data I think that isn't included in the hostinger file |
22:15
π
|
twoTBHetz |
ok i will merge them all |
22:15
π
|
kiska |
I am grepping the data again to make sure my eyes aren't deceiving me |
22:15
π
|
twoTBHetz |
sort will cry :) |
22:16
π
|
kiska |
xD |
22:19
π
|
kiska |
Nah sort will be fine! |
22:24
π
|
|
m007a83_ has joined #archiveteam-bs |
22:25
π
|
|
m007a83 has quit IRC (Ping timeout: 252 seconds) |
22:27
π
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
22:30
π
|
|
Sk1d has joined #archiveteam-bs |
22:37
π
|
|
godane has quit IRC (Read error: Operation timed out) |
22:41
π
|
balrog |
JAA: did my PM go through? |
22:43
π
|
kiska |
I hate regex.... |
22:44
π
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
22:45
π
|
|
tuluu has quit IRC (Quit: No Ping reply in 180 seconds.) |
22:46
π
|
twoTBHetz |
i hate not working NATs |
22:46
π
|
|
Sk1d has joined #archiveteam-bs |
22:46
π
|
kiska |
xD |
22:48
π
|
kiska |
Unique data in hol.es with "{"timestamp":"1541149565","name":"zelluloid.chs.hol.es","type":"a","value":"198.252.107.62"}" |
22:48
π
|
kiska |
So here is hol.es: https://pastebin.com/xw3bvJ4A |
22:48
π
|
kiska |
Sort through it, I've given up on regex |
22:49
π
|
|
tuluu has joined #archiveteam-bs |
22:49
π
|
twoTBHetz |
what pre processing did you do and have now stopped doing? |
22:49
π
|
kiska |
I used grep .hol.es so it has removed most of the useless data |
22:49
π
|
twoTBHetz |
useful thanks |
22:50
π
|
kiska |
But it still gets stuff like "zuqiubisaijuesha.holmesmail.com" |
22:50
π
|
twoTBHetz |
howẞ |
22:50
π
|
twoTBHetz |
HOW? |
22:50
π
|
kiska |
Which is understandable since .hol.es matches the regex |
22:50
π
|
twoTBHetz |
is there no way to escape the . |
22:50
π
|
kiska |
I thought that ".\bhol.es" would remove it, but instead it includes it |
22:51
π
|
kiska |
Pretty sure I can do "\." to escape it, but going through the 24GiB file takes 20 mins to do |
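To illustrate the escaping point: in a regex an unescaped "." matches any character, so ".hol.es" also matches the ".holmes" in holmesmail.com, and adding "\b" does not change that. The sketch below uses Python's `re` module rather than grep, and the pattern and function names are hypothetical; it anchors the pattern to the end of the hostname as well as escaping the dots.

```python
import re

# Unescaped dots match any character, so both of these hit ".holmes".
LOOSE = re.compile(r".hol.es")
WITH_BOUNDARY = re.compile(r".\bhol.es")

# Escaped dots, anchored at the end; "hol" must start the name or follow a dot.
STRICT = re.compile(r"(?:^|\.)hol\.es$")


def is_hol_es(hostname):
    """Return True if hostname is hol.es or one of its subdomains."""
    return STRICT.search(hostname.lower().rstrip(".")) is not None
```

Note that `is_hol_es` is meant for a bare hostname (such as the "name" field of a record); matching against whole JSON lines would need the trailing `$` anchor replaced with something like a closing quote.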
22:51
π
|
kiska |
I started doing this at 7am my time(Australia, Sydney) |
22:51
π
|
kiska |
Its now almost 10am |
22:52
π
|
twoTBHetz |
get your well earned sleep |
22:53
π
|
|
godane has joined #archiveteam-bs |
22:58
π
|
kiska |
Well I fixed my regex... |
22:58
π
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
23:00
π
|
kiska |
My regex was correct. The command I used wasn't |
23:01
π
|
|
Sk1d has joined #archiveteam-bs |
23:03
π
|
|
BlueMax has joined #archiveteam-bs |
23:16
π
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
23:20
π
|
|
Sk1d has joined #archiveteam-bs |
23:33
π
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
23:36
π
|
|
pino_p has joined #archiveteam-bs |
23:37
π
|
|
Sk1d has joined #archiveteam-bs |
23:41
π
|
pino_p |
If I've found a hol.es subdomain I want to make sure is included, should I just suggest it here? |
23:44
π
|
pino_p |
http://famicompo.hol.es/ has a lot of music composed or arranged by Nintendo Entertainment System fans |
23:49
π
|
|
pino_p has quit IRC (Quit: Leaving) |
23:51
π
|
|
Sk1d has quit IRC (Read error: Operation timed out) |
23:56
π
|
|
Sk1d has joined #archiveteam-bs |
23:58
π
|
twoTBHetz |
pino_p i will take a look |